1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
39 ============ ======== ========= ===========
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
46 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
49 ``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
53 ``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
58 ``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
62 ``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
66 ``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
71 .. _amdgpu-processors:
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
86 ========== =========== ============ ===== ======= ==================
88 --------------------------------------------------------------------
94 --------------------------------------------------------------------
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
113 gfx600 - tahiti amdgcn dGPU
114 gfx601 - pitcairn amdgcn dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
124 \ - kaveri amdgcn APU - A6-7000
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
138 gfx702 dGPU ROCm - Radeon R9 290
142 gfx703 - kabini amdgcn APU - E1-2100
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
161 gfx801 - carrizo amdgcn APU - A6-8500P
167 \ amdgcn APU ROCm - A10-8700P
170 \ amdgcn APU - A10-9600P
176 \ amdgcn APU - E2-9010
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
185 - Radeon Instinct MI8
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
188 - Radeon Instinct MI6
189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
192 **GCN GFX9** [AMD-Vega]_
193 --------------------------------------------------------------------
194 gfx900 amdgcn dGPU - Radeon Vega
200 - Radeon Instinct MI25
201 gfx901 amdgcn dGPU ROCm Same as gfx900
204 gfx902 amdgcn APU *TBA*
209 gfx903 amdgcn APU Same as gfx902
212 ========== =========== ============ ===== ======= ==================
214 .. _amdgpu-address-spaces:
219 The AMDGPU backend uses the following address space mappings.
221 The memory space names used in the table, aside from the region memory space, is
222 from the OpenCL standard.
224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
255 This will shortly be the only address space mapping for all languages using
258 .. _amdgpu-memory-scopes:
263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
264 backend memory model when the target triple OS is ``amdhsa`` (see
265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
267 The memory model supported is based on the HSA memory model [HSA]_ which is
268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269 relation is transitive over the synchonizes-with relation independent of scope,
270 and synchonizes-with allows the memory scope instances to be inclusive (see
271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274 inclusion and requires the memory scopes to exactly match. However, this
275 is conservatively correct for OpenCL.
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
295 - ``wavefront`` and executed by a thread in the same
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
304 - ``system`` or ``agent`` and executed by a thread on the
306 - ``workgroup`` and executed by a thread in the same
308 - ``wavefront`` and executed by a thread in the same
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
340 The AMDGPU backend implements the following intrinsics.
342 *This section is WIP.*
345 List AMDGPU intrinsics
350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351 can be linked by ``lld`` to produce a standard ELF shared code object which can
352 be loaded and executed on an AMDGPU target.
357 The AMDGPU backend uses the following ELF header:
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
362 ========================== ===============================
364 ========================== ===============================
365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
377 ========================== ===============================
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
384 =============================== =====
386 =============================== =====
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
396 ``e_ident[EI_CLASS]``
397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
403 ``e_ident[EI_OSABI]``
404 One of the following AMD GPU architecture specific OS ABIs:
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
415 ``e_ident[EI_ABIVERSION]``
416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
429 Can be one of the following values:
433 The type produced by the AMD GPU backend compiler as it is relocatable code
437 The type produced by the linker as it is a shared code object.
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
452 The value is 0 as no flags are used.
457 An AMDGPU target ELF code object has the standard ELF sections which include:
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
462 ================== ================ =================================
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
483 These sections have their standard meanings (see [ELF]_) and are only generated
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
490 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
491 The standard sections used by a dynamic loader.
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
497 ``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
513 .. _amdgpu-note-records:
518 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520 minimal zero byte padding must be generated to ensure the ``desc`` field size is
521 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522 be at least 4 to indicate at least 8 byte alignment.
524 The AMDGPU backend code object uses the following ELF note records in the
525 ``.note`` section. The *Description* column specifies the layout of the note
526 record’s ``desc`` field. All fields are consecutive bytes. Note records with
527 variable size strings have a corresponding ``*_size`` field that specifies the
528 number of bytes, including the terminating null character, in the string. The
529 string(s) come immediately after the preceding fields.
531 Additional note records can be present.
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
548 ============================== =====
550 ============================== =====
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
556 ``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
564 The null terminated string has the following syntax:
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
577 The vendor from table :ref:`amdgpu-target-triples-table`.
579 For the AMDGPU backend this is always ``amd``.
582 The OS from table :ref:`amdgpu-target-triples-table`.
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
588 For the AMDGPU backend this is currently always blank.
590 The processor from table :ref:`amdgpu-processors-table`.
594 ``amdgcn-amd-amdhsa--gfx901``
596 ``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
608 Symbols include the following:
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
624 Global variables both used and defined by the compilation unit.
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
649 Every HSA kernel also has a symbol for its machine code entry point.
651 .. _amdgpu-relocation-records:
656 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657 relocatable fields are:
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
669 Following notations are used for specifying relocation calculations:
672 Represents the addend used to compute the value of the relocatable field.
675 Represents the offset into the global offset table at which the relocation
676 entry’s symbol will reside during execution.
679 Represents the address of the global offset table.
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
686 Represents the value of the symbol whose index resides in the relocation
689 The following relocation types are supported:
691 .. table:: AMDGPU ELF Relocation Records
692 :name: amdgpu-elf-relocation-records-table
694 ========================== ===== ========== ==============================
695 Relocation Type Value Field Calculation
696 ========================== ===== ========== ==============================
697 ``R_AMDGPU_NONE`` 0 *none* *none*
698 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
699 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
700 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
701 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
702 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
703 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
704 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
705 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
706 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
707 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
708 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
709 ========================== ===== ========== ==============================
716 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
717 information that maps the code object executable code and data to the source
718 language constructs. It can be used by tools such as debuggers and profilers.
720 Address Space Mapping
721 ~~~~~~~~~~~~~~~~~~~~~
723 The following address space mapping is used:
725 .. table:: AMDGPU DWARF Address Space Mapping
726 :name: amdgpu-dwarf-address-space-mapping-table
728 =================== =================
729 DWARF Address Space Memory Space
730 =================== =================
735 *omitted* Generic (Flat)
736 *not supported* Region (GDS)
737 =================== =================
739 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
742 An ``address_class`` attribute is generated on pointer type DIEs to specify the
743 DWARF address space of the value of the pointer when it is in the *private* or
744 *local* address space. Otherwise the attribute is omitted.
746 An ``XDEREF`` operation is generated in location list expressions for variables
747 that are allocated in the *private* and *local* address space. Otherwise no
748 ``XDREF`` is omitted.
753 *This section is WIP.*
756 Define DWARF register enumeration.
758 If want to present a wavefront state then should expose vector registers as
759 64 wide (rather than per work-item view that LLVM uses). Either as separate
760 registers, or a 64x4 byte single register. In either case use a new LANE op
761 (akin to XDREF) to select the current lane usage in a location
762 expression. This would also allow scalar register spilling to vector register
763 lanes to be expressed (currently no debug information is being generated for
764 spilling). If choose a wide single register approach then use LANE in
765 conjunction with PIECE operation to select the dword part of the register for
766 the current lane. If the separate register approach then use LANE to select
772 *This section is WIP.*
775 DWARF extension to include runtime generated source text.
777 .. _amdgpu-code-conventions:
782 This section provides code conventions used for each supported target triple OS
783 (see :ref:`amdgpu-target-triples`).
788 This section provides code conventions used when the target triple OS is
789 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
791 .. _amdgpu-amdhsa-hsa-code-object-metadata:
796 The code object metadata specifies extensible metadata associated with the code
797 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
798 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
799 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
800 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
801 information necessary to support the ROCM kernel queries. For example, the
802 segment sizes needed in a dispatch packet. In addition, a high level language
803 runtime may require other information to be included. For example, the AMD
804 OpenCL runtime records kernel argument information.
806 The metadata is specified as a YAML formatted string (see [YAML]_ and
810 Is the string null terminated? It probably should not if YAML allows it to
811 contain null characters, otherwise it should be.
813 The metadata is represented as a single YAML document comprised of the mapping
814 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
817 For boolean values, the string values of ``false`` and ``true`` are used for
818 false and true respectively.
820 Additional information can be added to the mappings. To avoid conflicts, any
821 non-AMD key names should be prefixed by "*vendor-name*.".
823 .. table:: AMDHSA Code Object Metadata Mapping
824 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
826 ========== ============== ========= =======================================
827 String Key Value Type Required? Description
828 ========== ============== ========= =======================================
829 "Version" sequence of Required - The first integer is the major
830 2 integers version. Currently 1.
831 - The second integer is the minor
832 version. Currently 0.
833 "Printf" sequence of Each string is encoded information
834 strings about a printf function call. The
835 encoded information is organized as
836 fields separated by colon (':'):
838 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
843 A 32 bit integer as a unique id for
844 each printf function call
847 A 32 bit integer equal to the number
848 of arguments of printf function call
851 ``S[i]`` (where i = 0, 1, ... , N-1)
852 32 bit integers for the size in bytes
853 of the i-th FormatString argument of
854 the printf function call
857 The format string passed to the
858 printf function call.
859 "Kernels" sequence of Required Sequence of the mappings for each
860 mapping kernel in the code object. See
861 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
862 for the definition of the mapping.
863 ========== ============== ========= =======================================
867 .. table:: AMDHSA Code Object Kernel Metadata Mapping
868 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
870 ================= ============== ========= ================================
871 String Key Value Type Required? Description
872 ================= ============== ========= ================================
873 "Name" string Required Source name of the kernel.
874 "SymbolName" string Required Name of the kernel
875 descriptor ELF symbol.
876 "Language" string Source language of the kernel.
884 "LanguageVersion" sequence of - The first integer is the major
886 - The second integer is the
888 "Attrs" mapping Mapping of kernel attributes.
890 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
891 for the mapping definition.
892 "Args" sequence of Sequence of mappings of the
893 mapping kernel arguments. See
894 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
895 for the definition of the mapping.
896 "CodeProps" mapping Mapping of properties related to
898 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
899 for the mapping definition.
900 "DebugProps" mapping Mapping of properties related to
901 the kernel debugging. See
902 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
903 for the mapping definition.
904 ================= ============== ========= ================================
908 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
909 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
911 =================== ============== ========= ==============================
912 String Key Value Type Required? Description
913 =================== ============== ========= ==============================
914 "ReqdWorkGroupSize" sequence of The dispatch work-group size
915 3 integers X, Y, Z must correspond to the
918 Corresponds to the OpenCL
919 ``reqd_work_group_size``
921 "WorkGroupSizeHint" sequence of The dispatch work-group size
922 3 integers X, Y, Z is likely to be the
925 Corresponds to the OpenCL
926 ``work_group_size_hint``
928 "VecTypeHint" string The name of a scalar or vector
931 Corresponds to the OpenCL
932 ``vec_type_hint`` attribute.
934 "RuntimeHandle" string The external symbol name
935 associated with a kernel.
936 OpenCL runtime allocates a
937 global buffer for the symbol
938 and saves the kernel's address
939 to it, which is used for
940 device side enqueueing. Only
941 available for device side
943 =================== ============== ========= ==============================
947 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
948 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
950 ================= ============== ========= ================================
951 String Key Value Type Required? Description
952 ================= ============== ========= ================================
953 "Name" string Kernel argument name.
954 "TypeName" string Kernel argument type name.
955 "Size" integer Required Kernel argument size in bytes.
956 "Align" integer Required Kernel argument alignment in
957 bytes. Must be a power of two.
958 "ValueKind" string Required Kernel argument kind that
959 specifies how to set up the
960 corresponding argument.
964 The argument is copied
965 directly into the kernarg.
968 A global address space pointer
969 to the buffer data is passed
972 "DynamicSharedPointer"
973 A group address space pointer
974 to dynamically allocated LDS
975 is passed in the kernarg.
978 A global address space
979 pointer to a S# is passed in
983 A global address space
984 pointer to a T# is passed in
988 A global address space pointer
989 to an OpenCL pipe is passed in
993 A global address space pointer
994 to an OpenCL device enqueue
995 queue is passed in the
998 "HiddenGlobalOffsetX"
999 The OpenCL grid dispatch
1000 global offset for the X
1001 dimension is passed in the
1004 "HiddenGlobalOffsetY"
1005 The OpenCL grid dispatch
1006 global offset for the Y
1007 dimension is passed in the
1010 "HiddenGlobalOffsetZ"
1011 The OpenCL grid dispatch
1012 global offset for the Z
1013 dimension is passed in the
1017 An argument that is not used
1018 by the kernel. Space needs to
1019 be left for it, but it does
1020 not need to be set up.
1022 "HiddenPrintfBuffer"
1023 A global address space pointer
1024 to the runtime printf buffer
1025 is passed in kernarg.
1027 "HiddenDefaultQueue"
1028 A global address space pointer
1029 to the OpenCL device enqueue
1030 queue that should be used by
1031 the kernel by default is
1032 passed in the kernarg.
1034 "HiddenCompletionAction"
1040 "ValueType" string Required Kernel argument value type. Only
1041 present if "ValueKind" is
1042 "ByValue". For vector data
1043 types, the value is for the
1044 element type. Values include:
1060 How can it be determined if a
1061 vector type, and what size
1063 "PointeeAlign" integer Alignment in bytes of pointee
1064 type for pointer type kernel
1065 argument. Must be a power
1066 of 2. Only present if
1068 "DynamicSharedPointer".
1069 "AddrSpaceQual" string Kernel argument address space
1070 qualifier. Only present if
1071 "ValueKind" is "GlobalBuffer" or
1072 "DynamicSharedPointer". Values
1083 Is GlobalBuffer only Global
1085 DynamicSharedPointer always
1086 Local? Can HCC allow Generic?
1087 How can Private or Region
1089 "AccQual" string Kernel argument access
1090 qualifier. Only present if
1091 "ValueKind" is "Image" or
1102 "ActualAccQual" string The actual memory accesses
1103 performed by the kernel on the
1104 kernel argument. Only present if
1105 "ValueKind" is "GlobalBuffer",
1106 "Image", or "Pipe". This may be
1107 more restrictive than indicated
1108 by "AccQual" to reflect what the
1109 kernel actual does. If not
1110 present then the runtime must
1111 assume what is implied by
1112 "AccQual" and "IsConst". Values
1119 "IsConst" boolean Indicates if the kernel argument
1120 is const qualified. Only present
1124 "IsRestrict" boolean Indicates if the kernel argument
1125 is restrict qualified. Only
1126 present if "ValueKind" is
1129 "IsVolatile" boolean Indicates if the kernel argument
1130 is volatile qualified. Only
1131 present if "ValueKind" is
1134 "IsPipe" boolean Indicates if the kernel argument
1135 is pipe qualified. Only present
1136 if "ValueKind" is "Pipe".
1139 Can GlobalBuffer be pipe
1141 ================= ============== ========= ================================
1145 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1146 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1148 ============================ ============== ========= =====================
1149 String Key Value Type Required? Description
1150 ============================ ============== ========= =====================
1151 "KernargSegmentSize" integer Required The size in bytes of
1153 that holds the values
1156 "GroupSegmentFixedSize" integer Required The amount of group
1160 bytes. This does not
1162 dynamically allocated
1163 group segment memory
1167 "PrivateSegmentFixedSize" integer Required The amount of fixed
1168 private address space
1169 memory required for a
1173 is 1 then additional
1175 to this value for the
1177 "KernargSegmentAlign" integer Required The maximum byte
1180 kernarg segment. Must
1182 "WavefrontSize" integer Required Wavefront size. Must
1184 "NumSGPRs" integer Number of scalar
1188 includes the special
1194 SGPR added if a trap
1200 "NumVGPRs" integer Number of vector
1204 "MaxFlatWorkgroupSize" integer Maximum flat
1207 kernel in work-items.
1208 "IsDynamicCallStack" boolean Indicates if the
1213 "IsXNACKEnabled" boolean Indicates if the
1217 ============================ ============== ========= =====================
1221 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1222 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1224 =================================== ============== ========= ==============
1225 String Key Value Type Required? Description
1226 =================================== ============== ========= ==============
1227 "DebuggerABIVersion" sequence of
1229 "ReservedNumVGPRs" integer
1230 "ReservedFirstVGPR" integer
1231 "PrivateSegmentBufferSGPR" integer
1232 "WavefrontPrivateSegmentOffsetSGPR" integer
1233 =================================== ============== ========= ==============
1236 Plan to remove the debug properties metadata.
1241 The HSA architected queuing language (AQL) defines a user space memory interface
1242 that can be used to control the dispatch of kernels, in an agent independent
1243 way. An agent can have zero or more AQL queues created for it using the ROCm
1244 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1245 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1246 mechanics and packet layouts.
1248 The packet processor of a kernel agent is responsible for detecting and
1249 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1250 packet processor is implemented by the hardware command processor (CP),
1251 asynchronous dispatch controller (ADC) and shader processor input controller
1254 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1255 mode driver to initialize and register the AQL queue with CP.
1257 To dispatch a kernel the following actions are performed. This can occur in the
1258 CPU host program, or from an HSA kernel executing on a GPU.
1260 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1261 executed is obtained.
1262 2. A pointer to the kernel descriptor (see
1263 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1264 obtained. It must be for a kernel that is contained in a code object that that
1265 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1267 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1268 for a memory region with the kernarg property for the kernel agent that will
1269 execute the kernel. It must be at least 16 byte aligned.
1270 4. Kernel argument values are assigned to the kernel argument memory
1271 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1272 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1273 memory in the same way constant memory is accessed. (Note that the HSA
1274 specification allows an implementation to copy the kernel argument contents to
1275 another location that is accessed by the kernel.)
1276 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1277 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1278 packet. The packet must be set up, and the final write must use an atomic
1279 store release to set the packet kind to ensure the packet contents are
1280 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1281 notify the kernel agent that the AQL queue has been updated. These rules, and
1282 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1283 System Architecture Specification* [HSA]_.
1284 6. A kernel dispatch packet includes information about the actual dispatch,
1285 such as grid and work-group size, together with information from the code
1286 object about the kernel, such as segment sizes. The ROCm runtime queries on
1287 the kernel symbol can be used to obtain the code object values which are
1288 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1289 7. CP executes micro-code and is responsible for detecting and setting up the
1290 GPU to execute the wavefronts of a kernel dispatch.
1291 8. CP ensures that when the a wavefront starts executing the kernel machine
1292 code, the scalar general purpose registers (SGPR) and vector general purpose
1293 registers (VGPR) are set up as required by the machine code. The required
1294 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1295 register state is defined in
1296 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1297 9. The prolog of the kernel machine code (see
1298 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1299 before continuing executing the machine code that corresponds to the kernel.
1300 10. When the kernel dispatch has completed execution, CP signals the completion
1301 signal specified in the kernel dispatch packet if not 0.
1303 .. _amdgpu-amdhsa-memory-spaces:
1308 The memory space properties are:
1310 .. table:: AMDHSA Memory Spaces
1311 :name: amdgpu-amdhsa-memory-spaces-table
1313 ================= =========== ======== ======= ==================
1314 Memory Space Name HSA Segment Hardware Address NULL Value
1316 ================= =========== ======== ======= ==================
1317 Private private scratch 32 0x00000000
1318 Local group LDS 32 0xFFFFFFFF
1319 Global global global 64 0x0000000000000000
1320 Constant constant *same as 64 0x0000000000000000
1322 Generic flat flat 64 0x0000000000000000
1323 Region N/A GDS 32 *not implemented
1325 ================= =========== ======== ======= ==================
1327 The global and constant memory spaces both use global virtual addresses, which
1328 are the same virtual address space used by the CPU. However, some virtual
1329 addresses may only be accessible to the CPU, some only accessible by the GPU,
1332 Using the constant memory space indicates that the data will not change during
1333 the execution of the kernel. This allows scalar read instructions to be
1334 used. The vector and scalar L1 caches are invalidated of volatile data before
1335 each kernel dispatch execution to allow constant memory to change values between
1338 The local memory space uses the hardware Local Data Store (LDS) which is
1339 automatically allocated when the hardware creates work-groups of wavefronts, and
1340 freed when all the wavefronts of a work-group have terminated. The data store
1341 (DS) instructions can be used to access it.
1343 The private memory space uses the hardware scratch memory support. If the kernel
1344 uses scratch, then the hardware allocates memory that is accessed using
1345 wavefront lane dword (4 byte) interleaving. The mapping used from private
1346 address to physical address is:
1348 ``wavefront-scratch-base +
1349 (private-address * wavefront-size * 4) +
1350 (wavefront-lane-id * 4)``
1352 There are different ways that the wavefront scratch base address is determined
1353 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1354 memory can be accessed in an interleaved manner using buffer instruction with
1355 the scratch buffer descriptor and per wave scratch offset, by the scratch
1356 instructions, or by flat instructions. If each lane of a wavefront accesses the
1357 same private address, the interleaving results in adjacent dwords being accessed
1358 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1359 supported except by flat and scratch instructions in GFX9.
1361 The generic address space uses the hardware flat address support available in
1362 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1363 local appertures), that are outside the range of addressible global memory, to
1364 map from a flat address to a private or local address.
1366 FLAT instructions can take a flat address and access global, private (scratch)
1367 and group (LDS) memory depending in if the address is within one of the
1368 apperture ranges. Flat access to scratch requires hardware aperture setup and
1369 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1370 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1371 (see :ref:`amdgpu-amdhsa-m0`).
1373 To convert between a segment address and a flat address the base address of the
1374 appertures address can be used. For GFX7-GFX8 these are available in the
1375 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1376 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1377 GFX9 the appature base addresses are directly available as inline constant
1378 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1379 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1380 which makes it easier to convert from flat to segment or segment to flat.
1385 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1386 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1387 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1388 enumeration values for the queries that are not trivially deducible from the S#
1394 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1395 structure allocated in memory accessible from both the CPU and GPU. The
1396 structure is defined by the ROCm runtime and subject to change between releases
1397 (see [AMD-ROCm-github]_).
1399 .. _amdgpu-amdhsa-hsa-aql-queue:
1404 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1405 between releases (see [AMD-ROCm-github]_). For some processors it contains
1406 fields needed to implement certain language features such as the flat address
1407 aperture bases. It also contains fields used by CP such as managing the
1408 allocation of scratch memory.
1410 .. _amdgpu-amdhsa-kernel-descriptor:
1415 A kernel descriptor consists of the information needed by CP to initiate the
1416 execution of a kernel, including the entry point address of the machine code
1417 that implements the kernel.
1419 Kernel Descriptor for GFX6-GFX9
1420 +++++++++++++++++++++++++++++++
1422 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1424 .. table:: Kernel Descriptor for GFX6-GFX9
1425 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1427 ======= ======= =============================== ===========================
1428 Bits Size Field Name Description
1429 ======= ======= =============================== ===========================
1430 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1431 address space memory
1432 required for a work-group
1433 in bytes. This does not
1434 include any dynamically
1435 allocated local address
1436 space memory that may be
1437 added when the kernel is
1439 63:32 4 bytes private_segment_fixed_size The amount of fixed
1440 private address space
1441 memory required for a
1442 work-item in bytes. If
1443 is_dynamic_callstack is 1
1444 then additional space must
1445 be added to this value for
1447 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1448 size supported by the
1449 kernel in work-items.
1450 96 1 bit is_dynamic_call_stack Indicates if the generated
1451 machine code is using a
1452 dynamically sized call
1454 97 1 bit is_xnack_enabled Indicates if the generated
1455 machine code is capable of
1457 127:98 30 bits Reserved. Must be 0.
1458 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1461 descriptor to kernel's
1462 entry point instruction
1463 which must be 256 byte
1465 383:192 24 Reserved. Must be 0.
1467 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1468 program settings used by
1470 ``COMPUTE_PGM_RSRC1``
1473 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1474 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1475 program settings used by
1477 ``COMPUTE_PGM_RSRC2``
1480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1481 448 1 bit enable_sgpr_private_segment Enable the setup of the
1482 _buffer SGPR user data registers
1484 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1486 The total number of SGPR
1488 requested must not exceed
1489 16 and match value in
1490 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1491 Any requests beyond 16
1493 449 1 bit enable_sgpr_dispatch_ptr *see above*
1494 450 1 bit enable_sgpr_queue_ptr *see above*
1495 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1496 452 1 bit enable_sgpr_dispatch_id *see above*
1497 453 1 bit enable_sgpr_flat_scratch_init *see above*
1498 454 1 bit enable_sgpr_private_segment *see above*
1500 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1501 _count_X should always be 0.
1502 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1503 _count_Y should always be 0.
1504 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1505 _count_Z should always be 0.
1506 463:458 6 bits Reserved. Must be 0.
1507 511:464 4 Reserved. Must be 0.
1509 512 **Total size 64 bytes.**
1510 ======= ===================================================================
1514 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1515 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1517 ======= ======= =============================== ===========================================================================
1518 Bits Size Field Name Description
1519 ======= ======= =============================== ===========================================================================
1520 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1521 used by each work-item,
1522 granularity is device
1526 roundup((max-vgpg + 1)
1529 Used by CP to set up
1530 ``COMPUTE_PGM_RSRC1.VGPRS``.
1531 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1532 used by a wavefront,
1533 granularity is device
1537 roundup((max-sgpg + 1)
1540 roundup((max-sgpg + 1)
1543 Includes the special SGPRs
1544 for VCC, Flat Scratch (for
1545 GFX7 onwards) and XNACK
1546 (for GFX8 onwards). It does
1547 not include the 16 SGPR
1548 added if a trap handler is
1551 Used by CP to set up
1552 ``COMPUTE_PGM_RSRC1.SGPRS``.
1553 11:10 2 bits priority Must be 0.
1555 Start executing wavefront
1556 at the specified priority.
1558 CP is responsible for
1560 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1561 13:12 2 bits float_mode_round_32 Wavefront starts execution
1562 with specified rounding
1565 precision floating point
1568 Floating point rounding
1569 mode values are defined in
1570 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1572 Used by CP to set up
1573 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1574 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1575 with specified rounding
1576 denorm mode for half/double (16
1577 and 64 bit) floating point
1578 precision floating point
1581 Floating point rounding
1582 mode values are defined in
1583 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1585 Used by CP to set up
1586 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1587 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1588 with specified denorm mode
1591 precision floating point
1594 Floating point denorm mode
1595 values are defined in
1596 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1598 Used by CP to set up
1599 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1600 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1601 with specified denorm mode
1603 and 64 bit) floating point
1604 precision floating point
1607 Floating point denorm mode
1608 values are defined in
1609 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1611 Used by CP to set up
1612 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1613 20 1 bit priv Must be 0.
1615 Start executing wavefront
1616 in privilege trap handler
1619 CP is responsible for
1621 ``COMPUTE_PGM_RSRC1.PRIV``.
1622 21 1 bit enable_dx10_clamp Wavefront starts execution
1623 with DX10 clamp mode
1624 enabled. Used by the vector
1625 ALU to force DX-10 style
1626 treatment of NaN's (when
1627 set, clamp NaN to zero,
1631 Used by CP to set up
1632 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1633 22 1 bit debug_mode Must be 0.
1635 Start executing wavefront
1636 in single step mode.
1638 CP is responsible for
1640 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1641 23 1 bit enable_ieee_mode Wavefront starts execution
1643 enabled. Floating point
1644 opcodes that support
1645 exception flag gathering
1646 will quiet and propagate
1647 signaling-NaN inputs per
1648 IEEE 754-2008. Min_dx10 and
1649 max_dx10 become IEEE
1650 754-2008 compliant due to
1651 signaling-NaN propagation
1654 Used by CP to set up
1655 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1656 24 1 bit bulky Must be 0.
1658 Only one work-group allowed
1659 to execute on a compute
1662 CP is responsible for
1664 ``COMPUTE_PGM_RSRC1.BULKY``.
1665 25 1 bit cdbg_user Must be 0.
1667 Flag that can be used to
1668 control debugging code.
1670 CP is responsible for
1672 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1673 31:26 6 bits Reserved. Must be 0.
1674 32 **Total size 4 bytes**
1675 ======= ===================================================================================================================
1679 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1680 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1682 ======= ======= =============================== ===========================================================================
1683 Bits Size Field Name Description
1684 ======= ======= =============================== ===========================================================================
1685 0 1 bit enable_sgpr_private_segment Enable the setup of the
1686 _wave_offset SGPR wave scratch offset
1687 system register (see
1688 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1690 Used by CP to set up
1691 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1692 5:1 5 bits user_sgpr_count The total number of SGPR
1694 requested. This number must
1695 match the number of user
1696 data registers enabled.
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1700 6 1 bit enable_trap_handler Set to 1 if code contains a
1701 TRAP instruction which
1702 requires a trap handler to
1706 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1708 installed a trap handler
1709 regardless of the setting
1711 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1712 system SGPR register for
1713 the work-group id in the X
1715 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1717 Used by CP to set up
1718 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1719 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1720 system SGPR register for
1721 the work-group id in the Y
1723 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1725 Used by CP to set up
1726 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1727 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1728 system SGPR register for
1729 the work-group id in the Z
1731 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1733 Used by CP to set up
1734 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1735 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1736 system SGPR register for
1737 work-group information (see
1738 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1740 Used by CP to set up
1741 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1742 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1743 VGPR system registers used
1744 for the work-item ID.
1745 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1748 Used by CP to set up
1749 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1750 13 1 bit enable_exception_address_watch Must be 0.
1752 Wavefront starts execution
1754 exceptions enabled which
1755 are generated when L1 has
1756 witnessed a thread access
1760 CP is responsible for
1761 filling in the address
1763 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1764 according to what the
1766 14 1 bit enable_exception_memory Must be 0.
1768 Wavefront starts execution
1769 with memory violation
1770 exceptions exceptions
1771 enabled which are generated
1772 when a memory violation has
1773 occurred for this wave from
1775 (write-to-read-only-memory,
1776 mis-aligned atomic, LDS
1777 address out of range,
1778 illegal address, etc.).
1782 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1783 according to what the
1785 23:15 9 bits granulated_lds_size Must be 0.
1787 CP uses the rounded value
1788 from the dispatch packet,
1789 not this value, as the
1790 dispatch may contain
1791 dynamically allocated group
1792 segment memory. CP writes
1794 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1796 Amount of group segment
1797 (LDS) to allocate for each
1798 work-group. Granularity is
1802 roundup(lds-size / (64 * 4))
1804 roundup(lds-size / (128 * 4))
1806 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1807 _invalid_operation with specified exceptions
1810 Used by CP to set up
1811 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1812 (set from bits 0..6).
1816 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1817 _source input operands is a
1819 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1820 _division_by_zero Zero
1821 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1823 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1825 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1827 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1828 _zero (rcp_iflag_f32 instruction
1830 31 1 bit Reserved. Must be 0.
1831 32 **Total size 4 bytes.**
1832 ======= ===================================================================================================================
1836 .. table:: Floating Point Rounding Mode Enumeration Values
1837 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1839 ===================================== ===== ===============================
1840 Enumeration Name Value Description
1841 ===================================== ===== ===============================
1842 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1843 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1844 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1845 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1846 ===================================== ===== ===============================
1850 .. table:: Floating Point Denorm Mode Enumeration Values
1851 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1853 ===================================== ===== ===============================
1854 Enumeration Name Value Description
1855 ===================================== ===== ===============================
1856 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1858 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1859 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1860 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1861 ===================================== ===== ===============================
1865 .. table:: System VGPR Work-Item ID Enumeration Values
1866 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1868 ===================================== ===== ===============================
1869 Enumeration Name Value Description
1870 ===================================== ===== ===============================
1871 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1872 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1874 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1876 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1877 ===================================== ===== ===============================
1879 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1881 Initial Kernel Execution State
1882 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1884 This section defines the register state that will be set up by the packet
1885 processor prior to the start of execution of every wavefront. This is limited by
1886 the constraints of the hardware controllers of CP/ADC/SPI.
1888 The order of the SGPR registers is defined, but the compiler can specify which
1889 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1890 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1891 for enabled registers are dense starting at SGPR0: the first enabled register is
1892 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1895 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1896 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1897 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1898 initialized. These are then immediately followed by the System SGPRs that are
1899 set up by ADC/SPI and can have different values for each wave of the grid
1902 SGPR register initial state is defined in
1903 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1905 .. table:: SGPR Register Set Up Order
1906 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1908 ========== ========================== ====== ==============================
1909 SGPR Order Name Number Description
1910 (kernel descriptor enable of
1912 ========== ========================== ====== ==============================
1913 First Private Segment Buffer 4 V# that can be used, together
1914 (enable_sgpr_private with Scratch Wave Offset as an
1915 _segment_buffer) offset, to access the private
1916 memory space using a segment
1919 CP uses the value provided by
1921 then Dispatch Ptr 2 64 bit address of AQL dispatch
1922 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1924 then Queue Ptr 2 64 bit address of amd_queue_t
1925 (enable_sgpr_queue_ptr) object for AQL queue on which
1926 the dispatch packet was
1928 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1929 (enable_sgpr_kernarg segment. This is directly
1930 _segment_ptr) copied from the
1931 kernarg_address in the kernel
1934 Having CP load it once avoids
1935 loading it at the beginning of
1937 then Dispatch Id 2 64 bit Dispatch ID of the
1938 (enable_sgpr_dispatch_id) dispatch packet being
1940 then Flat Scratch Init 2 This is 2 SGPRs:
1941 (enable_sgpr_flat_scratch
1945 The first SGPR is a 32 bit
1947 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1948 to per SPI base of memory
1949 for scratch for the queue
1950 executing the kernel
1951 dispatch. CP obtains this
1952 from the runtime. (The
1953 Scratch Segment Buffer base
1955 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1956 plus this offset.) The value
1957 of Scratch Wave Offset must
1958 be added to this offset by
1959 the kernel machine code,
1960 right shifted by 8, and
1961 moved to the FLAT_SCRATCH_HI
1963 FLAT_SCRATCH_HI corresponds
1964 to SGPRn-4 on GFX7, and
1965 SGPRn-6 on GFX8 (where SGPRn
1966 is the highest numbered SGPR
1967 allocated to the wave).
1969 multiplied by 256 (as it is
1970 in units of 256 bytes) and
1972 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1973 to calculate the per wave
1974 FLAT SCRATCH BASE in flat
1975 memory instructions that
1979 The second SGPR is 32 bit
1980 byte size of a single
1981 work-item’s scratch memory
1982 usage. CP obtains this from
1983 the runtime, and it is
1984 always a multiple of DWORD.
1985 CP checks that the value in
1986 the kernel dispatch packet
1987 Private Segment Byte Size is
1988 not larger, and requests the
1989 runtime to increase the
1990 queue's scratch size if
1991 necessary. The kernel code
1993 FLAT_SCRATCH_LO which is
1994 SGPRn-3 on GFX7 and SGPRn-5
1995 on GFX8. FLAT_SCRATCH_LO is
1996 used as the FLAT SCRATCH
1998 instructions. Having CP load
1999 it once avoids loading it at
2000 the beginning of every
2001 wavefront. GFX9 This is the
2002 64 bit base address of the
2003 per SPI scratch backing
2004 memory managed by SPI for
2005 the queue executing the
2006 kernel dispatch. CP obtains
2007 this from the runtime (and
2008 divides it if there are
2009 multiple Shader Arrays each
2010 with its own SPI). The value
2011 of Scratch Wave Offset must
2012 be added by the kernel
2013 machine code and the result
2014 moved to the FLAT_SCRATCH
2015 SGPR which is SGPRn-6 and
2016 SGPRn-5. It is used as the
2017 FLAT SCRATCH BASE in flat
2018 memory instructions. then
2019 Private Segment Size 1 The
2020 32 bit byte size of a
2021 (enable_sgpr_private single
2023 scratch_segment_size) memory
2024 allocation. This is the
2025 value from the kernel
2026 dispatch packet Private
2027 Segment Byte Size rounded up
2028 by CP to a multiple of
2031 Having CP load it once avoids
2032 loading it at the beginning of
2035 This is not used for
2036 GFX7-GFX8 since it is the same
2037 value as the second SGPR of
2038 Flat Scratch Init. However, it
2039 may be needed for GFX9 which
2040 changes the meaning of the
2041 Flat Scratch Init value.
2042 then Grid Work-Group Count X 1 32 bit count of the number of
2043 (enable_sgpr_grid work-groups in the X dimension
2044 _workgroup_count_X) for the grid being
2045 executed. Computed from the
2046 fields in the kernel dispatch
2047 packet as ((grid_size.x +
2048 workgroup_size.x - 1) /
2050 then Grid Work-Group Count Y 1 32 bit count of the number of
2051 (enable_sgpr_grid work-groups in the Y dimension
2052 _workgroup_count_Y && for the grid being
2053 less than 16 previous executed. Computed from the
2054 SGPRs) fields in the kernel dispatch
2055 packet as ((grid_size.y +
2056 workgroup_size.y - 1) /
2059 Only initialized if <16
2060 previous SGPRs initialized.
2061 then Grid Work-Group Count Z 1 32 bit count of the number of
2062 (enable_sgpr_grid work-groups in the Z dimension
2063 _workgroup_count_Z && for the grid being
2064 less than 16 previous executed. Computed from the
2065 SGPRs) fields in the kernel dispatch
2066 packet as ((grid_size.z +
2067 workgroup_size.z - 1) /
2070 Only initialized if <16
2071 previous SGPRs initialized.
2072 then Work-Group Id X 1 32 bit work-group id in X
2073 (enable_sgpr_workgroup_id dimension of grid for
2075 then Work-Group Id Y 1 32 bit work-group id in Y
2076 (enable_sgpr_workgroup_id dimension of grid for
2078 then Work-Group Id Z 1 32 bit work-group id in Z
2079 (enable_sgpr_workgroup_id dimension of grid for
2081 then Work-Group Info 1 {first_wave, 14’b0000,
2082 (enable_sgpr_workgroup ordered_append_term[10:0],
2083 _info) threadgroup_size_in_waves[5:0]}
2084 then Scratch Wave Offset 1 32 bit byte offset from base
2085 (enable_sgpr_private of scratch base of queue
2086 _segment_wave_offset) executing the kernel
2087 dispatch. Must be used as an
2089 segment address when using
2090 Scratch Segment Buffer. It
2091 must be used to set up FLAT
2092 SCRATCH for flat addressing
2094 :ref:`amdgpu-amdhsa-flat-scratch`).
2095 ========== ========================== ====== ==============================
2097 The order of the VGPR registers is defined, but the compiler can specify which
2098 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2099 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2100 for enabled registers are dense starting at VGPR0: the first enabled register is
2101 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2104 VGPR register initial state is defined in
2105 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2107 .. table:: VGPR Register Set Up Order
2108 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2110 ========== ========================== ====== ==============================
2111 VGPR Order Name Number Description
2112 (kernel descriptor enable of
2114 ========== ========================== ====== ==============================
2115 First Work-Item Id X 1 32 bit work item id in X
2116 (Always initialized) dimension of work-group for
2118 then Work-Item Id Y 1 32 bit work item id in Y
2119 (enable_vgpr_workitem_id dimension of work-group for
2120 > 0) wavefront lane.
2121 then Work-Item Id Z 1 32 bit work item id in Z
2122 (enable_vgpr_workitem_id dimension of work-group for
2123 > 1) wavefront lane.
2124 ========== ========================== ====== ==============================
2126 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2128 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2130 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2131 combination including none.
2132 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2133 cannot included with the flat scratch init value which is per queue.
2134 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2137 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2138 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2140 The global segment can be accessed either using buffer instructions (GFX6 which
2141 has V# 64 bit address support), flat instructions (GFX7-9), or global
2142 instructions (GFX9).
2144 If buffer operations are used then the compiler can generate a V# with the
2145 following properties:
2149 * ATC: 1 if IOMMU present (such as APU)
2151 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2152 APU and NC for dGPU).
2154 .. _amdgpu-amdhsa-kernel-prolog:
2159 .. _amdgpu-amdhsa-m0:
2165 The M0 register must be initialized with a value at least the total LDS size
2166 if the kernel may access LDS via DS or flat operations. Total LDS size is
2167 available in dispatch packet. For M0, it is also possible to use maximum
2168 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2171 The M0 register is not used for range checking LDS accesses and so does not
2172 need to be initialized in the prolog.
2174 .. _amdgpu-amdhsa-flat-scratch:
2179 If the kernel may use flat operations to access scratch memory, the prolog code
2180 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2181 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2182 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2185 Flat scratch is not supported.
2188 1. The low word of Flat Scratch Init is 32 bit byte offset from
2189 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2190 being managed by SPI for the queue executing the kernel dispatch. This is
2191 the same value used in the Scratch Segment Buffer V# base address. The
2192 prolog must add the value of Scratch Wave Offset to get the wave's byte
2193 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2194 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2195 by 8 before moving into FLAT_SCRATCH_LO.
2196 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2197 work-items scratch memory usage. This is directly loaded from the kernel
2198 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2199 DWORD. Having CP load it once avoids loading it at the beginning of every
2200 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2203 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2204 memory being managed by SPI for the queue executing the kernel dispatch. The
2205 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2206 pair for use as the flat scratch base in flat memory instructions.
2208 .. _amdgpu-amdhsa-memory-model:
2213 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2214 (see :ref:`memmodel`). *The implementation is WIP.*
2217 Update when implementation complete.
2219 Support more relaxed OpenCL memory model to be controlled by environment
2220 component of target triple.
2222 The AMDGPU backend supports the memory synchronization scopes specified in
2223 :ref:`amdgpu-memory-scopes`.
2225 The code sequences used to implement the memory model are defined in table
2226 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2228 The sequences specify the order of instructions that a single thread must
2229 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2230 to other memory instructions executed by the same thread. This allows them to be
2231 moved earlier or later which can allow them to be combined with other instances
2232 of the same instruction, or hoisted/sunk out of loops to improve
2233 performance. Only the instructions related to the memory model are given;
2234 additional ``s_waitcnt`` instructions are required to ensure registers are
2235 defined before being used. These may be able to be combined with the memory
2236 model ``s_waitcnt`` instructions as described above.
2238 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2239 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2240 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2241 memory model which has separate happens-before relations for the global and
2242 local address spaces, and only a fence specifying both global and local address
2243 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2244 allow an address space to be specified the OpenCL fence has to convervatively
2245 assume both local and global address space was specified. However, optimizations
2246 can often be done to eliminate the additional ``s_waitcnt``instructions when
2247 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2248 instructions. The code sequences in the table indicate what can be omitted for
2249 the OpenCL memory. The target triple environment is used to determine if the
2250 source language is OpenCL (see :ref:`amdgpu-opencl`).
2252 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2255 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2256 termed vector memory operations.
2260 * Each agent has multiple compute units (CU).
2261 * Each CU has multiple SIMDs that execute wavefronts.
2262 * The wavefronts for a single work-group are executed in the same CU but may be
2263 executed by different SIMDs.
2264 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2266 * All LDS operations of a CU are performed as wavefront wide operations in a
2267 global order and involve no caching. Completion is reported to a wavefront in
2269 * The LDS memory has multiple request queues shared by the SIMDs of a
2270 CU. Therefore, the LDS operations performed by different waves of a work-group
2271 can be reordered relative to each other, which can result in reordering the
2272 visibility of vector memory operations with respect to LDS operations of other
2273 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2274 ensure synchronization between LDS operations and vector memory operations
2275 between waves of a work-group, but not between operations performed by the
2277 * The vector memory operations are performed as wavefront wide operations and
2278 completion is reported to a wavefront in execution order. The exception is
2279 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2280 vector memory order if they access LDS memory, and out of LDS operation order
2281 if they access global memory.
2282 * The vector memory operations access a vector L1 cache shared by all wavefronts
2283 on a CU. Therefore, no special action is required for coherence between
2284 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2285 coherence between waves executing in different work-groups as they may be
2286 executing on different CUs.
2287 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2288 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2289 scalar operations are used in a restricted way so do not impact the memory
2290 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2291 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2293 * The L2 cache has independent channels to service disjoint ranges of virtual
2295 * Each CU has a separate request queue per channel. Therefore, the vector and
2296 scalar memory operations performed by waves executing in different work-groups
2297 (which may be executing on different CUs) of an agent can be reordered
2298 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2299 synchronization between vector memory operations of different CUs. It ensures a
2300 previous vector memory operation has completed before executing a subsequent
2301 vector memory or LDS operation and so can be used to meet the requirements of
2302 acquire and release.
2303 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2304 of virtual addresses can be set up to bypass it to ensure system coherence.
2306 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2307 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2308 memory, atomic memory orderings are not meaningful and all accesses are treated
2311 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2312 scalar memory instructions). Since the constant address space contents do not
2313 change during the execution of a kernel dispatch it is not legal to perform
2314 stores, and atomic memory orderings are not meaningful and all access are
2315 treated as non-atomic.
2317 A memory synchronization scope wider than work-group is not meaningful for the
2318 group (LDS) address space and is treated as work-group.
2320 The memory model does not support the region address space which is treated as
2323 Acquire memory ordering is not meaningful on store atomic instructions and is
2324 treated as non-atomic.
2326 Release memory ordering is not meaningful on load atomic instructions and is
2327 treated a non-atomic.
2329 Acquire-release memory ordering is not meaningful on load or store atomic
2330 instructions and is treated as acquire and release respectively.
2332 AMDGPU backend only uses scalar memory operations to access memory that is
2333 proven to not change during the execution of the kernel dispatch. This includes
2334 constant address space and global address space for program scope const
2335 variables. Therefore the kernel machine code does not have to maintain the
2336 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2337 and vector L1 caches are invalidated between kernel dispatches by CP since
2338 constant address space data may change between kernel dispatch executions. See
2339 :ref:`amdgpu-amdhsa-memory-spaces`.
2341 The one execption is if scalar writes are used to spill SGPR registers. In this
2342 case the AMDGPU backend ensures the memory location used to spill is never
2343 accessed by vector memory operations at the same time. If scalar writes are used
2344 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2345 return since the locations may be used for vector memory instructions by a
2346 future wave that uses the same scratch area, or a function call that creates a
2347 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2348 as all scalar writes are write-before-read in the same thread.
2350 Scratch backing memory (which is used for the private address space) is accessed
2351 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2352 is only accessed by a single thread, and is always write-before-read,
2353 there is never a need to invalidate these entries from the L1 cache. Hence all
2354 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2357 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2358 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2359 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2360 coherent) and so the L2 cache will coherent with the CPU and other agents.
2362 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2363 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2365 ============ ============ ============== ========== =======================
2366 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2367 Ordering Sync Scope Address
2369 ============ ============ ============== ========== =======================
2371 ---------------------------------------------------------------------------
2372 load *none* *none* - global non-volatile
2373 - generic 1. buffer/global/flat_load
2375 1. buffer/global/flat_load
2377 load *none* *none* - local 1. ds_load
2378 store *none* *none* - global 1. buffer/global/flat_store
2380 store *none* *none* - local 1. ds_store
2381 **Unordered Atomic**
2382 ---------------------------------------------------------------------------
2383 load atomic unordered *any* *any* *Same as non-atomic*.
2384 store atomic unordered *any* *any* *Same as non-atomic*.
2385 atomicrmw unordered *any* *any* *Same as monotonic
2387 **Monotonic Atomic**
2388 ---------------------------------------------------------------------------
2389 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2390 - wavefront - generic
2392 load atomic monotonic - singlethread - local 1. ds_load
2395 load atomic monotonic - agent - global 1. buffer/global/flat_load
2396 - system - generic glc=1
2397 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2398 - wavefront - generic
2402 store atomic monotonic - singlethread - local 1. ds_store
2405 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2406 - wavefront - generic
2410 atomicrmw monotonic - singlethread - local 1. ds_atomic
2414 ---------------------------------------------------------------------------
2415 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2418 load atomic acquire - workgroup - global 1. buffer/global_load
2419 load atomic acquire - workgroup - local 1. ds/flat_load
2420 - generic 2. s_waitcnt lgkmcnt(0)
2424 - Must happen before
2437 load atomic acquire - agent - global 1. buffer/global_load
2439 2. s_waitcnt vmcnt(0)
2441 - Must happen before
2449 3. buffer_wbinvl1_vol
2451 - Must happen before
2461 load atomic acquire - agent - generic 1. flat_load glc=1
2462 - system 2. s_waitcnt vmcnt(0) &
2467 - Must happen before
2470 - Ensures the flat_load
2475 3. buffer_wbinvl1_vol
2477 - Must happen before
2487 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2490 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2491 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2492 - generic 2. waitcnt lgkmcnt(0)
2496 - Must happen before
2509 atomicrmw acquire - agent - global 1. buffer/global_atomic
2510 - system 2. s_waitcnt vmcnt(0)
2512 - Must happen before
2521 3. buffer_wbinvl1_vol
2523 - Must happen before
2533 atomicrmw acquire - agent - generic 1. flat_atomic
2534 - system 2. s_waitcnt vmcnt(0) &
2539 - Must happen before
2548 3. buffer_wbinvl1_vol
2550 - Must happen before
2560 fence acquire - singlethread *none* *none*
2562 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2593 fence-paired-atomic).
2594 - Must happen before
2605 fence-paired-atomic.
2607 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2622 - Could be split into
2631 - s_waitcnt vmcnt(0)
2642 fence-paired-atomic).
2643 - s_waitcnt lgkmcnt(0)
2654 fence-paired-atomic).
2655 - Must happen before
2669 fence-paired-atomic.
2671 2. buffer_wbinvl1_vol
2673 - Must happen before
2674 any following global/generic
2684 ---------------------------------------------------------------------------
2685 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2688 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2698 - Must happen before
2709 2. buffer/global/flat_store
2710 store atomic release - workgroup - local 1. ds_store
2711 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2712 - system - generic lgkmcnt(0)
2716 - Could be split into
2725 - s_waitcnt vmcnt(0)
2732 - s_waitcnt lgkmcnt(0)
2739 - Must happen before
2750 2. buffer/global/ds/flat_store
2751 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2754 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2764 - Must happen before
2775 2. buffer/global/flat_atomic
2776 atomicrmw release - workgroup - local 1. ds_atomic
2777 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2778 - system - generic lgkmcnt(0)
2782 - Could be split into
2791 - s_waitcnt vmcnt(0)
2798 - s_waitcnt lgkmcnt(0)
2805 - Must happen before
2816 2. buffer/global/ds/flat_atomic*
2817 fence release - singlethread *none* *none*
2819 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2839 - Must happen before
2848 fence-paired-atomic).
2855 fence-paired-atomic.
2857 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2872 - Could be split into
2881 - s_waitcnt vmcnt(0)
2888 - s_waitcnt lgkmcnt(0)
2895 - Must happen before
2904 fence-paired-atomic).
2911 fence-paired-atomic.
2913 **Acquire-Release Atomic**
2914 ---------------------------------------------------------------------------
2915 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2918 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2928 - Must happen before
2939 2. buffer/global_atomic
2940 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2941 2. s_waitcnt lgkmcnt(0)
2945 - Must happen before
2958 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2968 - Must happen before
2980 3. s_waitcnt lgkmcnt(0)
2984 - Must happen before
2996 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
3001 - Could be split into
3010 - s_waitcnt vmcnt(0)
3017 - s_waitcnt lgkmcnt(0)
3024 - Must happen before
3035 2. buffer/global_atomic
3036 3. s_waitcnt vmcnt(0)
3038 - Must happen before
3047 4. buffer_wbinvl1_vol
3049 - Must happen before
3059 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3064 - Could be split into
3073 - s_waitcnt vmcnt(0)
3080 - s_waitcnt lgkmcnt(0)
3087 - Must happen before
3099 3. s_waitcnt vmcnt(0) &
3104 - Must happen before
3113 4. buffer_wbinvl1_vol
3115 - Must happen before
3125 fence acq_rel - singlethread *none* *none*
3127 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3147 - Must happen before
3170 fence-paired-atomic)
3191 fence-paired-atomic).
3196 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3211 - Could be split into
3220 - s_waitcnt vmcnt(0)
3227 - s_waitcnt lgkmcnt(0)
3234 - Must happen before
3239 global/local/generic
3248 fence-paired-atomic)
3260 global/local/generic
3269 fence-paired-atomic).
3274 2. buffer_wbinvl1_vol
3276 - Must happen before
3290 **Sequential Consistent Atomic**
3291 ---------------------------------------------------------------------------
3292 load atomic seq_cst - singlethread - global *Same as corresponding
3293 - wavefront - local load atomic acquire*.
3294 - workgroup - generic
3295 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3297 - generic - Must happen after
3344 instructions same as
3348 store atomic seq_cst - singlethread - global *Same as corresponding
3349 - wavefront - local store atomic release*.
3350 - workgroup - generic
3351 store atomic seq_cst - agent - global *Same as corresponding
3352 - system - generic store atomic release*.
3353 atomicrmw seq_cst - singlethread - global *Same as corresponding
3354 - wavefront - local atomicrmw acq_rel*.
3355 - workgroup - generic
3356 atomicrmw seq_cst - agent - global *Same as corresponding
3357 - system - generic atomicrmw acq_rel*.
3358 fence seq_cst - singlethread *none* *Same as corresponding
3359 - wavefront fence acq_rel*.
3363 ============ ============ ============== ========== =======================
3365 The memory order also adds the single thread optimization constrains defined in
3367 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3369 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3370 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3372 ============ ==============================================================
3373 LLVM Memory Optimization Constraints
3375 ============ ==============================================================
3378 acquire - If a load atomic/atomicrmw then no following load/load
3379 atomic/store/ store atomic/atomicrmw/fence instruction can
3380 be moved before the acquire.
3381 - If a fence then same as load atomic, plus no preceding
3382 associated fence-paired-atomic can be moved after the fence.
3383 release - If a store atomic/atomicrmw then no preceding load/load
3384 atomic/store/ store atomic/atomicrmw/fence instruction can
3385 be moved after the release.
3386 - If a fence then same as store atomic, plus no following
3387 associated fence-paired-atomic can be moved before the
3389 acq_rel Same constraints as both acquire and release.
3390 seq_cst - If a load atomic then same constraints as acquire, plus no
3391 preceding sequentially consistent load atomic/store
3392 atomic/atomicrmw/fence instruction can be moved after the
3394 - If a store atomic then the same constraints as release, plus
3395 no following sequentially consistent load atomic/store
3396 atomic/atomicrmw/fence instruction can be moved before the
3398 - If an atomicrmw/fence then same constraints as acq_rel.
3399 ============ ==============================================================
3404 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3405 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3406 the ``s_trap`` instruction with the following usage:
3408 .. table:: AMDGPU Trap Handler for AMDHSA OS
3409 :name: amdgpu-trap-handler-for-amdhsa-os-table
3411 =================== =============== =============== =======================
3412 Usage Code Sequence Trap Handler Description
3414 =================== =============== =============== =======================
3415 reserved ``s_trap 0x00`` Reserved by hardware.
3416 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3417 ``queue_ptr`` ``debugtrap``
3418 ``VGPR0``: intrinsic (not
3419 ``arg`` implemented).
3420 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3421 ``queue_ptr`` terminated and its
3422 associated queue put
3423 into the error state.
3424 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3425 ``queue_ptr`` installed handled
3426 same as ``llvm.trap``.
3427 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3429 debugger ``s_trap 0x08`` Reserved for debugger.
3430 debugger ``s_trap 0xfe`` Reserved for debugger.
3431 debugger ``s_trap 0xff`` Reserved for debugger.
3432 =================== =============== =============== =======================
3437 This section provides code conventions used when the target triple OS is
3438 empty (see :ref:`amdgpu-target-triples`).
3443 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3444 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3445 instructions are handled as follows:
3447 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3448 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3450 =============== =============== ===========================================
3451 Usage Code Sequence Description
3452 =============== =============== ===========================================
3453 llvm.trap s_endpgm Causes wavefront to be terminated.
3454 llvm.debugtrap *none* Compiler warning given that there is no
3455 trap handler installed.
3456 =============== =============== ===========================================
3466 When generating code for the OpenCL language the target triple environment
3467 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3469 When the language is OpenCL the following differences occur:
3471 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3472 2. The AMDGPU backend adds additional arguments to the kernel.
3473 3. Additional metadata is generated
3474 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3477 Specify what affect this has. Hidden arguments added. Additional metadata
3485 When generating code for the OpenCL language the target triple environment
3486 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3488 When the language is OpenCL the following differences occur:
3490 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3493 Specify what affect this has.
3498 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3499 It supports AMDGCN GFX6-GFX8.
3501 This section describes general syntax for instructions and operands. For more
3502 information about instructions, their semantics and supported combinations of
3503 operands, refer to one of instruction set architecture manuals
3504 [AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3507 An instruction has the following syntax (register operands are normally
3508 comma-separated while extra operands are space-separated):
3510 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3515 The following syntax for register operands is supported:
3517 * SGPR registers: s0, ... or s[0], ...
3518 * VGPR registers: v0, ... or v[0], ...
3519 * TTMP registers: ttmp0, ... or ttmp[0], ...
3520 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3521 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3522 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3523 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3524 * Register index expressions: v[2*2], s[1-1:2-1]
3525 * 'off' indicates that an operand is not enabled
3527 The following extra operands are supported:
3529 * offset, offset0, offset1
3531 * glc, slc, tfe bits
3532 * waitcnt: integer or combination of counter values
3535 - abs (\| \|), neg (\-)
3539 - row_shl, row_shr, row_ror, row_rol
3540 - row_mirror, row_half_mirror, row_bcast
3541 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3542 - row_mask, bank_mask, bound_ctrl
3546 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3547 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3550 Instruction Examples
3551 ~~~~~~~~~~~~~~~~~~~~
3556 .. code-block:: nasm
3558 ds_add_u32 v2, v4 offset:16
3559 ds_write_src2_b64 v2 offset0:4 offset1:8
3560 ds_cmpst_f32 v2, v4, v6
3561 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3564 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3569 .. code-block:: nasm
3571 flat_load_dword v1, v[3:4]
3572 flat_store_dwordx3 v[3:4], v[5:7]
3573 flat_atomic_swap v1, v[3:4], v5 glc
3574 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3575 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3577 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3582 .. code-block:: nasm
3584 buffer_load_dword v1, off, s[4:7], s1
3585 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3586 buffer_store_format_xy v[1:2], off, s[4:7], s1
3588 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3590 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3595 .. code-block:: nasm
3597 s_load_dword s1, s[2:3], 0xfc
3598 s_load_dwordx8 s[8:15], s[2:3], s4
3599 s_load_dwordx16 s[88:103], s[2:3], s4
3603 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3608 .. code-block:: nasm
3611 s_mov_b64 s[0:1], 0x80000000
3613 s_wqm_b64 s[2:3], s[4:5]
3614 s_bcnt0_i32_b64 s1, s[2:3]
3615 s_swappc_b64 s[2:3], s[4:5]
3616 s_cbranch_join s[4:5]
3618 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3623 .. code-block:: nasm
3625 s_add_u32 s1, s2, s3
3626 s_and_b64 s[2:3], s[4:5], s[6:7]
3627 s_cselect_b32 s1, s2, s3
3628 s_andn2_b32 s2, s4, s6
3629 s_lshr_b64 s[2:3], s[4:5], s6
3630 s_ashr_i32 s2, s4, s6
3631 s_bfm_b64 s[2:3], s4, s6
3632 s_bfe_i64 s[2:3], s[4:5], s6
3633 s_cbranch_g_fork s[4:5], s[6:7]
3635 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3640 .. code-block:: nasm
3643 s_bitcmp1_b32 s1, s2
3644 s_bitcmp0_b64 s[2:3], s4
3647 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3652 .. code-block:: nasm
3657 s_waitcnt 0 ; Wait for all counters to be 0
3658 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3659 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3663 s_sendmsg sendmsg(MSG_INTERRUPT)
3666 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3668 Unless otherwise mentioned, little verification is performed on the operands
3669 of SOPP Instructions, so it is up to the programmer to be familiar with the
3670 range or acceptable values.
3675 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3676 the assembler will automatically use optimal encoding based on its operands.
3677 To force specific encoding, one can add a suffix to the opcode of the instruction:
3679 * _e32 for 32-bit VOP1/VOP2/VOPC
3680 * _e64 for 64-bit VOP3
3682 * _sdwa for VOP_SDWA
3684 VOP1/VOP2/VOP3/VOPC examples:
3686 .. code-block:: nasm
3689 v_mov_b32_e32 v1, v2
3691 v_cvt_f64_i32_e32 v[1:2], v2
3692 v_floor_f32_e32 v1, v2
3693 v_bfrev_b32_e32 v1, v2
3694 v_add_f32_e32 v1, v2, v3
3695 v_mul_i32_i24_e64 v1, v2, 3
3696 v_mul_i32_i24_e32 v1, -3, v3
3697 v_mul_i32_i24_e32 v1, -100, v3
3698 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3699 v_max_f16_e32 v1, v2, v3
3703 .. code-block:: nasm
3705 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3706 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3707 v_mov_b32 v0, v0 wave_shl:1
3708 v_mov_b32 v0, v0 row_mirror
3709 v_mov_b32 v0, v0 row_bcast:31
3710 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3711 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3712 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3716 .. code-block:: nasm
3718 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3719 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3720 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3721 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3722 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3724 For full list of supported instructions, refer to "Vector ALU instructions".
3726 HSA Code Object Directives
3727 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3729 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3730 one can specify them with assembler directives.
3732 .hsa_code_object_version major, minor
3733 +++++++++++++++++++++++++++++++++++++
3735 *major* and *minor* are integers that specify the version of the HSA code
3736 object that will be generated by the assembler.
3738 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3739 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3742 *major*, *minor*, and *stepping* are all integers that describe the instruction
3743 set architecture (ISA) version of the assembly program.
3745 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
3746 "AMD" and *arch* should always be equal to "AMDGPU".
3748 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3749 from the value of the -mcpu option that is passed to the assembler.
3751 .amdgpu_hsa_kernel (name)
3752 +++++++++++++++++++++++++
3754 This directives specifies that the symbol with given name is a kernel entry point
3755 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3760 This directive marks the beginning of a list of key / value pairs that are used
3761 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3762 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3763 any amd_kernel_code_t values that are unspecified a default value will be
3764 used. The default value for all keys is 0, with the following exceptions:
3766 - *kernel_code_version_major* defaults to 1.
3767 - *machine_kind* defaults to 1.
3768 - *machine_version_major*, *machine_version_minor*, and
3769 *machine_version_stepping* are derived from the value of the -mcpu option
3770 that is passed to the assembler.
3771 - *kernel_code_entry_byte_offset* defaults to 256.
3772 - *wavefront_size* defaults to 6.
3773 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3774 *private_segment_alignment* default to 4. Note that alignments are specified
3775 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3777 The *.amd_kernel_code_t* directive must be placed immediately after the
3778 function label and before any instructions.
3780 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3781 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3783 Here is an example of a minimal amd_kernel_code_t specification:
3785 .. code-block:: none
3787 .hsa_code_object_version 1,0
3788 .hsa_code_object_isa
3793 .amdgpu_hsa_kernel hello_world
3798 enable_sgpr_kernarg_segment_ptr = 1
3800 compute_pgm_rsrc1_vgprs = 0
3801 compute_pgm_rsrc1_sgprs = 0
3802 compute_pgm_rsrc2_user_sgpr = 2
3803 kernarg_segment_byte_size = 8
3804 wavefront_sgpr_count = 2
3805 workitem_vgpr_count = 3
3806 .end_amd_kernel_code_t
3808 s_load_dwordx2 s[0:1], s[0:1] 0x0
3809 v_mov_b32 v0, 3.14159
3810 s_waitcnt lgkmcnt(0)
3813 flat_store_dword v[1:2], v0
3816 .size hello_world, .Lfunc_end0-hello_world
3818 Additional Documentation
3819 ========================
3821 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3822 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3823 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3824 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3825 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3826 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3827 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3828 .. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
3829 .. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3830 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3831 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3832 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3833 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3834 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3835 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3836 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3837 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3838 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3839 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__