1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
39 ============ ======== ========= ===========
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
46 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
49 ``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
53 ``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
58 ``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
62 ``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
66 ``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
71 .. _amdgpu-processors:
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
86 ========== =========== ============ ===== ======= ==================
87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88 --------------------------------------------------------------------
93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94 --------------------------------------------------------------------
98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99 --------------------------------------------------------------------
105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106 --------------------------------------------------------------------
111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112 --------------------------------------------------------------------
113 gfx600 - tahiti amdgcn dGPU
114 gfx601 - pitcairn amdgcn dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
124 \ - kaveri amdgcn APU - A6-7000
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
138 gfx702 dGPU ROCm - Radeon R9 290
142 gfx703 - kabini amdgcn APU - E1-2100
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
161 gfx801 - carrizo amdgcn APU - A6-8500P
167 \ amdgcn APU ROCm - A10-8700P
170 \ amdgcn APU - A10-9600P
176 \ amdgcn APU - E2-9010
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
185 - Radeon Instinct MI8
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
188 - Radeon Instinct MI6
189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
192 **GCN GFX9** [AMD-GCN-GFX9]_
193 --------------------------------------------------------------------
194 gfx900 amdgcn dGPU - Radeon Vega
200 - Radeon Instinct MI25
201 gfx901 amdgcn dGPU ROCm Same as gfx900
204 gfx902 amdgcn APU *TBA*
209 gfx903 amdgcn APU Same as gfx902
212 ========== =========== ============ ===== ======= ==================
214 .. _amdgpu-address-spaces:
219 The AMDGPU backend uses the following address space mappings.
221 The memory space names used in the table, aside from the region memory space, is
222 from the OpenCL standard.
224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
255 This will shortly be the only address space mapping for all languages using
258 .. _amdgpu-memory-scopes:
263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
264 backend memory model when the target triple OS is ``amdhsa`` (see
265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
267 The memory model supported is based on the HSA memory model [HSA]_ which is
268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269 relation is transitive over the synchonizes-with relation independent of scope,
270 and synchonizes-with allows the memory scope instances to be inclusive (see
271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274 inclusion and requires the memory scopes to exactly match. However, this
275 is conservatively correct for OpenCL.
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
295 - ``wavefront`` and executed by a thread in the same
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
304 - ``system`` or ``agent`` and executed by a thread on the
306 - ``workgroup`` and executed by a thread in the same
308 - ``wavefront`` and executed by a thread in the same
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
340 The AMDGPU backend implements the following intrinsics.
342 *This section is WIP.*
345 List AMDGPU intrinsics
350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351 can be linked by ``lld`` to produce a standard ELF shared code object which can
352 be loaded and executed on an AMDGPU target.
357 The AMDGPU backend uses the following ELF header:
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
362 ========================== ===============================
364 ========================== ===============================
365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
377 ========================== ===============================
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
384 =============================== =====
386 =============================== =====
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
396 ``e_ident[EI_CLASS]``
397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
403 ``e_ident[EI_OSABI]``
404 One of the following AMD GPU architecture specific OS ABIs:
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
415 ``e_ident[EI_ABIVERSION]``
416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
429 Can be one of the following values:
433 The type produced by the AMD GPU backend compiler as it is relocatable code
437 The type produced by the linker as it is a shared code object.
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
452 The value is 0 as no flags are used.
457 An AMDGPU target ELF code object has the standard ELF sections which include:
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
462 ================== ================ =================================
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
483 These sections have their standard meanings (see [ELF]_) and are only generated
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
490 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
491 The standard sections used by a dynamic loader.
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
497 ``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
513 .. _amdgpu-note-records:
518 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520 minimal zero byte padding must be generated to ensure the ``desc`` field size is
521 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522 be at least 4 to indicate at least 8 byte alignment.
524 The AMDGPU backend code object uses the following ELF note records in the
525 ``.note`` section. The *Description* column specifies the layout of the note
526 record's ``desc`` field. All fields are consecutive bytes. Note records with
527 variable size strings have a corresponding ``*_size`` field that specifies the
528 number of bytes, including the terminating null character, in the string. The
529 string(s) come immediately after the preceding fields.
531 Additional note records can be present.
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
548 ============================== =====
550 ============================== =====
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
556 ``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
564 The null terminated string has the following syntax:
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
577 The vendor from table :ref:`amdgpu-target-triples-table`.
579 For the AMDGPU backend this is always ``amd``.
582 The OS from table :ref:`amdgpu-target-triples-table`.
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
588 For the AMDGPU backend this is currently always blank.
590 The processor from table :ref:`amdgpu-processors-table`.
594 ``amdgcn-amd-amdhsa--gfx901``
596 ``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
608 Symbols include the following:
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
624 Global variables both used and defined by the compilation unit.
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
649 Every HSA kernel also has a symbol for its machine code entry point.
651 .. _amdgpu-relocation-records:
656 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657 relocatable fields are:
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
669 Following notations are used for specifying relocation calculations:
672 Represents the addend used to compute the value of the relocatable field.
675 Represents the offset into the global offset table at which the relocation
676 entry's symbol will reside during execution.
679 Represents the address of the global offset table.
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
686 Represents the value of the symbol whose index resides in the relocation
687 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
690 Represents the base address of a loaded executable or shared object which is
691 the difference between the ELF address and the actual load address. Relocations
692 using this are only valid in executable or shared objects.
694 The following relocation types are supported:
696 .. table:: AMDGPU ELF Relocation Records
697 :name: amdgpu-elf-relocation-records-table
699 ========================== ===== ========== ==============================
700 Relocation Type Value Field Calculation
701 ========================== ===== ========== ==============================
702 ``R_AMDGPU_NONE`` 0 *none* *none*
703 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
704 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
705 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
706 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
707 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
708 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
709 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
710 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
711 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
712 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
713 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
715 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
716 ========================== ===== ========== ==============================
723 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
724 information that maps the code object executable code and data to the source
725 language constructs. It can be used by tools such as debuggers and profilers.
727 Address Space Mapping
728 ~~~~~~~~~~~~~~~~~~~~~
730 The following address space mapping is used:
732 .. table:: AMDGPU DWARF Address Space Mapping
733 :name: amdgpu-dwarf-address-space-mapping-table
735 =================== =================
736 DWARF Address Space Memory Space
737 =================== =================
742 *omitted* Generic (Flat)
743 *not supported* Region (GDS)
744 =================== =================
746 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
749 An ``address_class`` attribute is generated on pointer type DIEs to specify the
750 DWARF address space of the value of the pointer when it is in the *private* or
751 *local* address space. Otherwise the attribute is omitted.
753 An ``XDEREF`` operation is generated in location list expressions for variables
754 that are allocated in the *private* and *local* address space. Otherwise no
755 ``XDREF`` is omitted.
760 *This section is WIP.*
763 Define DWARF register enumeration.
765 If want to present a wavefront state then should expose vector registers as
766 64 wide (rather than per work-item view that LLVM uses). Either as separate
767 registers, or a 64x4 byte single register. In either case use a new LANE op
768 (akin to XDREF) to select the current lane usage in a location
769 expression. This would also allow scalar register spilling to vector register
770 lanes to be expressed (currently no debug information is being generated for
771 spilling). If choose a wide single register approach then use LANE in
772 conjunction with PIECE operation to select the dword part of the register for
773 the current lane. If the separate register approach then use LANE to select
779 *This section is WIP.*
782 DWARF extension to include runtime generated source text.
784 .. _amdgpu-code-conventions:
789 This section provides code conventions used for each supported target triple OS
790 (see :ref:`amdgpu-target-triples`).
795 This section provides code conventions used when the target triple OS is
796 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
798 .. _amdgpu-amdhsa-hsa-code-object-metadata:
803 The code object metadata specifies extensible metadata associated with the code
804 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
805 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
806 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
807 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
808 information necessary to support the ROCM kernel queries. For example, the
809 segment sizes needed in a dispatch packet. In addition, a high level language
810 runtime may require other information to be included. For example, the AMD
811 OpenCL runtime records kernel argument information.
813 The metadata is specified as a YAML formatted string (see [YAML]_ and
817 Is the string null terminated? It probably should not if YAML allows it to
818 contain null characters, otherwise it should be.
820 The metadata is represented as a single YAML document comprised of the mapping
821 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
824 For boolean values, the string values of ``false`` and ``true`` are used for
825 false and true respectively.
827 Additional information can be added to the mappings. To avoid conflicts, any
828 non-AMD key names should be prefixed by "*vendor-name*.".
830 .. table:: AMDHSA Code Object Metadata Mapping
831 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
833 ========== ============== ========= =======================================
834 String Key Value Type Required? Description
835 ========== ============== ========= =======================================
836 "Version" sequence of Required - The first integer is the major
837 2 integers version. Currently 1.
838 - The second integer is the minor
839 version. Currently 0.
840 "Printf" sequence of Each string is encoded information
841 strings about a printf function call. The
842 encoded information is organized as
843 fields separated by colon (':'):
845 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
850 A 32 bit integer as a unique id for
851 each printf function call
854 A 32 bit integer equal to the number
855 of arguments of printf function call
858 ``S[i]`` (where i = 0, 1, ... , N-1)
859 32 bit integers for the size in bytes
860 of the i-th FormatString argument of
861 the printf function call
864 The format string passed to the
865 printf function call.
866 "Kernels" sequence of Required Sequence of the mappings for each
867 mapping kernel in the code object. See
868 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
869 for the definition of the mapping.
870 ========== ============== ========= =======================================
874 .. table:: AMDHSA Code Object Kernel Metadata Mapping
875 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
877 ================= ============== ========= ================================
878 String Key Value Type Required? Description
879 ================= ============== ========= ================================
880 "Name" string Required Source name of the kernel.
881 "SymbolName" string Required Name of the kernel
882 descriptor ELF symbol.
883 "Language" string Source language of the kernel.
891 "LanguageVersion" sequence of - The first integer is the major
893 - The second integer is the
895 "Attrs" mapping Mapping of kernel attributes.
897 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
898 for the mapping definition.
899 "Args" sequence of Sequence of mappings of the
900 mapping kernel arguments. See
901 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
902 for the definition of the mapping.
903 "CodeProps" mapping Mapping of properties related to
905 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
906 for the mapping definition.
907 "DebugProps" mapping Mapping of properties related to
908 the kernel debugging. See
909 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
910 for the mapping definition.
911 ================= ============== ========= ================================
915 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
918 =================== ============== ========= ==============================
919 String Key Value Type Required? Description
920 =================== ============== ========= ==============================
921 "ReqdWorkGroupSize" sequence of The dispatch work-group size
922 3 integers X, Y, Z must correspond to the
925 Corresponds to the OpenCL
926 ``reqd_work_group_size``
928 "WorkGroupSizeHint" sequence of The dispatch work-group size
929 3 integers X, Y, Z is likely to be the
932 Corresponds to the OpenCL
933 ``work_group_size_hint``
935 "VecTypeHint" string The name of a scalar or vector
938 Corresponds to the OpenCL
939 ``vec_type_hint`` attribute.
941 "RuntimeHandle" string The external symbol name
942 associated with a kernel.
943 OpenCL runtime allocates a
944 global buffer for the symbol
945 and saves the kernel's address
946 to it, which is used for
947 device side enqueueing. Only
948 available for device side
950 =================== ============== ========= ==============================
954 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
955 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
957 ================= ============== ========= ================================
958 String Key Value Type Required? Description
959 ================= ============== ========= ================================
960 "Name" string Kernel argument name.
961 "TypeName" string Kernel argument type name.
962 "Size" integer Required Kernel argument size in bytes.
963 "Align" integer Required Kernel argument alignment in
964 bytes. Must be a power of two.
965 "ValueKind" string Required Kernel argument kind that
966 specifies how to set up the
967 corresponding argument.
971 The argument is copied
972 directly into the kernarg.
975 A global address space pointer
976 to the buffer data is passed
979 "DynamicSharedPointer"
980 A group address space pointer
981 to dynamically allocated LDS
982 is passed in the kernarg.
985 A global address space
986 pointer to a S# is passed in
990 A global address space
991 pointer to a T# is passed in
995 A global address space pointer
996 to an OpenCL pipe is passed in
1000 A global address space pointer
1001 to an OpenCL device enqueue
1002 queue is passed in the
1005 "HiddenGlobalOffsetX"
1006 The OpenCL grid dispatch
1007 global offset for the X
1008 dimension is passed in the
1011 "HiddenGlobalOffsetY"
1012 The OpenCL grid dispatch
1013 global offset for the Y
1014 dimension is passed in the
1017 "HiddenGlobalOffsetZ"
1018 The OpenCL grid dispatch
1019 global offset for the Z
1020 dimension is passed in the
1024 An argument that is not used
1025 by the kernel. Space needs to
1026 be left for it, but it does
1027 not need to be set up.
1029 "HiddenPrintfBuffer"
1030 A global address space pointer
1031 to the runtime printf buffer
1032 is passed in kernarg.
1034 "HiddenDefaultQueue"
1035 A global address space pointer
1036 to the OpenCL device enqueue
1037 queue that should be used by
1038 the kernel by default is
1039 passed in the kernarg.
1041 "HiddenCompletionAction"
1042 A global address space pointer
1043 to help link enqueued kernels into
1044 the ancestor tree for determining
1045 when the parent kernel has finished.
1047 "ValueType" string Required Kernel argument value type. Only
1048 present if "ValueKind" is
1049 "ByValue". For vector data
1050 types, the value is for the
1051 element type. Values include:
1067 How can it be determined if a
1068 vector type, and what size
1070 "PointeeAlign" integer Alignment in bytes of pointee
1071 type for pointer type kernel
1072 argument. Must be a power
1073 of 2. Only present if
1075 "DynamicSharedPointer".
1076 "AddrSpaceQual" string Kernel argument address space
1077 qualifier. Only present if
1078 "ValueKind" is "GlobalBuffer" or
1079 "DynamicSharedPointer". Values
1090 Is GlobalBuffer only Global
1092 DynamicSharedPointer always
1093 Local? Can HCC allow Generic?
1094 How can Private or Region
1096 "AccQual" string Kernel argument access
1097 qualifier. Only present if
1098 "ValueKind" is "Image" or
1109 "ActualAccQual" string The actual memory accesses
1110 performed by the kernel on the
1111 kernel argument. Only present if
1112 "ValueKind" is "GlobalBuffer",
1113 "Image", or "Pipe". This may be
1114 more restrictive than indicated
1115 by "AccQual" to reflect what the
1116 kernel actual does. If not
1117 present then the runtime must
1118 assume what is implied by
1119 "AccQual" and "IsConst". Values
1126 "IsConst" boolean Indicates if the kernel argument
1127 is const qualified. Only present
1131 "IsRestrict" boolean Indicates if the kernel argument
1132 is restrict qualified. Only
1133 present if "ValueKind" is
1136 "IsVolatile" boolean Indicates if the kernel argument
1137 is volatile qualified. Only
1138 present if "ValueKind" is
1141 "IsPipe" boolean Indicates if the kernel argument
1142 is pipe qualified. Only present
1143 if "ValueKind" is "Pipe".
1146 Can GlobalBuffer be pipe
1148 ================= ============== ========= ================================
1152 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1153 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1155 ============================ ============== ========= =====================
1156 String Key Value Type Required? Description
1157 ============================ ============== ========= =====================
1158 "KernargSegmentSize" integer Required The size in bytes of
1160 that holds the values
1163 "GroupSegmentFixedSize" integer Required The amount of group
1167 bytes. This does not
1169 dynamically allocated
1170 group segment memory
1174 "PrivateSegmentFixedSize" integer Required The amount of fixed
1175 private address space
1176 memory required for a
1180 is 1 then additional
1182 to this value for the
1184 "KernargSegmentAlign" integer Required The maximum byte
1187 kernarg segment. Must
1189 "WavefrontSize" integer Required Wavefront size. Must
1191 "NumSGPRs" integer Number of scalar
1195 includes the special
1201 SGPR added if a trap
1207 "NumVGPRs" integer Number of vector
1211 "MaxFlatWorkGroupSize" integer Maximum flat
1214 kernel in work-items.
1215 "IsDynamicCallStack" boolean Indicates if the
1220 "IsXNACKEnabled" boolean Indicates if the
1224 ============================ ============== ========= =====================
1228 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1231 =================================== ============== ========= ==============
1232 String Key Value Type Required? Description
1233 =================================== ============== ========= ==============
1234 "DebuggerABIVersion" sequence of
1236 "ReservedNumVGPRs" integer
1237 "ReservedFirstVGPR" integer
1238 "PrivateSegmentBufferSGPR" integer
1239 "WavefrontPrivateSegmentOffsetSGPR" integer
1240 =================================== ============== ========= ==============
1243 Plan to remove the debug properties metadata.
1248 The HSA architected queuing language (AQL) defines a user space memory interface
1249 that can be used to control the dispatch of kernels, in an agent independent
1250 way. An agent can have zero or more AQL queues created for it using the ROCm
1251 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1252 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1253 mechanics and packet layouts.
1255 The packet processor of a kernel agent is responsible for detecting and
1256 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1257 packet processor is implemented by the hardware command processor (CP),
1258 asynchronous dispatch controller (ADC) and shader processor input controller
1261 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1262 mode driver to initialize and register the AQL queue with CP.
1264 To dispatch a kernel the following actions are performed. This can occur in the
1265 CPU host program, or from an HSA kernel executing on a GPU.
1267 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1268 executed is obtained.
1269 2. A pointer to the kernel descriptor (see
1270 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1271 obtained. It must be for a kernel that is contained in a code object that that
1272 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1274 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1275 for a memory region with the kernarg property for the kernel agent that will
1276 execute the kernel. It must be at least 16 byte aligned.
1277 4. Kernel argument values are assigned to the kernel argument memory
1278 allocation. The layout is defined in the *HSA Programmer's Language Reference*
1279 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1280 memory in the same way constant memory is accessed. (Note that the HSA
1281 specification allows an implementation to copy the kernel argument contents to
1282 another location that is accessed by the kernel.)
1283 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1284 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1285 packet. The packet must be set up, and the final write must use an atomic
1286 store release to set the packet kind to ensure the packet contents are
1287 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1288 notify the kernel agent that the AQL queue has been updated. These rules, and
1289 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1290 System Architecture Specification* [HSA]_.
1291 6. A kernel dispatch packet includes information about the actual dispatch,
1292 such as grid and work-group size, together with information from the code
1293 object about the kernel, such as segment sizes. The ROCm runtime queries on
1294 the kernel symbol can be used to obtain the code object values which are
1295 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1296 7. CP executes micro-code and is responsible for detecting and setting up the
1297 GPU to execute the wavefronts of a kernel dispatch.
1298 8. CP ensures that when the a wavefront starts executing the kernel machine
1299 code, the scalar general purpose registers (SGPR) and vector general purpose
1300 registers (VGPR) are set up as required by the machine code. The required
1301 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1302 register state is defined in
1303 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1304 9. The prolog of the kernel machine code (see
1305 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1306 before continuing executing the machine code that corresponds to the kernel.
1307 10. When the kernel dispatch has completed execution, CP signals the completion
1308 signal specified in the kernel dispatch packet if not 0.
1310 .. _amdgpu-amdhsa-memory-spaces:
1315 The memory space properties are:
1317 .. table:: AMDHSA Memory Spaces
1318 :name: amdgpu-amdhsa-memory-spaces-table
1320 ================= =========== ======== ======= ==================
1321 Memory Space Name HSA Segment Hardware Address NULL Value
1323 ================= =========== ======== ======= ==================
1324 Private private scratch 32 0x00000000
1325 Local group LDS 32 0xFFFFFFFF
1326 Global global global 64 0x0000000000000000
1327 Constant constant *same as 64 0x0000000000000000
1329 Generic flat flat 64 0x0000000000000000
1330 Region N/A GDS 32 *not implemented
1332 ================= =========== ======== ======= ==================
1334 The global and constant memory spaces both use global virtual addresses, which
1335 are the same virtual address space used by the CPU. However, some virtual
1336 addresses may only be accessible to the CPU, some only accessible by the GPU,
1339 Using the constant memory space indicates that the data will not change during
1340 the execution of the kernel. This allows scalar read instructions to be
1341 used. The vector and scalar L1 caches are invalidated of volatile data before
1342 each kernel dispatch execution to allow constant memory to change values between
1345 The local memory space uses the hardware Local Data Store (LDS) which is
1346 automatically allocated when the hardware creates work-groups of wavefronts, and
1347 freed when all the wavefronts of a work-group have terminated. The data store
1348 (DS) instructions can be used to access it.
1350 The private memory space uses the hardware scratch memory support. If the kernel
1351 uses scratch, then the hardware allocates memory that is accessed using
1352 wavefront lane dword (4 byte) interleaving. The mapping used from private
1353 address to physical address is:
1355 ``wavefront-scratch-base +
1356 (private-address * wavefront-size * 4) +
1357 (wavefront-lane-id * 4)``
1359 There are different ways that the wavefront scratch base address is determined
1360 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1361 memory can be accessed in an interleaved manner using buffer instruction with
1362 the scratch buffer descriptor and per wave scratch offset, by the scratch
1363 instructions, or by flat instructions. If each lane of a wavefront accesses the
1364 same private address, the interleaving results in adjacent dwords being accessed
1365 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1366 supported except by flat and scratch instructions in GFX9.
1368 The generic address space uses the hardware flat address support available in
1369 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1370 local appertures), that are outside the range of addressible global memory, to
1371 map from a flat address to a private or local address.
1373 FLAT instructions can take a flat address and access global, private (scratch)
1374 and group (LDS) memory depending in if the address is within one of the
1375 apperture ranges. Flat access to scratch requires hardware aperture setup and
1376 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1377 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1378 (see :ref:`amdgpu-amdhsa-m0`).
1380 To convert between a segment address and a flat address the base address of the
1381 appertures address can be used. For GFX7-GFX8 these are available in the
1382 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1383 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1384 GFX9 the appature base addresses are directly available as inline constant
1385 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1386 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1387 which makes it easier to convert from flat to segment or segment to flat.
1392 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1393 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1394 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1395 enumeration values for the queries that are not trivially deducible from the S#
1401 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1402 structure allocated in memory accessible from both the CPU and GPU. The
1403 structure is defined by the ROCm runtime and subject to change between releases
1404 (see [AMD-ROCm-github]_).
1406 .. _amdgpu-amdhsa-hsa-aql-queue:
1411 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1412 between releases (see [AMD-ROCm-github]_). For some processors it contains
1413 fields needed to implement certain language features such as the flat address
1414 aperture bases. It also contains fields used by CP such as managing the
1415 allocation of scratch memory.
1417 .. _amdgpu-amdhsa-kernel-descriptor:
1422 A kernel descriptor consists of the information needed by CP to initiate the
1423 execution of a kernel, including the entry point address of the machine code
1424 that implements the kernel.
1426 Kernel Descriptor for GFX6-GFX9
1427 +++++++++++++++++++++++++++++++
1429 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1431 .. table:: Kernel Descriptor for GFX6-GFX9
1432 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1434 ======= ======= =============================== ============================
1435 Bits Size Field Name Description
1436 ======= ======= =============================== ============================
1437 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
1438 address space memory
1439 required for a work-group
1440 in bytes. This does not
1441 include any dynamically
1442 allocated local address
1443 space memory that may be
1444 added when the kernel is
1446 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
1447 private address space
1448 memory required for a
1449 work-item in bytes. If
1450 is_dynamic_callstack is 1
1451 then additional space must
1452 be added to this value for
1454 95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1455 size supported by the
1456 kernel in work-items.
1457 96 1 bit IsDynamicCallStack Indicates if the generated
1458 machine code is using a
1459 dynamically sized call
1461 97 1 bit IsXNACKEnabled Indicates if the generated
1462 machine code is capable of
1464 127:98 30 bits Reserved, must be 0.
1465 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
1468 descriptor to kernel's
1469 entry point instruction
1470 which must be 256 byte
1472 383:192 24 Reserved, must be 0.
1474 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
1475 program settings used by
1477 ``COMPUTE_PGM_RSRC1``
1480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1481 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
1482 program settings used by
1484 ``COMPUTE_PGM_RSRC2``
1487 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1488 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1489 SGPR user data registers
1491 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1493 The total number of SGPR
1495 requested must not exceed
1496 16 and match value in
1497 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1498 Any requests beyond 16
1500 449 1 bit EnableSGPRDispatchPtr *see above*
1501 450 1 bit EnableSGPRQueuePtr *see above*
1502 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1503 452 1 bit EnableSGPRDispatchID *see above*
1504 453 1 bit EnableSGPRFlatScratchInit *see above*
1505 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1506 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1508 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1510 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1512 463:458 6 bits Reserved, must be 0.
1513 511:464 6 Reserved, must be 0.
1515 512 **Total size 64 bytes.**
1516 ======= ====================================================================
1520 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1521 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1523 ======= ======= =============================== ===========================================================================
1524 Bits Size Field Name Description
1525 ======= ======= =============================== ===========================================================================
1526 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
1527 used by each work-item,
1528 granularity is device
1533 - roundup((max_vgpg + 1)
1536 Used by CP to set up
1537 ``COMPUTE_PGM_RSRC1.VGPRS``.
1538 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1539 used by a wavefront,
1540 granularity is device
1545 - roundup((max_sgpg + 1)
1549 - roundup((max_sgpg + 1)
1552 Includes the special SGPRs
1553 for VCC, Flat Scratch (for
1554 GFX7 onwards) and XNACK
1555 (for GFX8 onwards). It does
1556 not include the 16 SGPR
1557 added if a trap handler is
1560 Used by CP to set up
1561 ``COMPUTE_PGM_RSRC1.SGPRS``.
1562 11:10 2 bits PRIORITY Must be 0.
1564 Start executing wavefront
1565 at the specified priority.
1567 CP is responsible for
1569 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1570 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1571 with specified rounding
1574 precision floating point
1577 Floating point rounding
1578 mode values are defined in
1579 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1581 Used by CP to set up
1582 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1583 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1584 with specified rounding
1585 denorm mode for half/double (16
1586 and 64 bit) floating point
1587 precision floating point
1590 Floating point rounding
1591 mode values are defined in
1592 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1594 Used by CP to set up
1595 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1596 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1597 with specified denorm mode
1600 precision floating point
1603 Floating point denorm mode
1604 values are defined in
1605 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1607 Used by CP to set up
1608 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1609 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1610 with specified denorm mode
1612 and 64 bit) floating point
1613 precision floating point
1616 Floating point denorm mode
1617 values are defined in
1618 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1620 Used by CP to set up
1621 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1622 20 1 bit PRIV Must be 0.
1624 Start executing wavefront
1625 in privilege trap handler
1628 CP is responsible for
1630 ``COMPUTE_PGM_RSRC1.PRIV``.
1631 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1632 with DX10 clamp mode
1633 enabled. Used by the vector
1634 ALU to force DX10 style
1635 treatment of NaN's (when
1636 set, clamp NaN to zero,
1640 Used by CP to set up
1641 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1642 22 1 bit DEBUG_MODE Must be 0.
1644 Start executing wavefront
1645 in single step mode.
1647 CP is responsible for
1649 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1650 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1652 enabled. Floating point
1653 opcodes that support
1654 exception flag gathering
1655 will quiet and propagate
1656 signaling-NaN inputs per
1657 IEEE 754-2008. Min_dx10 and
1658 max_dx10 become IEEE
1659 754-2008 compliant due to
1660 signaling-NaN propagation
1663 Used by CP to set up
1664 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1665 24 1 bit BULKY Must be 0.
1667 Only one work-group allowed
1668 to execute on a compute
1671 CP is responsible for
1673 ``COMPUTE_PGM_RSRC1.BULKY``.
1674 25 1 bit CDBG_USER Must be 0.
1676 Flag that can be used to
1677 control debugging code.
1679 CP is responsible for
1681 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1682 26 1 bit FP16_OVFL GFX6-8
1683 Reserved, must be 0.
1685 Wavefront starts execution
1686 with specified fp16 overflow
1689 - If 0, fp16 overflow generates
1691 - If 1, fp16 overflow that is the
1692 result of an +/-INF input value
1693 or divide by 0 produces a +/-INF,
1694 otherwise clamps computed
1695 overflow to +/-MAX_FP16 as
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1700 31:27 5 bits Reserved, must be 0.
1701 32 **Total size 4 bytes**
1702 ======= ===================================================================================================================
1706 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1707 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1709 ======= ======= =============================== ===========================================================================
1710 Bits Size Field Name Description
1711 ======= ======= =============================== ===========================================================================
1712 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1713 _WAVE_OFFSET SGPR wave scratch offset
1714 system register (see
1715 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1717 Used by CP to set up
1718 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1719 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1721 requested. This number must
1722 match the number of user
1723 data registers enabled.
1725 Used by CP to set up
1726 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1727 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
1728 TRAP instruction which
1729 requires a trap handler to
1733 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1735 installed a trap handler
1736 regardless of the setting
1738 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1739 system SGPR register for
1740 the work-group id in the X
1742 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1744 Used by CP to set up
1745 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1746 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1747 system SGPR register for
1748 the work-group id in the Y
1750 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1752 Used by CP to set up
1753 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1754 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1755 system SGPR register for
1756 the work-group id in the Z
1758 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1760 Used by CP to set up
1761 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1762 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1763 system SGPR register for
1764 work-group information (see
1765 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1767 Used by CP to set up
1768 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1769 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1770 VGPR system registers used
1771 for the work-item ID.
1772 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1775 Used by CP to set up
1776 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1777 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1779 Wavefront starts execution
1781 exceptions enabled which
1782 are generated when L1 has
1783 witnessed a thread access
1787 CP is responsible for
1788 filling in the address
1790 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1791 according to what the
1793 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1795 Wavefront starts execution
1796 with memory violation
1797 exceptions exceptions
1798 enabled which are generated
1799 when a memory violation has
1800 occurred for this wave from
1802 (write-to-read-only-memory,
1803 mis-aligned atomic, LDS
1804 address out of range,
1805 illegal address, etc.).
1809 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1810 according to what the
1812 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
1814 CP uses the rounded value
1815 from the dispatch packet,
1816 not this value, as the
1817 dispatch may contain
1818 dynamically allocated group
1819 segment memory. CP writes
1821 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1823 Amount of group segment
1824 (LDS) to allocate for each
1825 work-group. Granularity is
1829 roundup(lds-size / (64 * 4))
1831 roundup(lds-size / (128 * 4))
1833 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1834 _INVALID_OPERATION with specified exceptions
1837 Used by CP to set up
1838 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1839 (set from bits 0..6).
1843 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1844 _SOURCE input operands is a
1846 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1847 _DIVISION_BY_ZERO Zero
1848 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1850 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1852 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1854 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1855 _ZERO (rcp_iflag_f32 instruction
1857 31 1 bit Reserved, must be 0.
1858 32 **Total size 4 bytes.**
1859 ======= ===================================================================================================================
1863 .. table:: Floating Point Rounding Mode Enumeration Values
1864 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1866 ====================================== ===== ==============================
1867 Enumeration Name Value Description
1868 ====================================== ===== ==============================
1869 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1870 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1871 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1872 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1873 ====================================== ===== ==============================
1877 .. table:: Floating Point Denorm Mode Enumeration Values
1878 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1880 ====================================== ===== ==============================
1881 Enumeration Name Value Description
1882 ====================================== ===== ==============================
1883 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1885 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1886 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1887 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1888 ====================================== ===== ==============================
1892 .. table:: System VGPR Work-Item ID Enumeration Values
1893 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1895 ======================================== ===== ============================
1896 Enumeration Name Value Description
1897 ======================================== ===== ============================
1898 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1900 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1902 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1904 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1905 ======================================== ===== ============================
1907 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1909 Initial Kernel Execution State
1910 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1912 This section defines the register state that will be set up by the packet
1913 processor prior to the start of execution of every wavefront. This is limited by
1914 the constraints of the hardware controllers of CP/ADC/SPI.
1916 The order of the SGPR registers is defined, but the compiler can specify which
1917 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1918 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1919 for enabled registers are dense starting at SGPR0: the first enabled register is
1920 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1923 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1924 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1925 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1926 initialized. These are then immediately followed by the System SGPRs that are
1927 set up by ADC/SPI and can have different values for each wave of the grid
1930 SGPR register initial state is defined in
1931 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1933 .. table:: SGPR Register Set Up Order
1934 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1936 ========== ========================== ====== ==============================
1937 SGPR Order Name Number Description
1938 (kernel descriptor enable of
1940 ========== ========================== ====== ==============================
1941 First Private Segment Buffer 4 V# that can be used, together
1942 (enable_sgpr_private with Scratch Wave Offset as an
1943 _segment_buffer) offset, to access the private
1944 memory space using a segment
1947 CP uses the value provided by
1949 then Dispatch Ptr 2 64 bit address of AQL dispatch
1950 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1952 then Queue Ptr 2 64 bit address of amd_queue_t
1953 (enable_sgpr_queue_ptr) object for AQL queue on which
1954 the dispatch packet was
1956 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1957 (enable_sgpr_kernarg segment. This is directly
1958 _segment_ptr) copied from the
1959 kernarg_address in the kernel
1962 Having CP load it once avoids
1963 loading it at the beginning of
1965 then Dispatch Id 2 64 bit Dispatch ID of the
1966 (enable_sgpr_dispatch_id) dispatch packet being
1968 then Flat Scratch Init 2 This is 2 SGPRs:
1969 (enable_sgpr_flat_scratch
1973 The first SGPR is a 32 bit
1975 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1976 to per SPI base of memory
1977 for scratch for the queue
1978 executing the kernel
1979 dispatch. CP obtains this
1980 from the runtime. (The
1981 Scratch Segment Buffer base
1983 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1984 plus this offset.) The value
1985 of Scratch Wave Offset must
1986 be added to this offset by
1987 the kernel machine code,
1988 right shifted by 8, and
1989 moved to the FLAT_SCRATCH_HI
1991 FLAT_SCRATCH_HI corresponds
1992 to SGPRn-4 on GFX7, and
1993 SGPRn-6 on GFX8 (where SGPRn
1994 is the highest numbered SGPR
1995 allocated to the wave).
1997 multiplied by 256 (as it is
1998 in units of 256 bytes) and
2000 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2001 to calculate the per wave
2002 FLAT SCRATCH BASE in flat
2003 memory instructions that
2007 The second SGPR is 32 bit
2008 byte size of a single
2009 work-item's scratch memory
2010 usage. CP obtains this from
2011 the runtime, and it is
2012 always a multiple of DWORD.
2013 CP checks that the value in
2014 the kernel dispatch packet
2015 Private Segment Byte Size is
2016 not larger, and requests the
2017 runtime to increase the
2018 queue's scratch size if
2019 necessary. The kernel code
2021 FLAT_SCRATCH_LO which is
2022 SGPRn-3 on GFX7 and SGPRn-5
2023 on GFX8. FLAT_SCRATCH_LO is
2024 used as the FLAT SCRATCH
2026 instructions. Having CP load
2027 it once avoids loading it at
2028 the beginning of every
2029 wavefront. GFX9 This is the
2030 64 bit base address of the
2031 per SPI scratch backing
2032 memory managed by SPI for
2033 the queue executing the
2034 kernel dispatch. CP obtains
2035 this from the runtime (and
2036 divides it if there are
2037 multiple Shader Arrays each
2038 with its own SPI). The value
2039 of Scratch Wave Offset must
2040 be added by the kernel
2041 machine code and the result
2042 moved to the FLAT_SCRATCH
2043 SGPR which is SGPRn-6 and
2044 SGPRn-5. It is used as the
2045 FLAT SCRATCH BASE in flat
2046 memory instructions. then
2047 Private Segment Size 1 The
2048 32 bit byte size of a
2049 (enable_sgpr_private single
2051 scratch_segment_size) memory
2052 allocation. This is the
2053 value from the kernel
2054 dispatch packet Private
2055 Segment Byte Size rounded up
2056 by CP to a multiple of
2059 Having CP load it once avoids
2060 loading it at the beginning of
2063 This is not used for
2064 GFX7-GFX8 since it is the same
2065 value as the second SGPR of
2066 Flat Scratch Init. However, it
2067 may be needed for GFX9 which
2068 changes the meaning of the
2069 Flat Scratch Init value.
2070 then Grid Work-Group Count X 1 32 bit count of the number of
2071 (enable_sgpr_grid work-groups in the X dimension
2072 _workgroup_count_X) for the grid being
2073 executed. Computed from the
2074 fields in the kernel dispatch
2075 packet as ((grid_size.x +
2076 workgroup_size.x - 1) /
2078 then Grid Work-Group Count Y 1 32 bit count of the number of
2079 (enable_sgpr_grid work-groups in the Y dimension
2080 _workgroup_count_Y && for the grid being
2081 less than 16 previous executed. Computed from the
2082 SGPRs) fields in the kernel dispatch
2083 packet as ((grid_size.y +
2084 workgroup_size.y - 1) /
2087 Only initialized if <16
2088 previous SGPRs initialized.
2089 then Grid Work-Group Count Z 1 32 bit count of the number of
2090 (enable_sgpr_grid work-groups in the Z dimension
2091 _workgroup_count_Z && for the grid being
2092 less than 16 previous executed. Computed from the
2093 SGPRs) fields in the kernel dispatch
2094 packet as ((grid_size.z +
2095 workgroup_size.z - 1) /
2098 Only initialized if <16
2099 previous SGPRs initialized.
2100 then Work-Group Id X 1 32 bit work-group id in X
2101 (enable_sgpr_workgroup_id dimension of grid for
2103 then Work-Group Id Y 1 32 bit work-group id in Y
2104 (enable_sgpr_workgroup_id dimension of grid for
2106 then Work-Group Id Z 1 32 bit work-group id in Z
2107 (enable_sgpr_workgroup_id dimension of grid for
2109 then Work-Group Info 1 {first_wave, 14'b0000,
2110 (enable_sgpr_workgroup ordered_append_term[10:0],
2111 _info) threadgroup_size_in_waves[5:0]}
2112 then Scratch Wave Offset 1 32 bit byte offset from base
2113 (enable_sgpr_private of scratch base of queue
2114 _segment_wave_offset) executing the kernel
2115 dispatch. Must be used as an
2117 segment address when using
2118 Scratch Segment Buffer. It
2119 must be used to set up FLAT
2120 SCRATCH for flat addressing
2122 :ref:`amdgpu-amdhsa-flat-scratch`).
2123 ========== ========================== ====== ==============================
2125 The order of the VGPR registers is defined, but the compiler can specify which
2126 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2127 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2128 for enabled registers are dense starting at VGPR0: the first enabled register is
2129 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2132 VGPR register initial state is defined in
2133 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2135 .. table:: VGPR Register Set Up Order
2136 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2138 ========== ========================== ====== ==============================
2139 VGPR Order Name Number Description
2140 (kernel descriptor enable of
2142 ========== ========================== ====== ==============================
2143 First Work-Item Id X 1 32 bit work item id in X
2144 (Always initialized) dimension of work-group for
2146 then Work-Item Id Y 1 32 bit work item id in Y
2147 (enable_vgpr_workitem_id dimension of work-group for
2148 > 0) wavefront lane.
2149 then Work-Item Id Z 1 32 bit work item id in Z
2150 (enable_vgpr_workitem_id dimension of work-group for
2151 > 1) wavefront lane.
2152 ========== ========================== ====== ==============================
2154 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2156 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2158 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2159 combination including none.
2160 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2161 cannot included with the flat scratch init value which is per queue.
2162 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2165 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2166 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2168 The global segment can be accessed either using buffer instructions (GFX6 which
2169 has V# 64 bit address support), flat instructions (GFX7-9), or global
2170 instructions (GFX9).
2172 If buffer operations are used then the compiler can generate a V# with the
2173 following properties:
2177 * ATC: 1 if IOMMU present (such as APU)
2179 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2180 APU and NC for dGPU).
2182 .. _amdgpu-amdhsa-kernel-prolog:
2187 .. _amdgpu-amdhsa-m0:
2193 The M0 register must be initialized with a value at least the total LDS size
2194 if the kernel may access LDS via DS or flat operations. Total LDS size is
2195 available in dispatch packet. For M0, it is also possible to use maximum
2196 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2199 The M0 register is not used for range checking LDS accesses and so does not
2200 need to be initialized in the prolog.
2202 .. _amdgpu-amdhsa-flat-scratch:
2207 If the kernel may use flat operations to access scratch memory, the prolog code
2208 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2209 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2210 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2213 Flat scratch is not supported.
2216 1. The low word of Flat Scratch Init is 32 bit byte offset from
2217 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2218 being managed by SPI for the queue executing the kernel dispatch. This is
2219 the same value used in the Scratch Segment Buffer V# base address. The
2220 prolog must add the value of Scratch Wave Offset to get the wave's byte
2221 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2222 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2223 by 8 before moving into FLAT_SCRATCH_LO.
2224 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2225 work-items scratch memory usage. This is directly loaded from the kernel
2226 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2227 DWORD. Having CP load it once avoids loading it at the beginning of every
2228 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2231 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2232 memory being managed by SPI for the queue executing the kernel dispatch. The
2233 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2234 pair for use as the flat scratch base in flat memory instructions.
2236 .. _amdgpu-amdhsa-memory-model:
2241 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2242 (see :ref:`memmodel`). *The implementation is WIP.*
2245 Update when implementation complete.
2247 The AMDGPU backend supports the memory synchronization scopes specified in
2248 :ref:`amdgpu-memory-scopes`.
2250 The code sequences used to implement the memory model are defined in table
2251 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2253 The sequences specify the order of instructions that a single thread must
2254 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2255 to other memory instructions executed by the same thread. This allows them to be
2256 moved earlier or later which can allow them to be combined with other instances
2257 of the same instruction, or hoisted/sunk out of loops to improve
2258 performance. Only the instructions related to the memory model are given;
2259 additional ``s_waitcnt`` instructions are required to ensure registers are
2260 defined before being used. These may be able to be combined with the memory
2261 model ``s_waitcnt`` instructions as described above.
2263 The AMDGPU backend supports the following memory models:
2265 HSA Memory Model [HSA]_
2266 The HSA memory model uses a single happens-before relation for all address
2267 spaces (see :ref:`amdgpu-address-spaces`).
2268 OpenCL Memory Model [OpenCL]_
2269 The OpenCL memory model which has separate happens-before relations for the
2270 global and local address spaces. Only a fence specifying both global and
2271 local address space, and seq_cst instructions join the relationships. Since
2272 the LLVM ``memfence`` instruction does not allow an address space to be
2273 specified the OpenCL fence has to convervatively assume both local and
2274 global address space was specified. However, optimizations can often be
2275 done to eliminate the additional ``s_waitcnt`` instructions when there are
2276 no intervening memory instructions which access the corresponding address
2277 space. The code sequences in the table indicate what can be omitted for the
2278 OpenCL memory. The target triple environment is used to determine if the
2279 source language is OpenCL (see :ref:`amdgpu-opencl`).
2281 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2284 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2285 termed vector memory operations.
2289 * Each agent has multiple compute units (CU).
2290 * Each CU has multiple SIMDs that execute wavefronts.
2291 * The wavefronts for a single work-group are executed in the same CU but may be
2292 executed by different SIMDs.
2293 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2295 * All LDS operations of a CU are performed as wavefront wide operations in a
2296 global order and involve no caching. Completion is reported to a wavefront in
2298 * The LDS memory has multiple request queues shared by the SIMDs of a
2299 CU. Therefore, the LDS operations performed by different waves of a work-group
2300 can be reordered relative to each other, which can result in reordering the
2301 visibility of vector memory operations with respect to LDS operations of other
2302 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2303 ensure synchronization between LDS operations and vector memory operations
2304 between waves of a work-group, but not between operations performed by the
2306 * The vector memory operations are performed as wavefront wide operations and
2307 completion is reported to a wavefront in execution order. The exception is
2308 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2309 vector memory order if they access LDS memory, and out of LDS operation order
2310 if they access global memory.
2311 * The vector memory operations access a single vector L1 cache shared by all
2312 SIMDs a CU. Therefore, no special action is required for coherence between the
2313 lanes of a single wavefront, or for coherence between wavefronts in the same
2314 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2315 executing in different work-groups as they may be executing on different CUs.
2316 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2317 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2318 scalar operations are used in a restricted way so do not impact the memory
2319 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2320 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2322 * The L2 cache has independent channels to service disjoint ranges of virtual
2324 * Each CU has a separate request queue per channel. Therefore, the vector and
2325 scalar memory operations performed by waves executing in different work-groups
2326 (which may be executing on different CUs) of an agent can be reordered
2327 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2328 synchronization between vector memory operations of different CUs. It ensures a
2329 previous vector memory operation has completed before executing a subsequent
2330 vector memory or LDS operation and so can be used to meet the requirements of
2331 acquire and release.
2332 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2333 of virtual addresses can be set up to bypass it to ensure system coherence.
2335 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2336 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2337 memory, atomic memory orderings are not meaningful and all accesses are treated
2340 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2341 scalar memory instructions). Since the constant address space contents do not
2342 change during the execution of a kernel dispatch it is not legal to perform
2343 stores, and atomic memory orderings are not meaningful and all access are
2344 treated as non-atomic.
2346 A memory synchronization scope wider than work-group is not meaningful for the
2347 group (LDS) address space and is treated as work-group.
2349 The memory model does not support the region address space which is treated as
2352 Acquire memory ordering is not meaningful on store atomic instructions and is
2353 treated as non-atomic.
2355 Release memory ordering is not meaningful on load atomic instructions and is
2356 treated a non-atomic.
2358 Acquire-release memory ordering is not meaningful on load or store atomic
2359 instructions and is treated as acquire and release respectively.
2361 AMDGPU backend only uses scalar memory operations to access memory that is
2362 proven to not change during the execution of the kernel dispatch. This includes
2363 constant address space and global address space for program scope const
2364 variables. Therefore the kernel machine code does not have to maintain the
2365 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2366 and vector L1 caches are invalidated between kernel dispatches by CP since
2367 constant address space data may change between kernel dispatch executions. See
2368 :ref:`amdgpu-amdhsa-memory-spaces`.
2370 The one execption is if scalar writes are used to spill SGPR registers. In this
2371 case the AMDGPU backend ensures the memory location used to spill is never
2372 accessed by vector memory operations at the same time. If scalar writes are used
2373 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2374 return since the locations may be used for vector memory instructions by a
2375 future wave that uses the same scratch area, or a function call that creates a
2376 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2377 as all scalar writes are write-before-read in the same thread.
2379 Scratch backing memory (which is used for the private address space)
2380 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2381 address space is only accessed by a single thread, and is always
2382 write-before-read, there is never a need to invalidate these entries from the L1
2383 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2384 volatile cache lines.
2386 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2387 to invalidate the L2 cache. This also causes it to be treated as
2388 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2389 (cache coherent) and so the L2 cache will coherent with the CPU and other
2392 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2393 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2395 ============ ============ ============== ========== ===============================
2396 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2397 Ordering Sync Scope Address
2399 ============ ============ ============== ========== ===============================
2401 -----------------------------------------------------------------------------------
2402 load *none* *none* - global - !volatile & !nontemporal
2404 - private 1. buffer/global/flat_load
2406 - volatile & !nontemporal
2408 1. buffer/global/flat_load
2413 1. buffer/global/flat_load
2416 load *none* *none* - local 1. ds_load
2417 store *none* *none* - global - !nontemporal
2419 - private 1. buffer/global/flat_store
2423 1. buffer/global/flat_stote
2426 store *none* *none* - local 1. ds_store
2427 **Unordered Atomic**
2428 -----------------------------------------------------------------------------------
2429 load atomic unordered *any* *any* *Same as non-atomic*.
2430 store atomic unordered *any* *any* *Same as non-atomic*.
2431 atomicrmw unordered *any* *any* *Same as monotonic
2433 **Monotonic Atomic**
2434 -----------------------------------------------------------------------------------
2435 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2436 - wavefront - generic
2438 load atomic monotonic - singlethread - local 1. ds_load
2441 load atomic monotonic - agent - global 1. buffer/global/flat_load
2442 - system - generic glc=1
2443 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2444 - wavefront - generic
2448 store atomic monotonic - singlethread - local 1. ds_store
2451 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2452 - wavefront - generic
2456 atomicrmw monotonic - singlethread - local 1. ds_atomic
2460 -----------------------------------------------------------------------------------
2461 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2464 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2465 load atomic acquire - workgroup - local 1. ds_load
2466 2. s_waitcnt lgkmcnt(0)
2469 - Must happen before
2481 load atomic acquire - workgroup - generic 1. flat_load
2482 2. s_waitcnt lgkmcnt(0)
2485 - Must happen before
2497 load atomic acquire - agent - global 1. buffer/global/flat_load
2499 2. s_waitcnt vmcnt(0)
2501 - Must happen before
2509 3. buffer_wbinvl1_vol
2511 - Must happen before
2521 load atomic acquire - agent - generic 1. flat_load glc=1
2522 - system 2. s_waitcnt vmcnt(0) &
2527 - Must happen before
2530 - Ensures the flat_load
2535 3. buffer_wbinvl1_vol
2537 - Must happen before
2547 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2550 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2551 atomicrmw acquire - workgroup - local 1. ds_atomic
2552 2. waitcnt lgkmcnt(0)
2555 - Must happen before
2568 atomicrmw acquire - workgroup - generic 1. flat_atomic
2569 2. waitcnt lgkmcnt(0)
2572 - Must happen before
2585 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
2586 - system 2. s_waitcnt vmcnt(0)
2588 - Must happen before
2597 3. buffer_wbinvl1_vol
2599 - Must happen before
2609 atomicrmw acquire - agent - generic 1. flat_atomic
2610 - system 2. s_waitcnt vmcnt(0) &
2615 - Must happen before
2624 3. buffer_wbinvl1_vol
2626 - Must happen before
2636 fence acquire - singlethread *none* *none*
2638 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2643 - However, since LLVM
2668 fence-paired-atomic).
2669 - Must happen before
2680 fence-paired-atomic.
2682 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2689 - However, since LLVM
2697 - Could be split into
2706 - s_waitcnt vmcnt(0)
2717 fence-paired-atomic).
2718 - s_waitcnt lgkmcnt(0)
2729 fence-paired-atomic).
2730 - Must happen before
2744 fence-paired-atomic.
2746 2. buffer_wbinvl1_vol
2748 - Must happen before any
2749 following global/generic
2759 -----------------------------------------------------------------------------------
2760 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2763 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2772 - Must happen before
2783 2. buffer/global/flat_store
2784 store atomic release - workgroup - local 1. ds_store
2785 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2794 - Must happen before
2806 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2807 - system - generic vmcnt(0)
2811 - Could be split into
2820 - s_waitcnt vmcnt(0)
2827 - s_waitcnt lgkmcnt(0)
2834 - Must happen before
2845 2. buffer/global/ds/flat_store
2846 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2849 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2858 - Must happen before
2869 2. buffer/global/flat_atomic
2870 atomicrmw release - workgroup - local 1. ds_atomic
2871 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2880 - Must happen before
2892 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2893 - system - generic vmcnt(0)
2897 - Could be split into
2906 - s_waitcnt vmcnt(0)
2913 - s_waitcnt lgkmcnt(0)
2920 - Must happen before
2931 2. buffer/global/ds/flat_atomic
2932 fence release - singlethread *none* *none*
2934 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2939 - However, since LLVM
2960 - Must happen before
2969 fence-paired-atomic).
2976 fence-paired-atomic.
2978 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
2989 - However, since LLVM
3004 - Could be split into
3013 - s_waitcnt vmcnt(0)
3020 - s_waitcnt lgkmcnt(0)
3027 - Must happen before
3036 fence-paired-atomic).
3043 fence-paired-atomic.
3045 **Acquire-Release Atomic**
3046 -----------------------------------------------------------------------------------
3047 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3050 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3059 - Must happen before
3070 2. buffer/global/flat_atomic
3071 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3072 2. s_waitcnt lgkmcnt(0)
3075 - Must happen before
3088 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3097 - Must happen before
3109 3. s_waitcnt lgkmcnt(0)
3112 - Must happen before
3125 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3130 - Could be split into
3139 - s_waitcnt vmcnt(0)
3146 - s_waitcnt lgkmcnt(0)
3153 - Must happen before
3164 2. buffer/global/flat_atomic
3165 3. s_waitcnt vmcnt(0)
3167 - Must happen before
3176 4. buffer_wbinvl1_vol
3178 - Must happen before
3188 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3193 - Could be split into
3202 - s_waitcnt vmcnt(0)
3209 - s_waitcnt lgkmcnt(0)
3216 - Must happen before
3228 3. s_waitcnt vmcnt(0) &
3233 - Must happen before
3242 4. buffer_wbinvl1_vol
3244 - Must happen before
3254 fence acq_rel - singlethread *none* *none*
3256 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3276 - Must happen before
3299 acquire-fence-paired-atomic
3320 release-fence-paired-atomic
3321 ). This satisfies the
3325 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3332 - However, since LLVM
3340 - Could be split into
3349 - s_waitcnt vmcnt(0)
3356 - s_waitcnt lgkmcnt(0)
3363 - Must happen before
3368 global/local/generic
3377 acquire-fence-paired-atomic
3389 global/local/generic
3398 release-fence-paired-atomic
3399 ). This satisfies the
3403 2. buffer_wbinvl1_vol
3405 - Must happen before
3419 **Sequential Consistent Atomic**
3420 -----------------------------------------------------------------------------------
3421 load atomic seq_cst - singlethread - global *Same as corresponding
3422 - wavefront - local load atomic acquire,
3423 - generic except must generated
3424 all instructions even
3426 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3441 lgkmcnt(0) and so do
3476 instructions same as
3479 except must generated
3480 all instructions even
3482 load atomic seq_cst - workgroup - local *Same as corresponding
3483 load atomic acquire,
3484 except must generated
3485 all instructions even
3487 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3488 - system - generic vmcnt(0)
3490 - Could be split into
3499 - waitcnt lgkmcnt(0)
3512 lgkmcnt(0) and so do
3563 instructions same as
3566 except must generated
3567 all instructions even
3569 store atomic seq_cst - singlethread - global *Same as corresponding
3570 - wavefront - local store atomic release,
3571 - workgroup - generic except must generated
3572 all instructions even
3574 store atomic seq_cst - agent - global *Same as corresponding
3575 - system - generic store atomic release,
3576 except must generated
3577 all instructions even
3579 atomicrmw seq_cst - singlethread - global *Same as corresponding
3580 - wavefront - local atomicrmw acq_rel,
3581 - workgroup - generic except must generated
3582 all instructions even
3584 atomicrmw seq_cst - agent - global *Same as corresponding
3585 - system - generic atomicrmw acq_rel,
3586 except must generated
3587 all instructions even
3589 fence seq_cst - singlethread *none* *Same as corresponding
3590 - wavefront fence acq_rel,
3591 - workgroup except must generated
3592 - agent all instructions even
3593 - system for OpenCL.*
3594 ============ ============ ============== ========== ===============================
3596 The memory order also adds the single thread optimization constrains defined in
3598 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3600 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3601 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3603 ============ ==============================================================
3604 LLVM Memory Optimization Constraints
3606 ============ ==============================================================
3609 acquire - If a load atomic/atomicrmw then no following load/load
3610 atomic/store/ store atomic/atomicrmw/fence instruction can
3611 be moved before the acquire.
3612 - If a fence then same as load atomic, plus no preceding
3613 associated fence-paired-atomic can be moved after the fence.
3614 release - If a store atomic/atomicrmw then no preceding load/load
3615 atomic/store/ store atomic/atomicrmw/fence instruction can
3616 be moved after the release.
3617 - If a fence then same as store atomic, plus no following
3618 associated fence-paired-atomic can be moved before the
3620 acq_rel Same constraints as both acquire and release.
3621 seq_cst - If a load atomic then same constraints as acquire, plus no
3622 preceding sequentially consistent load atomic/store
3623 atomic/atomicrmw/fence instruction can be moved after the
3625 - If a store atomic then the same constraints as release, plus
3626 no following sequentially consistent load atomic/store
3627 atomic/atomicrmw/fence instruction can be moved before the
3629 - If an atomicrmw/fence then same constraints as acq_rel.
3630 ============ ==============================================================
3635 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3636 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3637 the ``s_trap`` instruction with the following usage:
3639 .. table:: AMDGPU Trap Handler for AMDHSA OS
3640 :name: amdgpu-trap-handler-for-amdhsa-os-table
3642 =================== =============== =============== =======================
3643 Usage Code Sequence Trap Handler Description
3645 =================== =============== =============== =======================
3646 reserved ``s_trap 0x00`` Reserved by hardware.
3647 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3648 ``queue_ptr`` ``debugtrap``
3649 ``VGPR0``: intrinsic (not
3650 ``arg`` implemented).
3651 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3652 ``queue_ptr`` terminated and its
3653 associated queue put
3654 into the error state.
3655 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3656 ``queue_ptr`` installed handled
3657 same as ``llvm.trap``.
3658 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3660 debugger ``s_trap 0x08`` Reserved for debugger.
3661 debugger ``s_trap 0xfe`` Reserved for debugger.
3662 debugger ``s_trap 0xff`` Reserved for debugger.
3663 =================== =============== =============== =======================
3668 This section provides code conventions used when the target triple OS is
3669 empty (see :ref:`amdgpu-target-triples`).
3674 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3675 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3676 instructions are handled as follows:
3678 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3679 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3681 =============== =============== ===========================================
3682 Usage Code Sequence Description
3683 =============== =============== ===========================================
3684 llvm.trap s_endpgm Causes wavefront to be terminated.
3685 llvm.debugtrap *none* Compiler warning given that there is no
3686 trap handler installed.
3687 =============== =============== ===========================================
3697 When generating code for the OpenCL language the target triple environment
3698 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3700 When the language is OpenCL the following differences occur:
3702 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3703 2. The AMDGPU backend adds additional arguments to the kernel.
3704 3. Additional metadata is generated
3705 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3708 Specify what affect this has. Hidden arguments added. Additional metadata
3716 When generating code for the OpenCL language the target triple environment
3717 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3719 When the language is OpenCL the following differences occur:
3721 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3724 Specify what affect this has.
3729 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3730 It supports AMDGCN GFX6-GFX8.
3732 This section describes general syntax for instructions and operands. For more
3733 information about instructions, their semantics and supported combinations of
3734 operands, refer to one of instruction set architecture manuals
3735 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3737 An instruction has the following syntax (register operands are normally
3738 comma-separated while extra operands are space-separated):
3740 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3745 The following syntax for register operands is supported:
3747 * SGPR registers: s0, ... or s[0], ...
3748 * VGPR registers: v0, ... or v[0], ...
3749 * TTMP registers: ttmp0, ... or ttmp[0], ...
3750 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3751 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3752 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3753 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3754 * Register index expressions: v[2*2], s[1-1:2-1]
3755 * 'off' indicates that an operand is not enabled
3757 The following extra operands are supported:
3759 * offset, offset0, offset1
3761 * glc, slc, tfe bits
3762 * waitcnt: integer or combination of counter values
3765 - abs (\| \|), neg (\-)
3769 - row_shl, row_shr, row_ror, row_rol
3770 - row_mirror, row_half_mirror, row_bcast
3771 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3772 - row_mask, bank_mask, bound_ctrl
3776 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3777 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3780 Instruction Examples
3781 ~~~~~~~~~~~~~~~~~~~~
3786 .. code-block:: nasm
3788 ds_add_u32 v2, v4 offset:16
3789 ds_write_src2_b64 v2 offset0:4 offset1:8
3790 ds_cmpst_f32 v2, v4, v6
3791 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3794 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3799 .. code-block:: nasm
3801 flat_load_dword v1, v[3:4]
3802 flat_store_dwordx3 v[3:4], v[5:7]
3803 flat_atomic_swap v1, v[3:4], v5 glc
3804 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3805 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3807 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3812 .. code-block:: nasm
3814 buffer_load_dword v1, off, s[4:7], s1
3815 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3816 buffer_store_format_xy v[1:2], off, s[4:7], s1
3818 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3820 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3825 .. code-block:: nasm
3827 s_load_dword s1, s[2:3], 0xfc
3828 s_load_dwordx8 s[8:15], s[2:3], s4
3829 s_load_dwordx16 s[88:103], s[2:3], s4
3833 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3838 .. code-block:: nasm
3841 s_mov_b64 s[0:1], 0x80000000
3843 s_wqm_b64 s[2:3], s[4:5]
3844 s_bcnt0_i32_b64 s1, s[2:3]
3845 s_swappc_b64 s[2:3], s[4:5]
3846 s_cbranch_join s[4:5]
3848 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3853 .. code-block:: nasm
3855 s_add_u32 s1, s2, s3
3856 s_and_b64 s[2:3], s[4:5], s[6:7]
3857 s_cselect_b32 s1, s2, s3
3858 s_andn2_b32 s2, s4, s6
3859 s_lshr_b64 s[2:3], s[4:5], s6
3860 s_ashr_i32 s2, s4, s6
3861 s_bfm_b64 s[2:3], s4, s6
3862 s_bfe_i64 s[2:3], s[4:5], s6
3863 s_cbranch_g_fork s[4:5], s[6:7]
3865 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3870 .. code-block:: nasm
3873 s_bitcmp1_b32 s1, s2
3874 s_bitcmp0_b64 s[2:3], s4
3877 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3882 .. code-block:: nasm
3887 s_waitcnt 0 ; Wait for all counters to be 0
3888 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3889 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3893 s_sendmsg sendmsg(MSG_INTERRUPT)
3896 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3898 Unless otherwise mentioned, little verification is performed on the operands
3899 of SOPP Instructions, so it is up to the programmer to be familiar with the
3900 range or acceptable values.
3905 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3906 the assembler will automatically use optimal encoding based on its operands.
3907 To force specific encoding, one can add a suffix to the opcode of the instruction:
3909 * _e32 for 32-bit VOP1/VOP2/VOPC
3910 * _e64 for 64-bit VOP3
3912 * _sdwa for VOP_SDWA
3914 VOP1/VOP2/VOP3/VOPC examples:
3916 .. code-block:: nasm
3919 v_mov_b32_e32 v1, v2
3921 v_cvt_f64_i32_e32 v[1:2], v2
3922 v_floor_f32_e32 v1, v2
3923 v_bfrev_b32_e32 v1, v2
3924 v_add_f32_e32 v1, v2, v3
3925 v_mul_i32_i24_e64 v1, v2, 3
3926 v_mul_i32_i24_e32 v1, -3, v3
3927 v_mul_i32_i24_e32 v1, -100, v3
3928 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3929 v_max_f16_e32 v1, v2, v3
3933 .. code-block:: nasm
3935 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3936 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3937 v_mov_b32 v0, v0 wave_shl:1
3938 v_mov_b32 v0, v0 row_mirror
3939 v_mov_b32 v0, v0 row_bcast:31
3940 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3941 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3942 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3946 .. code-block:: nasm
3948 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3949 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3950 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3951 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3952 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3954 For full list of supported instructions, refer to "Vector ALU instructions".
3956 HSA Code Object Directives
3957 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3959 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3960 one can specify them with assembler directives.
3962 .hsa_code_object_version major, minor
3963 +++++++++++++++++++++++++++++++++++++
3965 *major* and *minor* are integers that specify the version of the HSA code
3966 object that will be generated by the assembler.
3968 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3969 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3972 *major*, *minor*, and *stepping* are all integers that describe the instruction
3973 set architecture (ISA) version of the assembly program.
3975 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
3976 "AMD" and *arch* should always be equal to "AMDGPU".
3978 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3979 from the value of the -mcpu option that is passed to the assembler.
3981 .amdgpu_hsa_kernel (name)
3982 +++++++++++++++++++++++++
3984 This directives specifies that the symbol with given name is a kernel entry point
3985 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3990 This directive marks the beginning of a list of key / value pairs that are used
3991 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3992 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3993 any amd_kernel_code_t values that are unspecified a default value will be
3994 used. The default value for all keys is 0, with the following exceptions:
3996 - *kernel_code_version_major* defaults to 1.
3997 - *machine_kind* defaults to 1.
3998 - *machine_version_major*, *machine_version_minor*, and
3999 *machine_version_stepping* are derived from the value of the -mcpu option
4000 that is passed to the assembler.
4001 - *kernel_code_entry_byte_offset* defaults to 256.
4002 - *wavefront_size* defaults to 6.
4003 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4004 *private_segment_alignment* default to 4. Note that alignments are specified
4005 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4007 The *.amd_kernel_code_t* directive must be placed immediately after the
4008 function label and before any instructions.
4010 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4011 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4013 Here is an example of a minimal amd_kernel_code_t specification:
4015 .. code-block:: none
4017 .hsa_code_object_version 1,0
4018 .hsa_code_object_isa
4023 .amdgpu_hsa_kernel hello_world
4028 enable_sgpr_kernarg_segment_ptr = 1
4030 compute_pgm_rsrc1_vgprs = 0
4031 compute_pgm_rsrc1_sgprs = 0
4032 compute_pgm_rsrc2_user_sgpr = 2
4033 kernarg_segment_byte_size = 8
4034 wavefront_sgpr_count = 2
4035 workitem_vgpr_count = 3
4036 .end_amd_kernel_code_t
4038 s_load_dwordx2 s[0:1], s[0:1] 0x0
4039 v_mov_b32 v0, 3.14159
4040 s_waitcnt lgkmcnt(0)
4043 flat_store_dword v[1:2], v0
4046 .size hello_world, .Lfunc_end0-hello_world
4048 Additional Documentation
4049 ========================
4051 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4052 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4053 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4054 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4055 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4056 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4057 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4058 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4059 .. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4060 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4061 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4062 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4063 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4064 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4065 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4066 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4067 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4068 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4069 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__