1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
39 ============ ======== ========= ===========
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
46 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
49 ``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
53 ``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
58 ``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
62 ``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
66 ``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
71 .. _amdgpu-processors:
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
86 ========== =========== ============ ===== ======= ==================
87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88 --------------------------------------------------------------------
93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94 --------------------------------------------------------------------
98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99 --------------------------------------------------------------------
105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106 --------------------------------------------------------------------
111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112 --------------------------------------------------------------------
113 gfx600 - tahiti amdgcn dGPU
114 gfx601 - pitcairn amdgcn dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
124 \ - kaveri amdgcn APU - A6-7000
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
138 gfx702 dGPU ROCm - Radeon R9 290
142 gfx703 - kabini amdgcn APU - E1-2100
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
161 gfx801 - carrizo amdgcn APU - A6-8500P
167 \ amdgcn APU ROCm - A10-8700P
170 \ amdgcn APU - A10-9600P
176 \ amdgcn APU - E2-9010
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
185 - Radeon Instinct MI8
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
188 - Radeon Instinct MI6
189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
192 **GCN GFX9** [AMD-GCN-GFX9]_
193 --------------------------------------------------------------------
194 gfx900 amdgcn dGPU - Radeon Vega
200 - Radeon Instinct MI25
201 gfx901 amdgcn dGPU ROCm Same as gfx900
204 gfx902 amdgcn APU *TBA*
209 gfx903 amdgcn APU Same as gfx902
212 ========== =========== ============ ===== ======= ==================
214 .. _amdgpu-address-spaces:
219 The AMDGPU backend uses the following address space mappings.
221 The memory space names used in the table, aside from the region memory space, is
222 from the OpenCL standard.
224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
255 This will shortly be the only address space mapping for all languages using
258 .. _amdgpu-memory-scopes:
263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
264 backend memory model when the target triple OS is ``amdhsa`` (see
265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
267 The memory model supported is based on the HSA memory model [HSA]_ which is
268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269 relation is transitive over the synchonizes-with relation independent of scope,
270 and synchonizes-with allows the memory scope instances to be inclusive (see
271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274 inclusion and requires the memory scopes to exactly match. However, this
275 is conservatively correct for OpenCL.
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
295 - ``wavefront`` and executed by a thread in the same
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
304 - ``system`` or ``agent`` and executed by a thread on the
306 - ``workgroup`` and executed by a thread in the same
308 - ``wavefront`` and executed by a thread in the same
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
340 The AMDGPU backend implements the following intrinsics.
342 *This section is WIP.*
345 List AMDGPU intrinsics
350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351 can be linked by ``lld`` to produce a standard ELF shared code object which can
352 be loaded and executed on an AMDGPU target.
357 The AMDGPU backend uses the following ELF header:
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
362 ========================== ===============================
364 ========================== ===============================
365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
368 ``ELFOSABI_AMDGPU_PAL`` or
369 ``ELFOSABI_AMDGPU_MESA3D``
370 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
371 ``ELFABIVERSION_AMDGPU_PAL`` or
372 ``ELFABIVERSION_AMDGPU_MESA3D``
373 ``e_type`` ``ET_REL`` or ``ET_DYN``
374 ``e_machine`` ``EM_AMDGPU``
377 ========================== ===============================
381 .. table:: AMDGPU ELF Header Enumeration Values
382 :name: amdgpu-elf-header-enumeration-values-table
384 =============================== =====
386 =============================== =====
388 ``ELFOSABI_AMDGPU_HSA`` 64
389 ``ELFOSABI_AMDGPU_PAL`` 65
390 ``ELFOSABI_AMDGPU_MESA3D`` 66
391 ``ELFABIVERSION_AMDGPU_HSA`` 1
392 ``ELFABIVERSION_AMDGPU_PAL`` 0
393 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394 =============================== =====
396 ``e_ident[EI_CLASS]``
397 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
401 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
403 ``e_ident[EI_OSABI]``
404 One of the following AMD GPU architecture specific OS ABIs:
406 * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407 the AMD HSA runtime ABI [HSA]_.
409 * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410 the AMD PAL runtime ABI.
412 * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413 to the AMD MESA runtime ABI.
415 ``e_ident[EI_ABIVERSION]``
416 The ABI version of the AMD GPU architecture specific OS ABI to which the code
419 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
422 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
425 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
429 Can be one of the following values:
433 The type produced by the AMD GPU backend compiler as it is relocatable code
437 The type produced by the linker as it is a shared code object.
439 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
442 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443 architecture family. The specific member is specified in the
444 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445 :ref:`amdgpu-note-records`).
448 The entry point is 0 as the entry points for individual kernels must be
449 selected in order to invoke them through AQL packets.
452 The value is 0 as no flags are used.
457 An AMDGPU target ELF code object has the standard ELF sections which include:
459 .. table:: AMDGPU ELF Sections
460 :name: amdgpu-elf-sections-table
462 ================== ================ =================================
464 ================== ================ =================================
465 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
466 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
468 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
469 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
470 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
471 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
473 ``.note`` ``SHT_NOTE`` *none*
474 ``.rela``\ *name* ``SHT_RELA`` *none*
475 ``.rela.dyn`` ``SHT_RELA`` *none*
476 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
477 ``.shstrtab`` ``SHT_STRTAB`` *none*
478 ``.strtab`` ``SHT_STRTAB`` *none*
479 ``.symtab`` ``SHT_SYMTAB`` *none*
480 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481 ================== ================ =================================
483 These sections have their standard meanings (see [ELF]_) and are only generated
487 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488 DWARF produced by the AMDGPU backend.
490 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
491 The standard sections used by a dynamic loader.
494 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
497 ``.rela``\ *name*, ``.rela.dyn``
498 For relocatable code objects, *name* is the name of the section that the
499 relocation records apply. For example, ``.rela.text`` is the section name for
500 relocation records associated with the ``.text`` section.
502 For linked shared code objects, ``.rela.dyn`` contains all the relocation
503 records from each of the relocatable code object's ``.rela``\ *name* sections.
505 See :ref:`amdgpu-relocation-records` for the relocation records supported by
509 The executable machine code for the kernels and functions they call. Generated
510 as position independent code. See :ref:`amdgpu-code-conventions` for
511 information on conventions used in the isa generation.
513 .. _amdgpu-note-records:
518 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520 minimal zero byte padding must be generated to ensure the ``desc`` field size is
521 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522 be at least 4 to indicate at least 8 byte alignment.
524 The AMDGPU backend code object uses the following ELF note records in the
525 ``.note`` section. The *Description* column specifies the layout of the note
526 record’s ``desc`` field. All fields are consecutive bytes. Note records with
527 variable size strings have a corresponding ``*_size`` field that specifies the
528 number of bytes, including the terminating null character, in the string. The
529 string(s) come immediately after the preceding fields.
531 Additional note records can be present.
533 .. table:: AMDGPU ELF Note Records
534 :name: amdgpu-elf-note-records-table
536 ===== ============================== ======================================
537 Name Type Description
538 ===== ============================== ======================================
539 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
541 ===== ============================== ======================================
545 .. table:: AMDGPU ELF Note Record Enumeration Values
546 :name: amdgpu-elf-note-record-enumeration-values-table
548 ============================== =====
550 ============================== =====
552 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
553 ``NT_AMD_AMDGPU_ISA`` 11
554 ============================== =====
556 ``NT_AMD_AMDGPU_ISA``
557 Specifies the instruction set architecture used by the machine code contained
560 This note record is required for code objects containing machine code for
561 processors matching the ``amdgcn`` architecture in table
562 :ref:`amdgpu-processors`.
564 The null terminated string has the following syntax:
566 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
571 The architecture from table :ref:`amdgpu-target-triples-table`.
573 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574 :ref:`amdgpu-target-triples`).
577 The vendor from table :ref:`amdgpu-target-triples-table`.
579 For the AMDGPU backend this is always ``amd``.
582 The OS from table :ref:`amdgpu-target-triples-table`.
585 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586 the environment has no affect on the execution of the code object.
588 For the AMDGPU backend this is currently always blank.
590 The processor from table :ref:`amdgpu-processors-table`.
594 ``amdgcn-amd-amdhsa--gfx901``
596 ``NT_AMD_AMDGPU_HSA_METADATA``
597 Specifies extensible metadata associated with the code objects executed on HSA
598 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601 object metadata string.
608 Symbols include the following:
610 .. table:: AMDGPU ELF Symbols
611 :name: amdgpu-elf-symbols-table
613 ===================== ============== ============= ==================
614 Name Type Section Description
615 ===================== ============== ============= ==================
616 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
619 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
621 ===================== ============== ============= ==================
624 Global variables both used and defined by the compilation unit.
626 If the symbol is defined in the compilation unit then it is allocated in the
627 appropriate section according to if it has initialized data or is readonly.
629 If the symbol is external then its section is ``STN_UNDEF`` and the loader
630 will resolve relocations using the definition provided by another code object
631 or explicitly defined by the runtime.
633 All global symbols, whether defined in the compilation unit or external, are
634 accessed by the machine code indirectly through a GOT table entry. This
635 allows them to be preemptable. The GOT table is only supported when the target
636 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
639 Add description of linked shared object symbols. Seems undefined symbols
640 are marked as STT_NOTYPE.
643 Every HSA kernel has an associated kernel descriptor. It is the address of the
644 kernel descriptor that is used in the AQL dispatch packet used to invoke the
645 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
649 Every HSA kernel also has a symbol for its machine code entry point.
651 .. _amdgpu-relocation-records:
656 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657 relocatable fields are:
660 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661 alignment. These values use the same byte order as other word values in the
662 AMD GPU architecture.
665 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666 alignment. These values use the same byte order as other word values in the
667 AMD GPU architecture.
669 Following notations are used for specifying relocation calculations:
672 Represents the addend used to compute the value of the relocatable field.
675 Represents the offset into the global offset table at which the relocation
676 entry’s symbol will reside during execution.
679 Represents the address of the global offset table.
682 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683 of the storage unit being relocated (computed using ``r_offset``).
686 Represents the value of the symbol whose index resides in the relocation
687 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
690 Represents the base address of a loaded executable or shared object which is
691 the difference between the ELF address and the actual load address. Relocations
692 using this are only valid in executable or shared objects.
694 The following relocation types are supported:
696 .. table:: AMDGPU ELF Relocation Records
697 :name: amdgpu-elf-relocation-records-table
699 ========================== ===== ========== ==============================
700 Relocation Type Value Field Calculation
701 ========================== ===== ========== ==============================
702 ``R_AMDGPU_NONE`` 0 *none* *none*
703 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
704 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
705 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
706 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
707 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
708 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
709 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
710 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
711 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
712 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
713 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
715 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
716 ========================== ===== ========== ==============================
723 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
724 information that maps the code object executable code and data to the source
725 language constructs. It can be used by tools such as debuggers and profilers.
727 Address Space Mapping
728 ~~~~~~~~~~~~~~~~~~~~~
730 The following address space mapping is used:
732 .. table:: AMDGPU DWARF Address Space Mapping
733 :name: amdgpu-dwarf-address-space-mapping-table
735 =================== =================
736 DWARF Address Space Memory Space
737 =================== =================
742 *omitted* Generic (Flat)
743 *not supported* Region (GDS)
744 =================== =================
746 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
749 An ``address_class`` attribute is generated on pointer type DIEs to specify the
750 DWARF address space of the value of the pointer when it is in the *private* or
751 *local* address space. Otherwise the attribute is omitted.
753 An ``XDEREF`` operation is generated in location list expressions for variables
754 that are allocated in the *private* and *local* address space. Otherwise no
755 ``XDREF`` is omitted.
760 *This section is WIP.*
763 Define DWARF register enumeration.
765 If want to present a wavefront state then should expose vector registers as
766 64 wide (rather than per work-item view that LLVM uses). Either as separate
767 registers, or a 64x4 byte single register. In either case use a new LANE op
768 (akin to XDREF) to select the current lane usage in a location
769 expression. This would also allow scalar register spilling to vector register
770 lanes to be expressed (currently no debug information is being generated for
771 spilling). If choose a wide single register approach then use LANE in
772 conjunction with PIECE operation to select the dword part of the register for
773 the current lane. If the separate register approach then use LANE to select
779 *This section is WIP.*
782 DWARF extension to include runtime generated source text.
784 .. _amdgpu-code-conventions:
789 This section provides code conventions used for each supported target triple OS
790 (see :ref:`amdgpu-target-triples`).
795 This section provides code conventions used when the target triple OS is
796 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
798 .. _amdgpu-amdhsa-hsa-code-object-metadata:
803 The code object metadata specifies extensible metadata associated with the code
804 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
805 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
806 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
807 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
808 information necessary to support the ROCM kernel queries. For example, the
809 segment sizes needed in a dispatch packet. In addition, a high level language
810 runtime may require other information to be included. For example, the AMD
811 OpenCL runtime records kernel argument information.
813 The metadata is specified as a YAML formatted string (see [YAML]_ and
817 Is the string null terminated? It probably should not if YAML allows it to
818 contain null characters, otherwise it should be.
820 The metadata is represented as a single YAML document comprised of the mapping
821 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
824 For boolean values, the string values of ``false`` and ``true`` are used for
825 false and true respectively.
827 Additional information can be added to the mappings. To avoid conflicts, any
828 non-AMD key names should be prefixed by "*vendor-name*.".
830 .. table:: AMDHSA Code Object Metadata Mapping
831 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
833 ========== ============== ========= =======================================
834 String Key Value Type Required? Description
835 ========== ============== ========= =======================================
836 "Version" sequence of Required - The first integer is the major
837 2 integers version. Currently 1.
838 - The second integer is the minor
839 version. Currently 0.
840 "Printf" sequence of Each string is encoded information
841 strings about a printf function call. The
842 encoded information is organized as
843 fields separated by colon (':'):
845 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
850 A 32 bit integer as a unique id for
851 each printf function call
854 A 32 bit integer equal to the number
855 of arguments of printf function call
858 ``S[i]`` (where i = 0, 1, ... , N-1)
859 32 bit integers for the size in bytes
860 of the i-th FormatString argument of
861 the printf function call
864 The format string passed to the
865 printf function call.
866 "Kernels" sequence of Required Sequence of the mappings for each
867 mapping kernel in the code object. See
868 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
869 for the definition of the mapping.
870 ========== ============== ========= =======================================
874 .. table:: AMDHSA Code Object Kernel Metadata Mapping
875 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
877 ================= ============== ========= ================================
878 String Key Value Type Required? Description
879 ================= ============== ========= ================================
880 "Name" string Required Source name of the kernel.
881 "SymbolName" string Required Name of the kernel
882 descriptor ELF symbol.
883 "Language" string Source language of the kernel.
891 "LanguageVersion" sequence of - The first integer is the major
893 - The second integer is the
895 "Attrs" mapping Mapping of kernel attributes.
897 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
898 for the mapping definition.
899 "Args" sequence of Sequence of mappings of the
900 mapping kernel arguments. See
901 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
902 for the definition of the mapping.
903 "CodeProps" mapping Mapping of properties related to
905 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
906 for the mapping definition.
907 "DebugProps" mapping Mapping of properties related to
908 the kernel debugging. See
909 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
910 for the mapping definition.
911 ================= ============== ========= ================================
915 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
918 =================== ============== ========= ==============================
919 String Key Value Type Required? Description
920 =================== ============== ========= ==============================
921 "ReqdWorkGroupSize" sequence of The dispatch work-group size
922 3 integers X, Y, Z must correspond to the
925 Corresponds to the OpenCL
926 ``reqd_work_group_size``
928 "WorkGroupSizeHint" sequence of The dispatch work-group size
929 3 integers X, Y, Z is likely to be the
932 Corresponds to the OpenCL
933 ``work_group_size_hint``
935 "VecTypeHint" string The name of a scalar or vector
938 Corresponds to the OpenCL
939 ``vec_type_hint`` attribute.
941 "RuntimeHandle" string The external symbol name
942 associated with a kernel.
943 OpenCL runtime allocates a
944 global buffer for the symbol
945 and saves the kernel's address
946 to it, which is used for
947 device side enqueueing. Only
948 available for device side
950 =================== ============== ========= ==============================
954 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
955 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
957 ================= ============== ========= ================================
958 String Key Value Type Required? Description
959 ================= ============== ========= ================================
960 "Name" string Kernel argument name.
961 "TypeName" string Kernel argument type name.
962 "Size" integer Required Kernel argument size in bytes.
963 "Align" integer Required Kernel argument alignment in
964 bytes. Must be a power of two.
965 "ValueKind" string Required Kernel argument kind that
966 specifies how to set up the
967 corresponding argument.
971 The argument is copied
972 directly into the kernarg.
975 A global address space pointer
976 to the buffer data is passed
979 "DynamicSharedPointer"
980 A group address space pointer
981 to dynamically allocated LDS
982 is passed in the kernarg.
985 A global address space
986 pointer to a S# is passed in
990 A global address space
991 pointer to a T# is passed in
995 A global address space pointer
996 to an OpenCL pipe is passed in
1000 A global address space pointer
1001 to an OpenCL device enqueue
1002 queue is passed in the
1005 "HiddenGlobalOffsetX"
1006 The OpenCL grid dispatch
1007 global offset for the X
1008 dimension is passed in the
1011 "HiddenGlobalOffsetY"
1012 The OpenCL grid dispatch
1013 global offset for the Y
1014 dimension is passed in the
1017 "HiddenGlobalOffsetZ"
1018 The OpenCL grid dispatch
1019 global offset for the Z
1020 dimension is passed in the
1024 An argument that is not used
1025 by the kernel. Space needs to
1026 be left for it, but it does
1027 not need to be set up.
1029 "HiddenPrintfBuffer"
1030 A global address space pointer
1031 to the runtime printf buffer
1032 is passed in kernarg.
1034 "HiddenDefaultQueue"
1035 A global address space pointer
1036 to the OpenCL device enqueue
1037 queue that should be used by
1038 the kernel by default is
1039 passed in the kernarg.
1041 "HiddenCompletionAction"
1047 "ValueType" string Required Kernel argument value type. Only
1048 present if "ValueKind" is
1049 "ByValue". For vector data
1050 types, the value is for the
1051 element type. Values include:
1067 How can it be determined if a
1068 vector type, and what size
1070 "PointeeAlign" integer Alignment in bytes of pointee
1071 type for pointer type kernel
1072 argument. Must be a power
1073 of 2. Only present if
1075 "DynamicSharedPointer".
1076 "AddrSpaceQual" string Kernel argument address space
1077 qualifier. Only present if
1078 "ValueKind" is "GlobalBuffer" or
1079 "DynamicSharedPointer". Values
1090 Is GlobalBuffer only Global
1092 DynamicSharedPointer always
1093 Local? Can HCC allow Generic?
1094 How can Private or Region
1096 "AccQual" string Kernel argument access
1097 qualifier. Only present if
1098 "ValueKind" is "Image" or
1109 "ActualAccQual" string The actual memory accesses
1110 performed by the kernel on the
1111 kernel argument. Only present if
1112 "ValueKind" is "GlobalBuffer",
1113 "Image", or "Pipe". This may be
1114 more restrictive than indicated
1115 by "AccQual" to reflect what the
1116 kernel actual does. If not
1117 present then the runtime must
1118 assume what is implied by
1119 "AccQual" and "IsConst". Values
1126 "IsConst" boolean Indicates if the kernel argument
1127 is const qualified. Only present
1131 "IsRestrict" boolean Indicates if the kernel argument
1132 is restrict qualified. Only
1133 present if "ValueKind" is
1136 "IsVolatile" boolean Indicates if the kernel argument
1137 is volatile qualified. Only
1138 present if "ValueKind" is
1141 "IsPipe" boolean Indicates if the kernel argument
1142 is pipe qualified. Only present
1143 if "ValueKind" is "Pipe".
1146 Can GlobalBuffer be pipe
1148 ================= ============== ========= ================================
1152 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1153 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1155 ============================ ============== ========= =====================
1156 String Key Value Type Required? Description
1157 ============================ ============== ========= =====================
1158 "KernargSegmentSize" integer Required The size in bytes of
1160 that holds the values
1163 "GroupSegmentFixedSize" integer Required The amount of group
1167 bytes. This does not
1169 dynamically allocated
1170 group segment memory
1174 "PrivateSegmentFixedSize" integer Required The amount of fixed
1175 private address space
1176 memory required for a
1180 is 1 then additional
1182 to this value for the
1184 "KernargSegmentAlign" integer Required The maximum byte
1187 kernarg segment. Must
1189 "WavefrontSize" integer Required Wavefront size. Must
1191 "NumSGPRs" integer Number of scalar
1195 includes the special
1201 SGPR added if a trap
1207 "NumVGPRs" integer Number of vector
1211 "MaxFlatWorkGroupSize" integer Maximum flat
1214 kernel in work-items.
1215 "IsDynamicCallStack" boolean Indicates if the
1220 "IsXNACKEnabled" boolean Indicates if the
1224 ============================ ============== ========= =====================
1228 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1231 =================================== ============== ========= ==============
1232 String Key Value Type Required? Description
1233 =================================== ============== ========= ==============
1234 "DebuggerABIVersion" sequence of
1236 "ReservedNumVGPRs" integer
1237 "ReservedFirstVGPR" integer
1238 "PrivateSegmentBufferSGPR" integer
1239 "WavefrontPrivateSegmentOffsetSGPR" integer
1240 =================================== ============== ========= ==============
1243 Plan to remove the debug properties metadata.
1248 The HSA architected queuing language (AQL) defines a user space memory interface
1249 that can be used to control the dispatch of kernels, in an agent independent
1250 way. An agent can have zero or more AQL queues created for it using the ROCm
1251 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1252 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1253 mechanics and packet layouts.
1255 The packet processor of a kernel agent is responsible for detecting and
1256 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1257 packet processor is implemented by the hardware command processor (CP),
1258 asynchronous dispatch controller (ADC) and shader processor input controller
1261 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1262 mode driver to initialize and register the AQL queue with CP.
1264 To dispatch a kernel the following actions are performed. This can occur in the
1265 CPU host program, or from an HSA kernel executing on a GPU.
1267 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1268 executed is obtained.
1269 2. A pointer to the kernel descriptor (see
1270 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1271 obtained. It must be for a kernel that is contained in a code object that that
1272 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1274 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1275 for a memory region with the kernarg property for the kernel agent that will
1276 execute the kernel. It must be at least 16 byte aligned.
1277 4. Kernel argument values are assigned to the kernel argument memory
1278 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1279 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1280 memory in the same way constant memory is accessed. (Note that the HSA
1281 specification allows an implementation to copy the kernel argument contents to
1282 another location that is accessed by the kernel.)
1283 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1284 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1285 packet. The packet must be set up, and the final write must use an atomic
1286 store release to set the packet kind to ensure the packet contents are
1287 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1288 notify the kernel agent that the AQL queue has been updated. These rules, and
1289 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1290 System Architecture Specification* [HSA]_.
1291 6. A kernel dispatch packet includes information about the actual dispatch,
1292 such as grid and work-group size, together with information from the code
1293 object about the kernel, such as segment sizes. The ROCm runtime queries on
1294 the kernel symbol can be used to obtain the code object values which are
1295 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1296 7. CP executes micro-code and is responsible for detecting and setting up the
1297 GPU to execute the wavefronts of a kernel dispatch.
1298 8. CP ensures that when the a wavefront starts executing the kernel machine
1299 code, the scalar general purpose registers (SGPR) and vector general purpose
1300 registers (VGPR) are set up as required by the machine code. The required
1301 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1302 register state is defined in
1303 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1304 9. The prolog of the kernel machine code (see
1305 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1306 before continuing executing the machine code that corresponds to the kernel.
1307 10. When the kernel dispatch has completed execution, CP signals the completion
1308 signal specified in the kernel dispatch packet if not 0.
1310 .. _amdgpu-amdhsa-memory-spaces:
1315 The memory space properties are:
1317 .. table:: AMDHSA Memory Spaces
1318 :name: amdgpu-amdhsa-memory-spaces-table
1320 ================= =========== ======== ======= ==================
1321 Memory Space Name HSA Segment Hardware Address NULL Value
1323 ================= =========== ======== ======= ==================
1324 Private private scratch 32 0x00000000
1325 Local group LDS 32 0xFFFFFFFF
1326 Global global global 64 0x0000000000000000
1327 Constant constant *same as 64 0x0000000000000000
1329 Generic flat flat 64 0x0000000000000000
1330 Region N/A GDS 32 *not implemented
1332 ================= =========== ======== ======= ==================
1334 The global and constant memory spaces both use global virtual addresses, which
1335 are the same virtual address space used by the CPU. However, some virtual
1336 addresses may only be accessible to the CPU, some only accessible by the GPU,
1339 Using the constant memory space indicates that the data will not change during
1340 the execution of the kernel. This allows scalar read instructions to be
1341 used. The vector and scalar L1 caches are invalidated of volatile data before
1342 each kernel dispatch execution to allow constant memory to change values between
1345 The local memory space uses the hardware Local Data Store (LDS) which is
1346 automatically allocated when the hardware creates work-groups of wavefronts, and
1347 freed when all the wavefronts of a work-group have terminated. The data store
1348 (DS) instructions can be used to access it.
1350 The private memory space uses the hardware scratch memory support. If the kernel
1351 uses scratch, then the hardware allocates memory that is accessed using
1352 wavefront lane dword (4 byte) interleaving. The mapping used from private
1353 address to physical address is:
1355 ``wavefront-scratch-base +
1356 (private-address * wavefront-size * 4) +
1357 (wavefront-lane-id * 4)``
1359 There are different ways that the wavefront scratch base address is determined
1360 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1361 memory can be accessed in an interleaved manner using buffer instruction with
1362 the scratch buffer descriptor and per wave scratch offset, by the scratch
1363 instructions, or by flat instructions. If each lane of a wavefront accesses the
1364 same private address, the interleaving results in adjacent dwords being accessed
1365 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1366 supported except by flat and scratch instructions in GFX9.
1368 The generic address space uses the hardware flat address support available in
1369 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1370 local appertures), that are outside the range of addressible global memory, to
1371 map from a flat address to a private or local address.
1373 FLAT instructions can take a flat address and access global, private (scratch)
1374 and group (LDS) memory depending in if the address is within one of the
1375 apperture ranges. Flat access to scratch requires hardware aperture setup and
1376 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1377 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1378 (see :ref:`amdgpu-amdhsa-m0`).
1380 To convert between a segment address and a flat address the base address of the
1381 appertures address can be used. For GFX7-GFX8 these are available in the
1382 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1383 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1384 GFX9 the appature base addresses are directly available as inline constant
1385 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1386 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1387 which makes it easier to convert from flat to segment or segment to flat.
1392 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1393 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1394 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1395 enumeration values for the queries that are not trivially deducible from the S#
1401 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1402 structure allocated in memory accessible from both the CPU and GPU. The
1403 structure is defined by the ROCm runtime and subject to change between releases
1404 (see [AMD-ROCm-github]_).
1406 .. _amdgpu-amdhsa-hsa-aql-queue:
1411 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1412 between releases (see [AMD-ROCm-github]_). For some processors it contains
1413 fields needed to implement certain language features such as the flat address
1414 aperture bases. It also contains fields used by CP such as managing the
1415 allocation of scratch memory.
1417 .. _amdgpu-amdhsa-kernel-descriptor:
1422 A kernel descriptor consists of the information needed by CP to initiate the
1423 execution of a kernel, including the entry point address of the machine code
1424 that implements the kernel.
1426 Kernel Descriptor for GFX6-GFX9
1427 +++++++++++++++++++++++++++++++
1429 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1431 .. table:: Kernel Descriptor for GFX6-GFX9
1432 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1434 ======= ======= =============================== ===========================
1435 Bits Size Field Name Description
1436 ======= ======= =============================== ===========================
1437 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
1438 address space memory
1439 required for a work-group
1440 in bytes. This does not
1441 include any dynamically
1442 allocated local address
1443 space memory that may be
1444 added when the kernel is
1446 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
1447 private address space
1448 memory required for a
1449 work-item in bytes. If
1450 is_dynamic_callstack is 1
1451 then additional space must
1452 be added to this value for
1454 95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1455 size supported by the
1456 kernel in work-items.
1457 96 1 bit IsDynamicCallStack Indicates if the generated
1458 machine code is using a
1459 dynamically sized call
1461 97 1 bit IsXNACKEnabled Indicates if the generated
1462 machine code is capable of
1464 127:98 30 bits Reserved. Must be 0.
1465 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
1468 descriptor to kernel's
1469 entry point instruction
1470 which must be 256 byte
1472 383:192 24 Reserved. Must be 0.
1474 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
1475 program settings used by
1477 ``COMPUTE_PGM_RSRC1``
1480 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1481 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
1482 program settings used by
1484 ``COMPUTE_PGM_RSRC2``
1487 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1488 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1489 SGPR user data registers
1491 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1493 The total number of SGPR
1495 requested must not exceed
1496 16 and match value in
1497 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1498 Any requests beyond 16
1500 449 1 bit EnableSGPRDispatchPtr *see above*
1501 450 1 bit EnableSGPRQueuePtr *see above*
1502 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1503 452 1 bit EnableSGPRDispatchID *see above*
1504 453 1 bit EnableSGPRFlatScratchInit *see above*
1505 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1506 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1508 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1510 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1512 463:458 6 bits Reserved. Must be 0.
1513 511:464 6 Reserved. Must be 0.
1515 512 **Total size 64 bytes.**
1516 ======= ===================================================================
1520 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1521 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1523 ======= ======= =============================== ===========================================================================
1524 Bits Size Field Name Description
1525 ======= ======= =============================== ===========================================================================
1526 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
1527 used by each work-item,
1528 granularity is device
1532 roundup((max-vgpg + 1)
1535 Used by CP to set up
1536 ``COMPUTE_PGM_RSRC1.VGPRS``.
1537 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1538 used by a wavefront,
1539 granularity is device
1543 roundup((max-sgpg + 1)
1546 roundup((max-sgpg + 1)
1549 Includes the special SGPRs
1550 for VCC, Flat Scratch (for
1551 GFX7 onwards) and XNACK
1552 (for GFX8 onwards). It does
1553 not include the 16 SGPR
1554 added if a trap handler is
1557 Used by CP to set up
1558 ``COMPUTE_PGM_RSRC1.SGPRS``.
1559 11:10 2 bits PRIORITY Must be 0.
1561 Start executing wavefront
1562 at the specified priority.
1564 CP is responsible for
1566 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1567 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1568 with specified rounding
1571 precision floating point
1574 Floating point rounding
1575 mode values are defined in
1576 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1578 Used by CP to set up
1579 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1580 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1581 with specified rounding
1582 denorm mode for half/double (16
1583 and 64 bit) floating point
1584 precision floating point
1587 Floating point rounding
1588 mode values are defined in
1589 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1591 Used by CP to set up
1592 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1593 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1594 with specified denorm mode
1597 precision floating point
1600 Floating point denorm mode
1601 values are defined in
1602 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1604 Used by CP to set up
1605 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1606 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1607 with specified denorm mode
1609 and 64 bit) floating point
1610 precision floating point
1613 Floating point denorm mode
1614 values are defined in
1615 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1617 Used by CP to set up
1618 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1619 20 1 bit PRIV Must be 0.
1621 Start executing wavefront
1622 in privilege trap handler
1625 CP is responsible for
1627 ``COMPUTE_PGM_RSRC1.PRIV``.
1628 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1629 with DX10 clamp mode
1630 enabled. Used by the vector
1631 ALU to force DX-10 style
1632 treatment of NaN's (when
1633 set, clamp NaN to zero,
1637 Used by CP to set up
1638 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1639 22 1 bit DEBUG_MODE Must be 0.
1641 Start executing wavefront
1642 in single step mode.
1644 CP is responsible for
1646 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1647 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1649 enabled. Floating point
1650 opcodes that support
1651 exception flag gathering
1652 will quiet and propagate
1653 signaling-NaN inputs per
1654 IEEE 754-2008. Min_dx10 and
1655 max_dx10 become IEEE
1656 754-2008 compliant due to
1657 signaling-NaN propagation
1660 Used by CP to set up
1661 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1662 24 1 bit BULKY Must be 0.
1664 Only one work-group allowed
1665 to execute on a compute
1668 CP is responsible for
1670 ``COMPUTE_PGM_RSRC1.BULKY``.
1671 25 1 bit CDBG_USER Must be 0.
1673 Flag that can be used to
1674 control debugging code.
1676 CP is responsible for
1678 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1679 26 1 bit FP16_OVFL GFX6-8:
1680 Reserved. Must be 0.
1683 execution with specified
1690 overflow that is the
1692 input value or divide
1696 overflow to +/-MAX_FP16
1699 Used by CP to set up
1700 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1701 31:27 5 bits Reserved. Must be 0.
1702 32 **Total size 4 bytes**
1703 ======= ===================================================================================================================
1707 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1708 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1710 ======= ======= =============================== ===========================================================================
1711 Bits Size Field Name Description
1712 ======= ======= =============================== ===========================================================================
1713 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1714 _WAVE_OFFSET SGPR wave scratch offset
1715 system register (see
1716 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1718 Used by CP to set up
1719 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1720 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1722 requested. This number must
1723 match the number of user
1724 data registers enabled.
1726 Used by CP to set up
1727 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1728 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
1729 TRAP instruction which
1730 requires a trap handler to
1734 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1736 installed a trap handler
1737 regardless of the setting
1739 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1740 system SGPR register for
1741 the work-group id in the X
1743 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1745 Used by CP to set up
1746 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1747 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1748 system SGPR register for
1749 the work-group id in the Y
1751 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1753 Used by CP to set up
1754 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1755 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1756 system SGPR register for
1757 the work-group id in the Z
1759 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1761 Used by CP to set up
1762 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1763 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1764 system SGPR register for
1765 work-group information (see
1766 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1768 Used by CP to set up
1769 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1770 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1771 VGPR system registers used
1772 for the work-item ID.
1773 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1776 Used by CP to set up
1777 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1778 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1780 Wavefront starts execution
1782 exceptions enabled which
1783 are generated when L1 has
1784 witnessed a thread access
1788 CP is responsible for
1789 filling in the address
1791 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1792 according to what the
1794 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1796 Wavefront starts execution
1797 with memory violation
1798 exceptions exceptions
1799 enabled which are generated
1800 when a memory violation has
1801 occurred for this wave from
1803 (write-to-read-only-memory,
1804 mis-aligned atomic, LDS
1805 address out of range,
1806 illegal address, etc.).
1810 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1811 according to what the
1813 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
1815 CP uses the rounded value
1816 from the dispatch packet,
1817 not this value, as the
1818 dispatch may contain
1819 dynamically allocated group
1820 segment memory. CP writes
1822 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1824 Amount of group segment
1825 (LDS) to allocate for each
1826 work-group. Granularity is
1830 roundup(lds-size / (64 * 4))
1832 roundup(lds-size / (128 * 4))
1834 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1835 _INVALID_OPERATION with specified exceptions
1838 Used by CP to set up
1839 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1840 (set from bits 0..6).
1844 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1845 _SOURCE input operands is a
1847 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1848 _DIVISION_BY_ZERO Zero
1849 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1851 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1853 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1855 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1856 _ZERO (rcp_iflag_f32 instruction
1858 31 1 bit Reserved. Must be 0.
1859 32 **Total size 4 bytes.**
1860 ======= ===================================================================================================================
1864 .. table:: Floating Point Rounding Mode Enumeration Values
1865 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1867 ====================================== ===== ==============================
1868 Enumeration Name Value Description
1869 ====================================== ===== ==============================
1870 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1871 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1872 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1873 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1874 ====================================== ===== ==============================
1878 .. table:: Floating Point Denorm Mode Enumeration Values
1879 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1881 ====================================== ===== ==============================
1882 Enumeration Name Value Description
1883 ====================================== ===== ==============================
1884 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1886 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1887 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1888 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1889 ====================================== ===== ==============================
1893 .. table:: System VGPR Work-Item ID Enumeration Values
1894 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1896 ======================================== ===== ============================
1897 Enumeration Name Value Description
1898 ======================================== ===== ============================
1899 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1901 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1903 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1905 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1906 ======================================== ===== ============================
1908 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1910 Initial Kernel Execution State
1911 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1913 This section defines the register state that will be set up by the packet
1914 processor prior to the start of execution of every wavefront. This is limited by
1915 the constraints of the hardware controllers of CP/ADC/SPI.
1917 The order of the SGPR registers is defined, but the compiler can specify which
1918 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1919 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1920 for enabled registers are dense starting at SGPR0: the first enabled register is
1921 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1924 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1925 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1926 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1927 initialized. These are then immediately followed by the System SGPRs that are
1928 set up by ADC/SPI and can have different values for each wave of the grid
1931 SGPR register initial state is defined in
1932 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1934 .. table:: SGPR Register Set Up Order
1935 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1937 ========== ========================== ====== ==============================
1938 SGPR Order Name Number Description
1939 (kernel descriptor enable of
1941 ========== ========================== ====== ==============================
1942 First Private Segment Buffer 4 V# that can be used, together
1943 (enable_sgpr_private with Scratch Wave Offset as an
1944 _segment_buffer) offset, to access the private
1945 memory space using a segment
1948 CP uses the value provided by
1950 then Dispatch Ptr 2 64 bit address of AQL dispatch
1951 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1953 then Queue Ptr 2 64 bit address of amd_queue_t
1954 (enable_sgpr_queue_ptr) object for AQL queue on which
1955 the dispatch packet was
1957 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1958 (enable_sgpr_kernarg segment. This is directly
1959 _segment_ptr) copied from the
1960 kernarg_address in the kernel
1963 Having CP load it once avoids
1964 loading it at the beginning of
1966 then Dispatch Id 2 64 bit Dispatch ID of the
1967 (enable_sgpr_dispatch_id) dispatch packet being
1969 then Flat Scratch Init 2 This is 2 SGPRs:
1970 (enable_sgpr_flat_scratch
1974 The first SGPR is a 32 bit
1976 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1977 to per SPI base of memory
1978 for scratch for the queue
1979 executing the kernel
1980 dispatch. CP obtains this
1981 from the runtime. (The
1982 Scratch Segment Buffer base
1984 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1985 plus this offset.) The value
1986 of Scratch Wave Offset must
1987 be added to this offset by
1988 the kernel machine code,
1989 right shifted by 8, and
1990 moved to the FLAT_SCRATCH_HI
1992 FLAT_SCRATCH_HI corresponds
1993 to SGPRn-4 on GFX7, and
1994 SGPRn-6 on GFX8 (where SGPRn
1995 is the highest numbered SGPR
1996 allocated to the wave).
1998 multiplied by 256 (as it is
1999 in units of 256 bytes) and
2001 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2002 to calculate the per wave
2003 FLAT SCRATCH BASE in flat
2004 memory instructions that
2008 The second SGPR is 32 bit
2009 byte size of a single
2010 work-item’s scratch memory
2011 usage. CP obtains this from
2012 the runtime, and it is
2013 always a multiple of DWORD.
2014 CP checks that the value in
2015 the kernel dispatch packet
2016 Private Segment Byte Size is
2017 not larger, and requests the
2018 runtime to increase the
2019 queue's scratch size if
2020 necessary. The kernel code
2022 FLAT_SCRATCH_LO which is
2023 SGPRn-3 on GFX7 and SGPRn-5
2024 on GFX8. FLAT_SCRATCH_LO is
2025 used as the FLAT SCRATCH
2027 instructions. Having CP load
2028 it once avoids loading it at
2029 the beginning of every
2030 wavefront. GFX9 This is the
2031 64 bit base address of the
2032 per SPI scratch backing
2033 memory managed by SPI for
2034 the queue executing the
2035 kernel dispatch. CP obtains
2036 this from the runtime (and
2037 divides it if there are
2038 multiple Shader Arrays each
2039 with its own SPI). The value
2040 of Scratch Wave Offset must
2041 be added by the kernel
2042 machine code and the result
2043 moved to the FLAT_SCRATCH
2044 SGPR which is SGPRn-6 and
2045 SGPRn-5. It is used as the
2046 FLAT SCRATCH BASE in flat
2047 memory instructions. then
2048 Private Segment Size 1 The
2049 32 bit byte size of a
2050 (enable_sgpr_private single
2052 scratch_segment_size) memory
2053 allocation. This is the
2054 value from the kernel
2055 dispatch packet Private
2056 Segment Byte Size rounded up
2057 by CP to a multiple of
2060 Having CP load it once avoids
2061 loading it at the beginning of
2064 This is not used for
2065 GFX7-GFX8 since it is the same
2066 value as the second SGPR of
2067 Flat Scratch Init. However, it
2068 may be needed for GFX9 which
2069 changes the meaning of the
2070 Flat Scratch Init value.
2071 then Grid Work-Group Count X 1 32 bit count of the number of
2072 (enable_sgpr_grid work-groups in the X dimension
2073 _workgroup_count_X) for the grid being
2074 executed. Computed from the
2075 fields in the kernel dispatch
2076 packet as ((grid_size.x +
2077 workgroup_size.x - 1) /
2079 then Grid Work-Group Count Y 1 32 bit count of the number of
2080 (enable_sgpr_grid work-groups in the Y dimension
2081 _workgroup_count_Y && for the grid being
2082 less than 16 previous executed. Computed from the
2083 SGPRs) fields in the kernel dispatch
2084 packet as ((grid_size.y +
2085 workgroup_size.y - 1) /
2088 Only initialized if <16
2089 previous SGPRs initialized.
2090 then Grid Work-Group Count Z 1 32 bit count of the number of
2091 (enable_sgpr_grid work-groups in the Z dimension
2092 _workgroup_count_Z && for the grid being
2093 less than 16 previous executed. Computed from the
2094 SGPRs) fields in the kernel dispatch
2095 packet as ((grid_size.z +
2096 workgroup_size.z - 1) /
2099 Only initialized if <16
2100 previous SGPRs initialized.
2101 then Work-Group Id X 1 32 bit work-group id in X
2102 (enable_sgpr_workgroup_id dimension of grid for
2104 then Work-Group Id Y 1 32 bit work-group id in Y
2105 (enable_sgpr_workgroup_id dimension of grid for
2107 then Work-Group Id Z 1 32 bit work-group id in Z
2108 (enable_sgpr_workgroup_id dimension of grid for
2110 then Work-Group Info 1 {first_wave, 14’b0000,
2111 (enable_sgpr_workgroup ordered_append_term[10:0],
2112 _info) threadgroup_size_in_waves[5:0]}
2113 then Scratch Wave Offset 1 32 bit byte offset from base
2114 (enable_sgpr_private of scratch base of queue
2115 _segment_wave_offset) executing the kernel
2116 dispatch. Must be used as an
2118 segment address when using
2119 Scratch Segment Buffer. It
2120 must be used to set up FLAT
2121 SCRATCH for flat addressing
2123 :ref:`amdgpu-amdhsa-flat-scratch`).
2124 ========== ========================== ====== ==============================
2126 The order of the VGPR registers is defined, but the compiler can specify which
2127 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2128 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2129 for enabled registers are dense starting at VGPR0: the first enabled register is
2130 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2133 VGPR register initial state is defined in
2134 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2136 .. table:: VGPR Register Set Up Order
2137 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2139 ========== ========================== ====== ==============================
2140 VGPR Order Name Number Description
2141 (kernel descriptor enable of
2143 ========== ========================== ====== ==============================
2144 First Work-Item Id X 1 32 bit work item id in X
2145 (Always initialized) dimension of work-group for
2147 then Work-Item Id Y 1 32 bit work item id in Y
2148 (enable_vgpr_workitem_id dimension of work-group for
2149 > 0) wavefront lane.
2150 then Work-Item Id Z 1 32 bit work item id in Z
2151 (enable_vgpr_workitem_id dimension of work-group for
2152 > 1) wavefront lane.
2153 ========== ========================== ====== ==============================
2155 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2157 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2159 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2160 combination including none.
2161 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2162 cannot included with the flat scratch init value which is per queue.
2163 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2166 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2167 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2169 The global segment can be accessed either using buffer instructions (GFX6 which
2170 has V# 64 bit address support), flat instructions (GFX7-9), or global
2171 instructions (GFX9).
2173 If buffer operations are used then the compiler can generate a V# with the
2174 following properties:
2178 * ATC: 1 if IOMMU present (such as APU)
2180 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2181 APU and NC for dGPU).
2183 .. _amdgpu-amdhsa-kernel-prolog:
2188 .. _amdgpu-amdhsa-m0:
2194 The M0 register must be initialized with a value at least the total LDS size
2195 if the kernel may access LDS via DS or flat operations. Total LDS size is
2196 available in dispatch packet. For M0, it is also possible to use maximum
2197 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2200 The M0 register is not used for range checking LDS accesses and so does not
2201 need to be initialized in the prolog.
2203 .. _amdgpu-amdhsa-flat-scratch:
2208 If the kernel may use flat operations to access scratch memory, the prolog code
2209 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2210 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2211 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2214 Flat scratch is not supported.
2217 1. The low word of Flat Scratch Init is 32 bit byte offset from
2218 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2219 being managed by SPI for the queue executing the kernel dispatch. This is
2220 the same value used in the Scratch Segment Buffer V# base address. The
2221 prolog must add the value of Scratch Wave Offset to get the wave's byte
2222 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2223 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2224 by 8 before moving into FLAT_SCRATCH_LO.
2225 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2226 work-items scratch memory usage. This is directly loaded from the kernel
2227 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2228 DWORD. Having CP load it once avoids loading it at the beginning of every
2229 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2232 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2233 memory being managed by SPI for the queue executing the kernel dispatch. The
2234 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2235 pair for use as the flat scratch base in flat memory instructions.
2237 .. _amdgpu-amdhsa-memory-model:
2242 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2243 (see :ref:`memmodel`). *The implementation is WIP.*
2246 Update when implementation complete.
2248 Support more relaxed OpenCL memory model to be controlled by environment
2249 component of target triple.
2251 The AMDGPU backend supports the memory synchronization scopes specified in
2252 :ref:`amdgpu-memory-scopes`.
2254 The code sequences used to implement the memory model are defined in table
2255 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2257 The sequences specify the order of instructions that a single thread must
2258 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2259 to other memory instructions executed by the same thread. This allows them to be
2260 moved earlier or later which can allow them to be combined with other instances
2261 of the same instruction, or hoisted/sunk out of loops to improve
2262 performance. Only the instructions related to the memory model are given;
2263 additional ``s_waitcnt`` instructions are required to ensure registers are
2264 defined before being used. These may be able to be combined with the memory
2265 model ``s_waitcnt`` instructions as described above.
2267 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2268 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2269 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2270 memory model which has separate happens-before relations for the global and
2271 local address spaces, and only a fence specifying both global and local address
2272 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2273 allow an address space to be specified the OpenCL fence has to convervatively
2274 assume both local and global address space was specified. However, optimizations
2275 can often be done to eliminate the additional ``s_waitcnt``instructions when
2276 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2277 instructions. The code sequences in the table indicate what can be omitted for
2278 the OpenCL memory. The target triple environment is used to determine if the
2279 source language is OpenCL (see :ref:`amdgpu-opencl`).
2281 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2284 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2285 termed vector memory operations.
2289 * Each agent has multiple compute units (CU).
2290 * Each CU has multiple SIMDs that execute wavefronts.
2291 * The wavefronts for a single work-group are executed in the same CU but may be
2292 executed by different SIMDs.
2293 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2295 * All LDS operations of a CU are performed as wavefront wide operations in a
2296 global order and involve no caching. Completion is reported to a wavefront in
2298 * The LDS memory has multiple request queues shared by the SIMDs of a
2299 CU. Therefore, the LDS operations performed by different waves of a work-group
2300 can be reordered relative to each other, which can result in reordering the
2301 visibility of vector memory operations with respect to LDS operations of other
2302 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2303 ensure synchronization between LDS operations and vector memory operations
2304 between waves of a work-group, but not between operations performed by the
2306 * The vector memory operations are performed as wavefront wide operations and
2307 completion is reported to a wavefront in execution order. The exception is
2308 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2309 vector memory order if they access LDS memory, and out of LDS operation order
2310 if they access global memory.
2311 * The vector memory operations access a vector L1 cache shared by all wavefronts
2312 on a CU. Therefore, no special action is required for coherence between
2313 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2314 coherence between waves executing in different work-groups as they may be
2315 executing on different CUs.
2316 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2317 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2318 scalar operations are used in a restricted way so do not impact the memory
2319 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2320 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2322 * The L2 cache has independent channels to service disjoint ranges of virtual
2324 * Each CU has a separate request queue per channel. Therefore, the vector and
2325 scalar memory operations performed by waves executing in different work-groups
2326 (which may be executing on different CUs) of an agent can be reordered
2327 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2328 synchronization between vector memory operations of different CUs. It ensures a
2329 previous vector memory operation has completed before executing a subsequent
2330 vector memory or LDS operation and so can be used to meet the requirements of
2331 acquire and release.
2332 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2333 of virtual addresses can be set up to bypass it to ensure system coherence.
2335 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2336 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2337 memory, atomic memory orderings are not meaningful and all accesses are treated
2340 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2341 scalar memory instructions). Since the constant address space contents do not
2342 change during the execution of a kernel dispatch it is not legal to perform
2343 stores, and atomic memory orderings are not meaningful and all access are
2344 treated as non-atomic.
2346 A memory synchronization scope wider than work-group is not meaningful for the
2347 group (LDS) address space and is treated as work-group.
2349 The memory model does not support the region address space which is treated as
2352 Acquire memory ordering is not meaningful on store atomic instructions and is
2353 treated as non-atomic.
2355 Release memory ordering is not meaningful on load atomic instructions and is
2356 treated a non-atomic.
2358 Acquire-release memory ordering is not meaningful on load or store atomic
2359 instructions and is treated as acquire and release respectively.
2361 AMDGPU backend only uses scalar memory operations to access memory that is
2362 proven to not change during the execution of the kernel dispatch. This includes
2363 constant address space and global address space for program scope const
2364 variables. Therefore the kernel machine code does not have to maintain the
2365 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2366 and vector L1 caches are invalidated between kernel dispatches by CP since
2367 constant address space data may change between kernel dispatch executions. See
2368 :ref:`amdgpu-amdhsa-memory-spaces`.
2370 The one execption is if scalar writes are used to spill SGPR registers. In this
2371 case the AMDGPU backend ensures the memory location used to spill is never
2372 accessed by vector memory operations at the same time. If scalar writes are used
2373 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2374 return since the locations may be used for vector memory instructions by a
2375 future wave that uses the same scratch area, or a function call that creates a
2376 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2377 as all scalar writes are write-before-read in the same thread.
2379 Scratch backing memory (which is used for the private address space) is accessed
2380 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2381 is only accessed by a single thread, and is always write-before-read,
2382 there is never a need to invalidate these entries from the L1 cache. Hence all
2383 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2386 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2387 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2388 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2389 coherent) and so the L2 cache will coherent with the CPU and other agents.
2391 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2392 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2394 ============ ============ ============== ========== =======================
2395 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2396 Ordering Sync Scope Address
2398 ============ ============ ============== ========== =======================
2400 ---------------------------------------------------------------------------
2401 load *none* *none* - global non-volatile
2402 - generic 1. buffer/global/flat_load
2404 1. buffer/global/flat_load
2406 load *none* *none* - local 1. ds_load
2407 store *none* *none* - global 1. buffer/global/flat_store
2409 store *none* *none* - local 1. ds_store
2410 **Unordered Atomic**
2411 ---------------------------------------------------------------------------
2412 load atomic unordered *any* *any* *Same as non-atomic*.
2413 store atomic unordered *any* *any* *Same as non-atomic*.
2414 atomicrmw unordered *any* *any* *Same as monotonic
2416 **Monotonic Atomic**
2417 ---------------------------------------------------------------------------
2418 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2419 - wavefront - generic
2421 load atomic monotonic - singlethread - local 1. ds_load
2424 load atomic monotonic - agent - global 1. buffer/global/flat_load
2425 - system - generic glc=1
2426 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2427 - wavefront - generic
2431 store atomic monotonic - singlethread - local 1. ds_store
2434 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2435 - wavefront - generic
2439 atomicrmw monotonic - singlethread - local 1. ds_atomic
2443 ---------------------------------------------------------------------------
2444 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2447 load atomic acquire - workgroup - global 1. buffer/global_load
2448 load atomic acquire - workgroup - local 1. ds/flat_load
2449 - generic 2. s_waitcnt lgkmcnt(0)
2453 - Must happen before
2466 load atomic acquire - agent - global 1. buffer/global_load
2468 2. s_waitcnt vmcnt(0)
2470 - Must happen before
2478 3. buffer_wbinvl1_vol
2480 - Must happen before
2490 load atomic acquire - agent - generic 1. flat_load glc=1
2491 - system 2. s_waitcnt vmcnt(0) &
2496 - Must happen before
2499 - Ensures the flat_load
2504 3. buffer_wbinvl1_vol
2506 - Must happen before
2516 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2519 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2520 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2521 - generic 2. waitcnt lgkmcnt(0)
2525 - Must happen before
2538 atomicrmw acquire - agent - global 1. buffer/global_atomic
2539 - system 2. s_waitcnt vmcnt(0)
2541 - Must happen before
2550 3. buffer_wbinvl1_vol
2552 - Must happen before
2562 atomicrmw acquire - agent - generic 1. flat_atomic
2563 - system 2. s_waitcnt vmcnt(0) &
2568 - Must happen before
2577 3. buffer_wbinvl1_vol
2579 - Must happen before
2589 fence acquire - singlethread *none* *none*
2591 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2622 fence-paired-atomic).
2623 - Must happen before
2634 fence-paired-atomic.
2636 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2651 - Could be split into
2660 - s_waitcnt vmcnt(0)
2671 fence-paired-atomic).
2672 - s_waitcnt lgkmcnt(0)
2683 fence-paired-atomic).
2684 - Must happen before
2698 fence-paired-atomic.
2700 2. buffer_wbinvl1_vol
2702 - Must happen before
2703 any following global/generic
2713 ---------------------------------------------------------------------------
2714 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2717 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2727 - Must happen before
2738 2. buffer/global/flat_store
2739 store atomic release - workgroup - local 1. ds_store
2740 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2741 - system - generic lgkmcnt(0)
2745 - Could be split into
2754 - s_waitcnt vmcnt(0)
2761 - s_waitcnt lgkmcnt(0)
2768 - Must happen before
2779 2. buffer/global/ds/flat_store
2780 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2783 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2793 - Must happen before
2804 2. buffer/global/flat_atomic
2805 atomicrmw release - workgroup - local 1. ds_atomic
2806 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2807 - system - generic lgkmcnt(0)
2811 - Could be split into
2820 - s_waitcnt vmcnt(0)
2827 - s_waitcnt lgkmcnt(0)
2834 - Must happen before
2845 2. buffer/global/ds/flat_atomic*
2846 fence release - singlethread *none* *none*
2848 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2868 - Must happen before
2877 fence-paired-atomic).
2884 fence-paired-atomic.
2886 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2901 - Could be split into
2910 - s_waitcnt vmcnt(0)
2917 - s_waitcnt lgkmcnt(0)
2924 - Must happen before
2933 fence-paired-atomic).
2940 fence-paired-atomic.
2942 **Acquire-Release Atomic**
2943 ---------------------------------------------------------------------------
2944 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2947 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2957 - Must happen before
2968 2. buffer/global_atomic
2969 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2970 2. s_waitcnt lgkmcnt(0)
2974 - Must happen before
2987 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2997 - Must happen before
3009 3. s_waitcnt lgkmcnt(0)
3013 - Must happen before
3025 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
3030 - Could be split into
3039 - s_waitcnt vmcnt(0)
3046 - s_waitcnt lgkmcnt(0)
3053 - Must happen before
3064 2. buffer/global_atomic
3065 3. s_waitcnt vmcnt(0)
3067 - Must happen before
3076 4. buffer_wbinvl1_vol
3078 - Must happen before
3088 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3093 - Could be split into
3102 - s_waitcnt vmcnt(0)
3109 - s_waitcnt lgkmcnt(0)
3116 - Must happen before
3128 3. s_waitcnt vmcnt(0) &
3133 - Must happen before
3142 4. buffer_wbinvl1_vol
3144 - Must happen before
3154 fence acq_rel - singlethread *none* *none*
3156 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3176 - Must happen before
3199 fence-paired-atomic)
3220 fence-paired-atomic).
3225 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3240 - Could be split into
3249 - s_waitcnt vmcnt(0)
3256 - s_waitcnt lgkmcnt(0)
3263 - Must happen before
3268 global/local/generic
3277 fence-paired-atomic)
3289 global/local/generic
3298 fence-paired-atomic).
3303 2. buffer_wbinvl1_vol
3305 - Must happen before
3319 **Sequential Consistent Atomic**
3320 ---------------------------------------------------------------------------
3321 load atomic seq_cst - singlethread - global *Same as corresponding
3322 - wavefront - local load atomic acquire*.
3323 - workgroup - generic
3324 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3326 - generic - Must happen after
3373 instructions same as
3377 store atomic seq_cst - singlethread - global *Same as corresponding
3378 - wavefront - local store atomic release*.
3379 - workgroup - generic
3380 store atomic seq_cst - agent - global *Same as corresponding
3381 - system - generic store atomic release*.
3382 atomicrmw seq_cst - singlethread - global *Same as corresponding
3383 - wavefront - local atomicrmw acq_rel*.
3384 - workgroup - generic
3385 atomicrmw seq_cst - agent - global *Same as corresponding
3386 - system - generic atomicrmw acq_rel*.
3387 fence seq_cst - singlethread *none* *Same as corresponding
3388 - wavefront fence acq_rel*.
3392 ============ ============ ============== ========== =======================
3394 The memory order also adds the single thread optimization constrains defined in
3396 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3398 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3399 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3401 ============ ==============================================================
3402 LLVM Memory Optimization Constraints
3404 ============ ==============================================================
3407 acquire - If a load atomic/atomicrmw then no following load/load
3408 atomic/store/ store atomic/atomicrmw/fence instruction can
3409 be moved before the acquire.
3410 - If a fence then same as load atomic, plus no preceding
3411 associated fence-paired-atomic can be moved after the fence.
3412 release - If a store atomic/atomicrmw then no preceding load/load
3413 atomic/store/ store atomic/atomicrmw/fence instruction can
3414 be moved after the release.
3415 - If a fence then same as store atomic, plus no following
3416 associated fence-paired-atomic can be moved before the
3418 acq_rel Same constraints as both acquire and release.
3419 seq_cst - If a load atomic then same constraints as acquire, plus no
3420 preceding sequentially consistent load atomic/store
3421 atomic/atomicrmw/fence instruction can be moved after the
3423 - If a store atomic then the same constraints as release, plus
3424 no following sequentially consistent load atomic/store
3425 atomic/atomicrmw/fence instruction can be moved before the
3427 - If an atomicrmw/fence then same constraints as acq_rel.
3428 ============ ==============================================================
3433 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3434 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3435 the ``s_trap`` instruction with the following usage:
3437 .. table:: AMDGPU Trap Handler for AMDHSA OS
3438 :name: amdgpu-trap-handler-for-amdhsa-os-table
3440 =================== =============== =============== =======================
3441 Usage Code Sequence Trap Handler Description
3443 =================== =============== =============== =======================
3444 reserved ``s_trap 0x00`` Reserved by hardware.
3445 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3446 ``queue_ptr`` ``debugtrap``
3447 ``VGPR0``: intrinsic (not
3448 ``arg`` implemented).
3449 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3450 ``queue_ptr`` terminated and its
3451 associated queue put
3452 into the error state.
3453 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3454 ``queue_ptr`` installed handled
3455 same as ``llvm.trap``.
3456 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3458 debugger ``s_trap 0x08`` Reserved for debugger.
3459 debugger ``s_trap 0xfe`` Reserved for debugger.
3460 debugger ``s_trap 0xff`` Reserved for debugger.
3461 =================== =============== =============== =======================
3466 This section provides code conventions used when the target triple OS is
3467 empty (see :ref:`amdgpu-target-triples`).
3472 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3473 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3474 instructions are handled as follows:
3476 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3477 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3479 =============== =============== ===========================================
3480 Usage Code Sequence Description
3481 =============== =============== ===========================================
3482 llvm.trap s_endpgm Causes wavefront to be terminated.
3483 llvm.debugtrap *none* Compiler warning given that there is no
3484 trap handler installed.
3485 =============== =============== ===========================================
3495 When generating code for the OpenCL language the target triple environment
3496 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3498 When the language is OpenCL the following differences occur:
3500 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3501 2. The AMDGPU backend adds additional arguments to the kernel.
3502 3. Additional metadata is generated
3503 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3506 Specify what affect this has. Hidden arguments added. Additional metadata
3514 When generating code for the OpenCL language the target triple environment
3515 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3517 When the language is OpenCL the following differences occur:
3519 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3522 Specify what affect this has.
3527 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3528 It supports AMDGCN GFX6-GFX8.
3530 This section describes general syntax for instructions and operands. For more
3531 information about instructions, their semantics and supported combinations of
3532 operands, refer to one of instruction set architecture manuals
3533 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3535 An instruction has the following syntax (register operands are normally
3536 comma-separated while extra operands are space-separated):
3538 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3543 The following syntax for register operands is supported:
3545 * SGPR registers: s0, ... or s[0], ...
3546 * VGPR registers: v0, ... or v[0], ...
3547 * TTMP registers: ttmp0, ... or ttmp[0], ...
3548 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3549 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3550 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3551 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3552 * Register index expressions: v[2*2], s[1-1:2-1]
3553 * 'off' indicates that an operand is not enabled
3555 The following extra operands are supported:
3557 * offset, offset0, offset1
3559 * glc, slc, tfe bits
3560 * waitcnt: integer or combination of counter values
3563 - abs (\| \|), neg (\-)
3567 - row_shl, row_shr, row_ror, row_rol
3568 - row_mirror, row_half_mirror, row_bcast
3569 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3570 - row_mask, bank_mask, bound_ctrl
3574 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3575 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3578 Instruction Examples
3579 ~~~~~~~~~~~~~~~~~~~~
3584 .. code-block:: nasm
3586 ds_add_u32 v2, v4 offset:16
3587 ds_write_src2_b64 v2 offset0:4 offset1:8
3588 ds_cmpst_f32 v2, v4, v6
3589 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3592 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3597 .. code-block:: nasm
3599 flat_load_dword v1, v[3:4]
3600 flat_store_dwordx3 v[3:4], v[5:7]
3601 flat_atomic_swap v1, v[3:4], v5 glc
3602 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3603 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3605 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3610 .. code-block:: nasm
3612 buffer_load_dword v1, off, s[4:7], s1
3613 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3614 buffer_store_format_xy v[1:2], off, s[4:7], s1
3616 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3618 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3623 .. code-block:: nasm
3625 s_load_dword s1, s[2:3], 0xfc
3626 s_load_dwordx8 s[8:15], s[2:3], s4
3627 s_load_dwordx16 s[88:103], s[2:3], s4
3631 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3636 .. code-block:: nasm
3639 s_mov_b64 s[0:1], 0x80000000
3641 s_wqm_b64 s[2:3], s[4:5]
3642 s_bcnt0_i32_b64 s1, s[2:3]
3643 s_swappc_b64 s[2:3], s[4:5]
3644 s_cbranch_join s[4:5]
3646 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3651 .. code-block:: nasm
3653 s_add_u32 s1, s2, s3
3654 s_and_b64 s[2:3], s[4:5], s[6:7]
3655 s_cselect_b32 s1, s2, s3
3656 s_andn2_b32 s2, s4, s6
3657 s_lshr_b64 s[2:3], s[4:5], s6
3658 s_ashr_i32 s2, s4, s6
3659 s_bfm_b64 s[2:3], s4, s6
3660 s_bfe_i64 s[2:3], s[4:5], s6
3661 s_cbranch_g_fork s[4:5], s[6:7]
3663 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3668 .. code-block:: nasm
3671 s_bitcmp1_b32 s1, s2
3672 s_bitcmp0_b64 s[2:3], s4
3675 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3680 .. code-block:: nasm
3685 s_waitcnt 0 ; Wait for all counters to be 0
3686 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3687 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3691 s_sendmsg sendmsg(MSG_INTERRUPT)
3694 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3696 Unless otherwise mentioned, little verification is performed on the operands
3697 of SOPP Instructions, so it is up to the programmer to be familiar with the
3698 range or acceptable values.
3703 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3704 the assembler will automatically use optimal encoding based on its operands.
3705 To force specific encoding, one can add a suffix to the opcode of the instruction:
3707 * _e32 for 32-bit VOP1/VOP2/VOPC
3708 * _e64 for 64-bit VOP3
3710 * _sdwa for VOP_SDWA
3712 VOP1/VOP2/VOP3/VOPC examples:
3714 .. code-block:: nasm
3717 v_mov_b32_e32 v1, v2
3719 v_cvt_f64_i32_e32 v[1:2], v2
3720 v_floor_f32_e32 v1, v2
3721 v_bfrev_b32_e32 v1, v2
3722 v_add_f32_e32 v1, v2, v3
3723 v_mul_i32_i24_e64 v1, v2, 3
3724 v_mul_i32_i24_e32 v1, -3, v3
3725 v_mul_i32_i24_e32 v1, -100, v3
3726 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3727 v_max_f16_e32 v1, v2, v3
3731 .. code-block:: nasm
3733 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3734 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3735 v_mov_b32 v0, v0 wave_shl:1
3736 v_mov_b32 v0, v0 row_mirror
3737 v_mov_b32 v0, v0 row_bcast:31
3738 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3739 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3740 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3744 .. code-block:: nasm
3746 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3747 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3748 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3749 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3750 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3752 For full list of supported instructions, refer to "Vector ALU instructions".
3754 HSA Code Object Directives
3755 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3757 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3758 one can specify them with assembler directives.
3760 .hsa_code_object_version major, minor
3761 +++++++++++++++++++++++++++++++++++++
3763 *major* and *minor* are integers that specify the version of the HSA code
3764 object that will be generated by the assembler.
3766 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3767 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3770 *major*, *minor*, and *stepping* are all integers that describe the instruction
3771 set architecture (ISA) version of the assembly program.
3773 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
3774 "AMD" and *arch* should always be equal to "AMDGPU".
3776 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3777 from the value of the -mcpu option that is passed to the assembler.
3779 .amdgpu_hsa_kernel (name)
3780 +++++++++++++++++++++++++
3782 This directives specifies that the symbol with given name is a kernel entry point
3783 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3788 This directive marks the beginning of a list of key / value pairs that are used
3789 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3790 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3791 any amd_kernel_code_t values that are unspecified a default value will be
3792 used. The default value for all keys is 0, with the following exceptions:
3794 - *kernel_code_version_major* defaults to 1.
3795 - *machine_kind* defaults to 1.
3796 - *machine_version_major*, *machine_version_minor*, and
3797 *machine_version_stepping* are derived from the value of the -mcpu option
3798 that is passed to the assembler.
3799 - *kernel_code_entry_byte_offset* defaults to 256.
3800 - *wavefront_size* defaults to 6.
3801 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3802 *private_segment_alignment* default to 4. Note that alignments are specified
3803 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3805 The *.amd_kernel_code_t* directive must be placed immediately after the
3806 function label and before any instructions.
3808 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3809 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3811 Here is an example of a minimal amd_kernel_code_t specification:
3813 .. code-block:: none
3815 .hsa_code_object_version 1,0
3816 .hsa_code_object_isa
3821 .amdgpu_hsa_kernel hello_world
3826 enable_sgpr_kernarg_segment_ptr = 1
3828 compute_pgm_rsrc1_vgprs = 0
3829 compute_pgm_rsrc1_sgprs = 0
3830 compute_pgm_rsrc2_user_sgpr = 2
3831 kernarg_segment_byte_size = 8
3832 wavefront_sgpr_count = 2
3833 workitem_vgpr_count = 3
3834 .end_amd_kernel_code_t
3836 s_load_dwordx2 s[0:1], s[0:1] 0x0
3837 v_mov_b32 v0, 3.14159
3838 s_waitcnt lgkmcnt(0)
3841 flat_store_dword v[1:2], v0
3844 .size hello_world, .Lfunc_end0-hello_world
3846 Additional Documentation
3847 ========================
3849 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3850 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3851 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3852 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3853 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3854 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3855 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3856 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
3857 .. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3858 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3859 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3860 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3861 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3862 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3863 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3864 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3865 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3866 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3867 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__