1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
29 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
36 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
39 ============ ==============================================================
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
46 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
49 ============== ============================================================
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
59 ============== ============================================================
61 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
64 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
70 ============ ==============================================================
72 .. _amdgpu-processors:
77 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
78 names from both the *Processor* and *Alternative Processor* can be used.
80 .. table:: AMDGPU Processors
81 :name: amdgpu-processor-table
83 =========== =============== ============ ===== ========= ======= ==================
84 Processor Alternative Target dGPU/ Target ROCm Example
85 Processor Triple APU Features Support Products
86 Architecture Supported
88 =========== =============== ============ ===== ========= ======= ==================
89 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
90 -----------------------------------------------------------------------------------
91 ``r600`` ``r600`` dGPU
92 ``r630`` ``r600`` dGPU
93 ``rs880`` ``r600`` dGPU
94 ``rv670`` ``r600`` dGPU
95 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
96 -----------------------------------------------------------------------------------
97 ``rv710`` ``r600`` dGPU
98 ``rv730`` ``r600`` dGPU
99 ``rv770`` ``r600`` dGPU
100 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
101 -----------------------------------------------------------------------------------
102 ``cedar`` ``r600`` dGPU
103 ``cypress`` ``r600`` dGPU
104 ``juniper`` ``r600`` dGPU
105 ``redwood`` ``r600`` dGPU
106 ``sumo`` ``r600`` dGPU
107 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
108 -----------------------------------------------------------------------------------
109 ``barts`` ``r600`` dGPU
110 ``caicos`` ``r600`` dGPU
111 ``cayman`` ``r600`` dGPU
112 ``turks`` ``r600`` dGPU
113 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
114 -----------------------------------------------------------------------------------
115 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
116 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
120 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
121 -----------------------------------------------------------------------------------
122 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
132 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
136 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
140 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
141 - ``mullins`` - E1-2200
149 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
153 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
154 -----------------------------------------------------------------------------------
155 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
161 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
164 \ ``amdgcn`` APU - xnack - A10-9600P
170 \ ``amdgcn`` APU - xnack - E2-9010
173 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
174 - ``tonga`` [off] - FirePro S7100
181 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
182 [off] - Radeon R9 Fury
186 - Radeon Instinct MI8
187 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
188 [off] - Radeon RX 480
189 - Radeon Instinct MI6
190 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
192 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
194 **GCN GFX9** [AMD-GCN-GFX9]_
195 -----------------------------------------------------------------------------------
196 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
197 [off] Frontier Edition
202 - Radeon Instinct MI25
203 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
208 =========== =============== ============ ===== ========= ======= ==================
210 .. _amdgpu-target-features:
215 Target features control how code is generated to support certain
216 processor specific features. Not all target features are supported by
217 all processors. The runtime must ensure that the features supported by
218 the device used to execute the code match the features enabled when
219 generating the code. A mismatch of features may result in incorrect
220 execution, or a reduction in performance.
222 The target features supported by each processor, and the default value
223 used if not specified explicitly, is listed in
224 :ref:`amdgpu-processor-table`.
226 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232 Enable the ``xnack`` feature.
234 Disable the ``xnack`` feature.
236 .. table:: AMDGPU Target Features
237 :name: amdgpu-target-feature-table
239 ============== ==================================================
240 Target Feature Description
241 ============== ==================================================
242 -m[no-]xnack Enable/disable generating code that has
243 memory clauses that are compatible with
244 having XNACK replay enabled.
246 This is used for demand paging and page
247 migration. If XNACK replay is enabled in
248 the device, then if a page fault occurs
249 the code may execute incorrectly if the
250 ``xnack`` feature is not enabled. Executing
251 code that has the feature enabled on a
252 device that does not have XNACK replay
253 enabled will execute correctly, but may
254 be less performant than code with the
256 ============== ==================================================
258 .. _amdgpu-address-spaces:
263 The AMDGPU backend uses the following address space mappings.
265 The memory space names used in the table, aside from the region memory space, is
266 from the OpenCL standard.
268 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
270 .. table:: Address Space Mapping
271 :name: amdgpu-address-space-mapping-table
273 ================== =================
274 LLVM Address Space Memory Space
275 ================== =================
283 ================== =================
285 .. _amdgpu-memory-scopes:
290 This section provides LLVM memory synchronization scopes supported by the AMDGPU
291 backend memory model when the target triple OS is ``amdhsa`` (see
292 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
294 The memory model supported is based on the HSA memory model [HSA]_ which is
295 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
296 relation is transitive over the synchonizes-with relation independent of scope,
297 and synchonizes-with allows the memory scope instances to be inclusive (see
298 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
300 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
301 inclusion and requires the memory scopes to exactly match. However, this
302 is conservatively correct for OpenCL.
304 .. table:: AMDHSA LLVM Sync Scopes
305 :name: amdgpu-amdhsa-llvm-sync-scopes-table
307 ================ ==========================================================
308 LLVM Sync Scope Description
309 ================ ==========================================================
310 *none* The default: ``system``.
312 Synchronizes with, and participates in modification and
313 seq_cst total orderings with, other operations (except
314 image operations) for all address spaces (except private,
315 or generic that accesses private) provided the other
316 operation's sync scope is:
319 - ``agent`` and executed by a thread on the same agent.
320 - ``workgroup`` and executed by a thread in the same
322 - ``wavefront`` and executed by a thread in the same
325 ``agent`` Synchronizes with, and participates in modification and
326 seq_cst total orderings with, other operations (except
327 image operations) for all address spaces (except private,
328 or generic that accesses private) provided the other
329 operation's sync scope is:
331 - ``system`` or ``agent`` and executed by a thread on the
333 - ``workgroup`` and executed by a thread in the same
335 - ``wavefront`` and executed by a thread in the same
338 ``workgroup`` Synchronizes with, and participates in modification and
339 seq_cst total orderings with, other operations (except
340 image operations) for all address spaces (except private,
341 or generic that accesses private) provided the other
342 operation's sync scope is:
344 - ``system``, ``agent`` or ``workgroup`` and executed by a
345 thread in the same workgroup.
346 - ``wavefront`` and executed by a thread in the same
349 ``wavefront`` Synchronizes with, and participates in modification and
350 seq_cst total orderings with, other operations (except
351 image operations) for all address spaces (except private,
352 or generic that accesses private) provided the other
353 operation's sync scope is:
355 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
356 and executed by a thread in the same wavefront.
358 ``singlethread`` Only synchronizes with, and participates in modification
359 and seq_cst total orderings with, other operations (except
360 image operations) running in the same thread for all
361 address spaces (for example, in signal handlers).
362 ================ ==========================================================
367 The AMDGPU backend implements the following intrinsics.
369 *This section is WIP.*
372 List AMDGPU intrinsics
377 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
378 can be linked by ``lld`` to produce a standard ELF shared code object which can
379 be loaded and executed on an AMDGPU target.
384 The AMDGPU backend uses the following ELF header:
386 .. table:: AMDGPU ELF Header
387 :name: amdgpu-elf-header-table
389 ========================== ===============================
391 ========================== ===============================
392 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
393 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
394 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
395 - ``ELFOSABI_AMDGPU_HSA``
396 - ``ELFOSABI_AMDGPU_PAL``
397 - ``ELFOSABI_AMDGPU_MESA3D``
398 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
399 - ``ELFABIVERSION_AMDGPU_PAL``
400 - ``ELFABIVERSION_AMDGPU_MESA3D``
401 ``e_type`` - ``ET_REL``
403 ``e_machine`` ``EM_AMDGPU``
405 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
406 ========================== ===============================
410 .. table:: AMDGPU ELF Header Enumeration Values
411 :name: amdgpu-elf-header-enumeration-values-table
413 =============================== =====
415 =============================== =====
418 ``ELFOSABI_AMDGPU_HSA`` 64
419 ``ELFOSABI_AMDGPU_PAL`` 65
420 ``ELFOSABI_AMDGPU_MESA3D`` 66
421 ``ELFABIVERSION_AMDGPU_HSA`` 1
422 ``ELFABIVERSION_AMDGPU_PAL`` 0
423 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
424 =============================== =====
426 ``e_ident[EI_CLASS]``
429 * ``ELFCLASS32`` for ``r600`` architecture.
431 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
435 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
437 ``e_ident[EI_OSABI]``
438 One of the following AMD GPU architecture specific OS ABIs
439 (see :ref:`amdgpu-os-table`):
441 * ``ELFOSABI_NONE`` for *unknown* OS.
443 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
445 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
447 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
449 ``e_ident[EI_ABIVERSION]``
450 The ABI version of the AMD GPU architecture specific OS ABI to which the code
453 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
456 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
459 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
463 Can be one of the following values:
467 The type produced by the AMD GPU backend compiler as it is relocatable code
471 The type produced by the linker as it is a shared code object.
473 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
476 The value ``EM_AMDGPU`` is used for the machine for all processors supported
477 by the ``r600`` and ``amdgcn`` architectures (see
478 :ref:`amdgpu-processor-table`). The specific processor is specified in the
479 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
480 :ref:`amdgpu-elf-header-e_flags-table`).
483 The entry point is 0 as the entry points for individual kernels must be
484 selected in order to invoke them through AQL packets.
487 The AMDGPU backend uses the following ELF header flags:
489 .. table:: AMDGPU ELF Header ``e_flags``
490 :name: amdgpu-elf-header-e_flags-table
492 ================================= ========== =============================
493 Name Value Description
494 ================================= ========== =============================
495 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
496 -------------------------------------------- -----------------------------
497 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
499 ``EF_AMDGPU_MACH_xxx`` values
501 :ref:`amdgpu-ef-amdgpu-mach-table`.
502 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
505 contained in the code object.
512 :ref:`amdgpu-target-features`.
513 ================================= ========== =============================
515 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
516 :name: amdgpu-ef-amdgpu-mach-table
518 ================================= ========== =============================
519 Name Value Description (see
520 :ref:`amdgpu-processor-table`)
521 ================================= ========== =============================
522 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
523 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
524 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
525 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
526 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
527 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
528 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
529 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
530 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
531 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
532 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
533 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
534 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
535 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
536 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
537 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
538 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
539 *reserved* 0x011 - Reserved for ``r600``
540 0x01f architecture processors.
541 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
542 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
543 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
544 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
545 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
546 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
548 *reserved* 0x027 Reserved.
549 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
553 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
554 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
555 *reserved* 0x02e Reserved.
556 *reserved* 0x02f Reserved.
557 *reserved* 0x030 Reserved.
558 ================================= ========== =============================
563 An AMDGPU target ELF code object has the standard ELF sections which include:
565 .. table:: AMDGPU ELF Sections
566 :name: amdgpu-elf-sections-table
568 ================== ================ =================================
570 ================== ================ =================================
571 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
572 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
573 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
574 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
575 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
576 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
577 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
578 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
579 ``.note`` ``SHT_NOTE`` *none*
580 ``.rela``\ *name* ``SHT_RELA`` *none*
581 ``.rela.dyn`` ``SHT_RELA`` *none*
582 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
583 ``.shstrtab`` ``SHT_STRTAB`` *none*
584 ``.strtab`` ``SHT_STRTAB`` *none*
585 ``.symtab`` ``SHT_SYMTAB`` *none*
586 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
587 ================== ================ =================================
589 These sections have their standard meanings (see [ELF]_) and are only generated
593 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
594 DWARF produced by the AMDGPU backend.
596 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
597 The standard sections used by a dynamic loader.
600 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
603 ``.rela``\ *name*, ``.rela.dyn``
604 For relocatable code objects, *name* is the name of the section that the
605 relocation records apply. For example, ``.rela.text`` is the section name for
606 relocation records associated with the ``.text`` section.
608 For linked shared code objects, ``.rela.dyn`` contains all the relocation
609 records from each of the relocatable code object's ``.rela``\ *name* sections.
611 See :ref:`amdgpu-relocation-records` for the relocation records supported by
615 The executable machine code for the kernels and functions they call. Generated
616 as position independent code. See :ref:`amdgpu-code-conventions` for
617 information on conventions used in the isa generation.
619 .. _amdgpu-note-records:
624 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
625 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
626 aligned. In addition, minimal zero byte padding must be generated to ensure the
627 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
628 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
630 The AMDGPU backend code object uses the following ELF note records in the
631 ``.note`` section. The *Description* column specifies the layout of the note
632 record's ``desc`` field. All fields are consecutive bytes. Note records with
633 variable size strings have a corresponding ``*_size`` field that specifies the
634 number of bytes, including the terminating null character, in the string. The
635 string(s) come immediately after the preceding fields.
637 Additional note records can be present.
639 .. table:: AMDGPU ELF Note Records
640 :name: amdgpu-elf-note-records-table
642 ===== ============================== ======================================
643 Name Type Description
644 ===== ============================== ======================================
645 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
646 ===== ============================== ======================================
650 .. table:: AMDGPU ELF Note Record Enumeration Values
651 :name: amdgpu-elf-note-record-enumeration-values-table
653 ============================== =====
655 ============================== =====
657 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
659 ============================== =====
661 ``NT_AMD_AMDGPU_HSA_METADATA``
662 Specifies extensible metadata associated with the code objects executed on HSA
663 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
664 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
665 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
666 object metadata string.
673 Symbols include the following:
675 .. table:: AMDGPU ELF Symbols
676 :name: amdgpu-elf-symbols-table
678 ===================== ============== ============= ==================
679 Name Type Section Description
680 ===================== ============== ============= ==================
681 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
684 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
685 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
686 ===================== ============== ============= ==================
689 Global variables both used and defined by the compilation unit.
691 If the symbol is defined in the compilation unit then it is allocated in the
692 appropriate section according to if it has initialized data or is readonly.
694 If the symbol is external then its section is ``STN_UNDEF`` and the loader
695 will resolve relocations using the definition provided by another code object
696 or explicitly defined by the runtime.
698 All global symbols, whether defined in the compilation unit or external, are
699 accessed by the machine code indirectly through a GOT table entry. This
700 allows them to be preemptable. The GOT table is only supported when the target
701 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
704 Add description of linked shared object symbols. Seems undefined symbols
705 are marked as STT_NOTYPE.
708 Every HSA kernel has an associated kernel descriptor. It is the address of the
709 kernel descriptor that is used in the AQL dispatch packet used to invoke the
710 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
711 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
714 Every HSA kernel also has a symbol for its machine code entry point.
716 .. _amdgpu-relocation-records:
721 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
722 relocatable fields are:
725 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
726 alignment. These values use the same byte order as other word values in the
727 AMD GPU architecture.
730 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
731 alignment. These values use the same byte order as other word values in the
732 AMD GPU architecture.
734 Following notations are used for specifying relocation calculations:
737 Represents the addend used to compute the value of the relocatable field.
740 Represents the offset into the global offset table at which the relocation
741 entry's symbol will reside during execution.
744 Represents the address of the global offset table.
747 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
748 of the storage unit being relocated (computed using ``r_offset``).
751 Represents the value of the symbol whose index resides in the relocation
752 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
755 Represents the base address of a loaded executable or shared object which is
756 the difference between the ELF address and the actual load address. Relocations
757 using this are only valid in executable or shared objects.
759 The following relocation types are supported:
761 .. table:: AMDGPU ELF Relocation Records
762 :name: amdgpu-elf-relocation-records-table
764 ========================== ======= ===== ========== ==============================
765 Relocation Type Kind Value Field Calculation
766 ========================== ======= ===== ========== ==============================
767 ``R_AMDGPU_NONE`` 0 *none* *none*
768 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
769 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
770 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
771 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
772 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
773 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
774 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
775 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
776 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
777 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
778 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
780 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
781 ========================== ======= ===== ========== ==============================
788 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
789 information that maps the code object executable code and data to the source
790 language constructs. It can be used by tools such as debuggers and profilers.
792 Address Space Mapping
793 ~~~~~~~~~~~~~~~~~~~~~
795 The following address space mapping is used:
797 .. table:: AMDGPU DWARF Address Space Mapping
798 :name: amdgpu-dwarf-address-space-mapping-table
800 =================== =================
801 DWARF Address Space Memory Space
802 =================== =================
807 *omitted* Generic (Flat)
808 *not supported* Region (GDS)
809 =================== =================
811 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
814 An ``address_class`` attribute is generated on pointer type DIEs to specify the
815 DWARF address space of the value of the pointer when it is in the *private* or
816 *local* address space. Otherwise the attribute is omitted.
818 An ``XDEREF`` operation is generated in location list expressions for variables
819 that are allocated in the *private* and *local* address space. Otherwise no
820 ``XDREF`` is omitted.
825 *This section is WIP.*
828 Define DWARF register enumeration.
830 If want to present a wavefront state then should expose vector registers as
831 64 wide (rather than per work-item view that LLVM uses). Either as separate
832 registers, or a 64x4 byte single register. In either case use a new LANE op
833 (akin to XDREF) to select the current lane usage in a location
834 expression. This would also allow scalar register spilling to vector register
835 lanes to be expressed (currently no debug information is being generated for
836 spilling). If choose a wide single register approach then use LANE in
837 conjunction with PIECE operation to select the dword part of the register for
838 the current lane. If the separate register approach then use LANE to select
844 Source text for online-compiled programs (e.g. those compiled by the OpenCL
845 runtime) may be embedded into the DWARF v5 line table using the ``clang
846 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
851 Enable the embedded source DWARF v5 extension.
852 ``-gno-embed-source``
853 Disable the embedded source DWARF v5 extension.
855 .. table:: AMDGPU Debug Options
856 :name: amdgpu-debug-options
858 ==================== ==================================================
859 Debug Flag Description
860 ==================== ==================================================
861 -g[no-]embed-source Enable/disable embedding source text in DWARF
862 debug sections. Useful for environments where
863 source cannot be written to disk, such as
864 when performing online compilation.
865 ==================== ==================================================
867 This option enables one extended content types in the DWARF v5 Line Number
868 Program Header, which is used to encode embedded source.
870 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
871 :name: amdgpu-dwarf-extended-content-types
873 ============================ ======================
875 ============================ ======================
876 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
877 ============================ ======================
879 The source field will contain the UTF-8 encoded, null-terminated source text
880 with ``'\n'`` line endings. When the source field is present, consumers can use
881 the embedded source instead of attempting to discover the source on disk. When
882 the source field is absent, consumers can access the file to get the source
885 The above content type appears in the ``file_name_entry_format`` field of the
886 line table prologue, and its corresponding value appear in the ``file_names``
887 field. The current encoding of the content type is documented in table
888 :ref:`amdgpu-dwarf-extended-content-types-encoding`
890 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
891 :name: amdgpu-dwarf-extended-content-types-encoding
893 ============================ ====================
895 ============================ ====================
896 ``DW_LNCT_LLVM_source`` 0x2001
897 ============================ ====================
899 .. _amdgpu-code-conventions:
904 This section provides code conventions used for each supported target triple OS
905 (see :ref:`amdgpu-target-triples`).
910 This section provides code conventions used when the target triple OS is
911 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
913 .. _amdgpu-amdhsa-hsa-code-object-metadata:
918 The code object metadata specifies extensible metadata associated with the code
919 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
920 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
921 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
922 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
923 information necessary to support the ROCM kernel queries. For example, the
924 segment sizes needed in a dispatch packet. In addition, a high level language
925 runtime may require other information to be included. For example, the AMD
926 OpenCL runtime records kernel argument information.
928 The metadata is specified as a YAML formatted string (see [YAML]_ and
932 Is the string null terminated? It probably should not if YAML allows it to
933 contain null characters, otherwise it should be.
935 The metadata is represented as a single YAML document comprised of the mapping
936 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
939 For boolean values, the string values of ``false`` and ``true`` are used for
940 false and true respectively.
942 Additional information can be added to the mappings. To avoid conflicts, any
943 non-AMD key names should be prefixed by "*vendor-name*.".
945 .. table:: AMDHSA Code Object Metadata Mapping
946 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
948 ========== ============== ========= =======================================
949 String Key Value Type Required? Description
950 ========== ============== ========= =======================================
951 "Version" sequence of Required - The first integer is the major
952 2 integers version. Currently 1.
953 - The second integer is the minor
954 version. Currently 0.
955 "Printf" sequence of Each string is encoded information
956 strings about a printf function call. The
957 encoded information is organized as
958 fields separated by colon (':'):
960 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
965 A 32 bit integer as a unique id for
966 each printf function call
969 A 32 bit integer equal to the number
970 of arguments of printf function call
973 ``S[i]`` (where i = 0, 1, ... , N-1)
974 32 bit integers for the size in bytes
975 of the i-th FormatString argument of
976 the printf function call
979 The format string passed to the
980 printf function call.
981 "Kernels" sequence of Required Sequence of the mappings for each
982 mapping kernel in the code object. See
983 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
984 for the definition of the mapping.
985 ========== ============== ========= =======================================
989 .. table:: AMDHSA Code Object Kernel Metadata Mapping
990 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
992 ================= ============== ========= ================================
993 String Key Value Type Required? Description
994 ================= ============== ========= ================================
995 "Name" string Required Source name of the kernel.
996 "SymbolName" string Required Name of the kernel
997 descriptor ELF symbol.
998 "Language" string Source language of the kernel.
1006 "LanguageVersion" sequence of - The first integer is the major
1008 - The second integer is the
1010 "Attrs" mapping Mapping of kernel attributes.
1012 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1013 for the mapping definition.
1014 "Args" sequence of Sequence of mappings of the
1015 mapping kernel arguments. See
1016 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1017 for the definition of the mapping.
1018 "CodeProps" mapping Mapping of properties related to
1019 the kernel code. See
1020 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1021 for the mapping definition.
1022 ================= ============== ========= ================================
1026 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1027 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1029 =================== ============== ========= ==============================
1030 String Key Value Type Required? Description
1031 =================== ============== ========= ==============================
1032 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1033 3 integers must be >=1 and the dispatch
1034 work-group size X, Y, Z must
1035 correspond to the specified
1036 values. Defaults to 0, 0, 0.
1038 Corresponds to the OpenCL
1039 ``reqd_work_group_size``
1041 "WorkGroupSizeHint" sequence of The dispatch work-group size
1042 3 integers X, Y, Z is likely to be the
1045 Corresponds to the OpenCL
1046 ``work_group_size_hint``
1048 "VecTypeHint" string The name of a scalar or vector
1051 Corresponds to the OpenCL
1052 ``vec_type_hint`` attribute.
1054 "RuntimeHandle" string The external symbol name
1055 associated with a kernel.
1056 OpenCL runtime allocates a
1057 global buffer for the symbol
1058 and saves the kernel's address
1059 to it, which is used for
1060 device side enqueueing. Only
1061 available for device side
1063 =================== ============== ========= ==============================
1067 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1068 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1070 ================= ============== ========= ================================
1071 String Key Value Type Required? Description
1072 ================= ============== ========= ================================
1073 "Name" string Kernel argument name.
1074 "TypeName" string Kernel argument type name.
1075 "Size" integer Required Kernel argument size in bytes.
1076 "Align" integer Required Kernel argument alignment in
1077 bytes. Must be a power of two.
1078 "ValueKind" string Required Kernel argument kind that
1079 specifies how to set up the
1080 corresponding argument.
1084 The argument is copied
1085 directly into the kernarg.
1088 A global address space pointer
1089 to the buffer data is passed
1092 "DynamicSharedPointer"
1093 A group address space pointer
1094 to dynamically allocated LDS
1095 is passed in the kernarg.
1098 A global address space
1099 pointer to a S# is passed in
1103 A global address space
1104 pointer to a T# is passed in
1108 A global address space pointer
1109 to an OpenCL pipe is passed in
1113 A global address space pointer
1114 to an OpenCL device enqueue
1115 queue is passed in the
1118 "HiddenGlobalOffsetX"
1119 The OpenCL grid dispatch
1120 global offset for the X
1121 dimension is passed in the
1124 "HiddenGlobalOffsetY"
1125 The OpenCL grid dispatch
1126 global offset for the Y
1127 dimension is passed in the
1130 "HiddenGlobalOffsetZ"
1131 The OpenCL grid dispatch
1132 global offset for the Z
1133 dimension is passed in the
1137 An argument that is not used
1138 by the kernel. Space needs to
1139 be left for it, but it does
1140 not need to be set up.
1142 "HiddenPrintfBuffer"
1143 A global address space pointer
1144 to the runtime printf buffer
1145 is passed in kernarg.
1147 "HiddenDefaultQueue"
1148 A global address space pointer
1149 to the OpenCL device enqueue
1150 queue that should be used by
1151 the kernel by default is
1152 passed in the kernarg.
1154 "HiddenCompletionAction"
1155 A global address space pointer
1156 to help link enqueued kernels into
1157 the ancestor tree for determining
1158 when the parent kernel has finished.
1160 "ValueType" string Required Kernel argument value type. Only
1161 present if "ValueKind" is
1162 "ByValue". For vector data
1163 types, the value is for the
1164 element type. Values include:
1180 How can it be determined if a
1181 vector type, and what size
1183 "PointeeAlign" integer Alignment in bytes of pointee
1184 type for pointer type kernel
1185 argument. Must be a power
1186 of 2. Only present if
1188 "DynamicSharedPointer".
1189 "AddrSpaceQual" string Kernel argument address space
1190 qualifier. Only present if
1191 "ValueKind" is "GlobalBuffer" or
1192 "DynamicSharedPointer". Values
1203 Is GlobalBuffer only Global
1205 DynamicSharedPointer always
1206 Local? Can HCC allow Generic?
1207 How can Private or Region
1209 "AccQual" string Kernel argument access
1210 qualifier. Only present if
1211 "ValueKind" is "Image" or
1222 "ActualAccQual" string The actual memory accesses
1223 performed by the kernel on the
1224 kernel argument. Only present if
1225 "ValueKind" is "GlobalBuffer",
1226 "Image", or "Pipe". This may be
1227 more restrictive than indicated
1228 by "AccQual" to reflect what the
1229 kernel actual does. If not
1230 present then the runtime must
1231 assume what is implied by
1232 "AccQual" and "IsConst". Values
1239 "IsConst" boolean Indicates if the kernel argument
1240 is const qualified. Only present
1244 "IsRestrict" boolean Indicates if the kernel argument
1245 is restrict qualified. Only
1246 present if "ValueKind" is
1249 "IsVolatile" boolean Indicates if the kernel argument
1250 is volatile qualified. Only
1251 present if "ValueKind" is
1254 "IsPipe" boolean Indicates if the kernel argument
1255 is pipe qualified. Only present
1256 if "ValueKind" is "Pipe".
1259 Can GlobalBuffer be pipe
1261 ================= ============== ========= ================================
1265 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1266 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1268 ============================ ============== ========= =====================
1269 String Key Value Type Required? Description
1270 ============================ ============== ========= =====================
1271 "KernargSegmentSize" integer Required The size in bytes of
1273 that holds the values
1276 "GroupSegmentFixedSize" integer Required The amount of group
1280 bytes. This does not
1282 dynamically allocated
1283 group segment memory
1287 "PrivateSegmentFixedSize" integer Required The amount of fixed
1288 private address space
1289 memory required for a
1291 bytes. If the kernel
1293 stack then additional
1295 to this value for the
1297 "KernargSegmentAlign" integer Required The maximum byte
1300 kernarg segment. Must
1302 "WavefrontSize" integer Required Wavefront size. Must
1304 "NumSGPRs" integer Required Number of scalar
1308 includes the special
1314 SGPR added if a trap
1320 "NumVGPRs" integer Required Number of vector
1324 "MaxFlatWorkGroupSize" integer Required Maximum flat
1327 kernel in work-items.
1330 ReqdWorkGroupSize if
1332 "NumSpilledSGPRs" integer Number of stores from
1333 a scalar register to
1334 a register allocator
1337 "NumSpilledVGPRs" integer Number of stores from
1338 a vector register to
1339 a register allocator
1342 ============================ ============== ========= =====================
1349 The HSA architected queuing language (AQL) defines a user space memory interface
1350 that can be used to control the dispatch of kernels, in an agent independent
1351 way. An agent can have zero or more AQL queues created for it using the ROCm
1352 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1353 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1354 mechanics and packet layouts.
1356 The packet processor of a kernel agent is responsible for detecting and
1357 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1358 packet processor is implemented by the hardware command processor (CP),
1359 asynchronous dispatch controller (ADC) and shader processor input controller
1362 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1363 mode driver to initialize and register the AQL queue with CP.
1365 To dispatch a kernel the following actions are performed. This can occur in the
1366 CPU host program, or from an HSA kernel executing on a GPU.
1368 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1369 executed is obtained.
1370 2. A pointer to the kernel descriptor (see
1371 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1372 obtained. It must be for a kernel that is contained in a code object that that
1373 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1375 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1376 for a memory region with the kernarg property for the kernel agent that will
1377 execute the kernel. It must be at least 16 byte aligned.
1378 4. Kernel argument values are assigned to the kernel argument memory
1379 allocation. The layout is defined in the *HSA Programmer's Language Reference*
1380 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1381 memory in the same way constant memory is accessed. (Note that the HSA
1382 specification allows an implementation to copy the kernel argument contents to
1383 another location that is accessed by the kernel.)
1384 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1385 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1386 packet. The packet must be set up, and the final write must use an atomic
1387 store release to set the packet kind to ensure the packet contents are
1388 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1389 notify the kernel agent that the AQL queue has been updated. These rules, and
1390 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1391 System Architecture Specification* [HSA]_.
1392 6. A kernel dispatch packet includes information about the actual dispatch,
1393 such as grid and work-group size, together with information from the code
1394 object about the kernel, such as segment sizes. The ROCm runtime queries on
1395 the kernel symbol can be used to obtain the code object values which are
1396 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1397 7. CP executes micro-code and is responsible for detecting and setting up the
1398 GPU to execute the wavefronts of a kernel dispatch.
1399 8. CP ensures that when the a wavefront starts executing the kernel machine
1400 code, the scalar general purpose registers (SGPR) and vector general purpose
1401 registers (VGPR) are set up as required by the machine code. The required
1402 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1403 register state is defined in
1404 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1405 9. The prolog of the kernel machine code (see
1406 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1407 before continuing executing the machine code that corresponds to the kernel.
1408 10. When the kernel dispatch has completed execution, CP signals the completion
1409 signal specified in the kernel dispatch packet if not 0.
1411 .. _amdgpu-amdhsa-memory-spaces:
1416 The memory space properties are:
1418 .. table:: AMDHSA Memory Spaces
1419 :name: amdgpu-amdhsa-memory-spaces-table
1421 ================= =========== ======== ======= ==================
1422 Memory Space Name HSA Segment Hardware Address NULL Value
1424 ================= =========== ======== ======= ==================
1425 Private private scratch 32 0x00000000
1426 Local group LDS 32 0xFFFFFFFF
1427 Global global global 64 0x0000000000000000
1428 Constant constant *same as 64 0x0000000000000000
1430 Generic flat flat 64 0x0000000000000000
1431 Region N/A GDS 32 *not implemented
1433 ================= =========== ======== ======= ==================
1435 The global and constant memory spaces both use global virtual addresses, which
1436 are the same virtual address space used by the CPU. However, some virtual
1437 addresses may only be accessible to the CPU, some only accessible by the GPU,
1440 Using the constant memory space indicates that the data will not change during
1441 the execution of the kernel. This allows scalar read instructions to be
1442 used. The vector and scalar L1 caches are invalidated of volatile data before
1443 each kernel dispatch execution to allow constant memory to change values between
1446 The local memory space uses the hardware Local Data Store (LDS) which is
1447 automatically allocated when the hardware creates work-groups of wavefronts, and
1448 freed when all the wavefronts of a work-group have terminated. The data store
1449 (DS) instructions can be used to access it.
1451 The private memory space uses the hardware scratch memory support. If the kernel
1452 uses scratch, then the hardware allocates memory that is accessed using
1453 wavefront lane dword (4 byte) interleaving. The mapping used from private
1454 address to physical address is:
1456 ``wavefront-scratch-base +
1457 (private-address * wavefront-size * 4) +
1458 (wavefront-lane-id * 4)``
1460 There are different ways that the wavefront scratch base address is determined
1461 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1462 memory can be accessed in an interleaved manner using buffer instruction with
1463 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1464 instructions, or by flat instructions. If each lane of a wavefront accesses the
1465 same private address, the interleaving results in adjacent dwords being accessed
1466 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1467 supported except by flat and scratch instructions in GFX9.
1469 The generic address space uses the hardware flat address support available in
1470 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1471 local appertures), that are outside the range of addressible global memory, to
1472 map from a flat address to a private or local address.
1474 FLAT instructions can take a flat address and access global, private (scratch)
1475 and group (LDS) memory depending in if the address is within one of the
1476 apperture ranges. Flat access to scratch requires hardware aperture setup and
1477 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1478 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1479 (see :ref:`amdgpu-amdhsa-m0`).
1481 To convert between a segment address and a flat address the base address of the
1482 appertures address can be used. For GFX7-GFX8 these are available in the
1483 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1484 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1485 GFX9 the appature base addresses are directly available as inline constant
1486 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1487 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1488 which makes it easier to convert from flat to segment or segment to flat.
1493 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1494 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1495 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1496 enumeration values for the queries that are not trivially deducible from the S#
1502 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1503 structure allocated in memory accessible from both the CPU and GPU. The
1504 structure is defined by the ROCm runtime and subject to change between releases
1505 (see [AMD-ROCm-github]_).
1507 .. _amdgpu-amdhsa-hsa-aql-queue:
1512 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1513 between releases (see [AMD-ROCm-github]_). For some processors it contains
1514 fields needed to implement certain language features such as the flat address
1515 aperture bases. It also contains fields used by CP such as managing the
1516 allocation of scratch memory.
1518 .. _amdgpu-amdhsa-kernel-descriptor:
1523 A kernel descriptor consists of the information needed by CP to initiate the
1524 execution of a kernel, including the entry point address of the machine code
1525 that implements the kernel.
1527 Kernel Descriptor for GFX6-GFX9
1528 +++++++++++++++++++++++++++++++
1530 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1532 .. table:: Kernel Descriptor for GFX6-GFX9
1533 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1535 ======= ======= =============================== ============================
1536 Bits Size Field Name Description
1537 ======= ======= =============================== ============================
1538 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
1539 address space memory
1540 required for a work-group
1541 in bytes. This does not
1542 include any dynamically
1543 allocated local address
1544 space memory that may be
1545 added when the kernel is
1547 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
1548 private address space
1549 memory required for a
1550 work-item in bytes. If
1551 is_dynamic_callstack is 1
1552 then additional space must
1553 be added to this value for
1555 127:64 8 bytes Reserved, must be 0.
1556 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
1559 descriptor to kernel's
1560 entry point instruction
1561 which must be 256 byte
1563 383:192 24 Reserved, must be 0.
1565 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
1566 program settings used by
1568 ``COMPUTE_PGM_RSRC1``
1571 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1572 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
1573 program settings used by
1575 ``COMPUTE_PGM_RSRC2``
1578 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1579 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1580 SGPR user data registers
1582 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1584 The total number of SGPR
1586 requested must not exceed
1587 16 and match value in
1588 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1589 Any requests beyond 16
1591 449 1 bit EnableSGPRDispatchPtr *see above*
1592 450 1 bit EnableSGPRQueuePtr *see above*
1593 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1594 452 1 bit EnableSGPRDispatchID *see above*
1595 453 1 bit EnableSGPRFlatScratchInit *see above*
1596 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1597 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1599 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1601 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1603 463:458 6 bits Reserved, must be 0.
1604 511:464 6 Reserved, must be 0.
1606 512 **Total size 64 bytes.**
1607 ======= ====================================================================
1611 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1612 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1614 ======= ======= =============================== ===========================================================================
1615 Bits Size Field Name Description
1616 ======= ======= =============================== ===========================================================================
1617 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
1618 used by each work-item,
1619 granularity is device
1624 - roundup((max_vgpg + 1)
1627 Used by CP to set up
1628 ``COMPUTE_PGM_RSRC1.VGPRS``.
1629 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1630 used by a wavefront,
1631 granularity is device
1636 - roundup((max_sgpg + 1)
1640 - roundup((max_sgpg + 1)
1643 Includes the special SGPRs
1644 for VCC, Flat Scratch (for
1645 GFX7 onwards) and XNACK
1646 (for GFX8 onwards). It does
1647 not include the 16 SGPR
1648 added if a trap handler is
1651 Used by CP to set up
1652 ``COMPUTE_PGM_RSRC1.SGPRS``.
1653 11:10 2 bits PRIORITY Must be 0.
1655 Start executing wavefront
1656 at the specified priority.
1658 CP is responsible for
1660 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1661 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1662 with specified rounding
1665 precision floating point
1668 Floating point rounding
1669 mode values are defined in
1670 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1672 Used by CP to set up
1673 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1674 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1675 with specified rounding
1676 denorm mode for half/double (16
1677 and 64 bit) floating point
1678 precision floating point
1681 Floating point rounding
1682 mode values are defined in
1683 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1685 Used by CP to set up
1686 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1687 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1688 with specified denorm mode
1691 precision floating point
1694 Floating point denorm mode
1695 values are defined in
1696 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1700 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1701 with specified denorm mode
1703 and 64 bit) floating point
1704 precision floating point
1707 Floating point denorm mode
1708 values are defined in
1709 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1711 Used by CP to set up
1712 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1713 20 1 bit PRIV Must be 0.
1715 Start executing wavefront
1716 in privilege trap handler
1719 CP is responsible for
1721 ``COMPUTE_PGM_RSRC1.PRIV``.
1722 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1723 with DX10 clamp mode
1724 enabled. Used by the vector
1725 ALU to force DX10 style
1726 treatment of NaN's (when
1727 set, clamp NaN to zero,
1731 Used by CP to set up
1732 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1733 22 1 bit DEBUG_MODE Must be 0.
1735 Start executing wavefront
1736 in single step mode.
1738 CP is responsible for
1740 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1741 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1743 enabled. Floating point
1744 opcodes that support
1745 exception flag gathering
1746 will quiet and propagate
1747 signaling-NaN inputs per
1748 IEEE 754-2008. Min_dx10 and
1749 max_dx10 become IEEE
1750 754-2008 compliant due to
1751 signaling-NaN propagation
1754 Used by CP to set up
1755 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1756 24 1 bit BULKY Must be 0.
1758 Only one work-group allowed
1759 to execute on a compute
1762 CP is responsible for
1764 ``COMPUTE_PGM_RSRC1.BULKY``.
1765 25 1 bit CDBG_USER Must be 0.
1767 Flag that can be used to
1768 control debugging code.
1770 CP is responsible for
1772 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1773 26 1 bit FP16_OVFL GFX6-GFX8
1774 Reserved, must be 0.
1776 Wavefront starts execution
1777 with specified fp16 overflow
1780 - If 0, fp16 overflow generates
1782 - If 1, fp16 overflow that is the
1783 result of an +/-INF input value
1784 or divide by 0 produces a +/-INF,
1785 otherwise clamps computed
1786 overflow to +/-MAX_FP16 as
1789 Used by CP to set up
1790 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1791 31:27 5 bits Reserved, must be 0.
1792 32 **Total size 4 bytes**
1793 ======= ===================================================================================================================
1797 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1798 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1800 ======= ======= =============================== ===========================================================================
1801 Bits Size Field Name Description
1802 ======= ======= =============================== ===========================================================================
1803 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1804 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
1805 system register (see
1806 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1808 Used by CP to set up
1809 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1810 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1812 requested. This number must
1813 match the number of user
1814 data registers enabled.
1816 Used by CP to set up
1817 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1818 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
1819 TRAP instruction which
1820 requires a trap handler to
1824 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1826 installed a trap handler
1827 regardless of the setting
1829 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1830 system SGPR register for
1831 the work-group id in the X
1833 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1835 Used by CP to set up
1836 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1837 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1838 system SGPR register for
1839 the work-group id in the Y
1841 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1843 Used by CP to set up
1844 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1845 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1846 system SGPR register for
1847 the work-group id in the Z
1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1851 Used by CP to set up
1852 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1853 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1854 system SGPR register for
1855 work-group information (see
1856 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1858 Used by CP to set up
1859 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1860 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1861 VGPR system registers used
1862 for the work-item ID.
1863 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1866 Used by CP to set up
1867 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1868 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1870 Wavefront starts execution
1872 exceptions enabled which
1873 are generated when L1 has
1874 witnessed a thread access
1878 CP is responsible for
1879 filling in the address
1881 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1882 according to what the
1884 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1886 Wavefront starts execution
1887 with memory violation
1888 exceptions exceptions
1889 enabled which are generated
1890 when a memory violation has
1891 occurred for this wavefront from
1893 (write-to-read-only-memory,
1894 mis-aligned atomic, LDS
1895 address out of range,
1896 illegal address, etc.).
1900 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1901 according to what the
1903 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
1905 CP uses the rounded value
1906 from the dispatch packet,
1907 not this value, as the
1908 dispatch may contain
1909 dynamically allocated group
1910 segment memory. CP writes
1912 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1914 Amount of group segment
1915 (LDS) to allocate for each
1916 work-group. Granularity is
1920 roundup(lds-size / (64 * 4))
1922 roundup(lds-size / (128 * 4))
1924 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1925 _INVALID_OPERATION with specified exceptions
1928 Used by CP to set up
1929 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1930 (set from bits 0..6).
1934 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1935 _SOURCE input operands is a
1937 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1938 _DIVISION_BY_ZERO Zero
1939 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1941 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1943 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1945 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1946 _ZERO (rcp_iflag_f32 instruction
1948 31 1 bit Reserved, must be 0.
1949 32 **Total size 4 bytes.**
1950 ======= ===================================================================================================================
1954 .. table:: Floating Point Rounding Mode Enumeration Values
1955 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1957 ====================================== ===== ==============================
1958 Enumeration Name Value Description
1959 ====================================== ===== ==============================
1960 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1961 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1962 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1963 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1964 ====================================== ===== ==============================
1968 .. table:: Floating Point Denorm Mode Enumeration Values
1969 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1971 ====================================== ===== ==============================
1972 Enumeration Name Value Description
1973 ====================================== ===== ==============================
1974 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1976 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1977 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1978 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1979 ====================================== ===== ==============================
1983 .. table:: System VGPR Work-Item ID Enumeration Values
1984 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1986 ======================================== ===== ============================
1987 Enumeration Name Value Description
1988 ======================================== ===== ============================
1989 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1991 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1993 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1995 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1996 ======================================== ===== ============================
1998 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2000 Initial Kernel Execution State
2001 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2003 This section defines the register state that will be set up by the packet
2004 processor prior to the start of execution of every wavefront. This is limited by
2005 the constraints of the hardware controllers of CP/ADC/SPI.
2007 The order of the SGPR registers is defined, but the compiler can specify which
2008 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2009 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2010 for enabled registers are dense starting at SGPR0: the first enabled register is
2011 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2014 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2015 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2016 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2017 initialized. These are then immediately followed by the System SGPRs that are
2018 set up by ADC/SPI and can have different values for each wavefront of the grid
2021 SGPR register initial state is defined in
2022 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2024 .. table:: SGPR Register Set Up Order
2025 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2027 ========== ========================== ====== ==============================
2028 SGPR Order Name Number Description
2029 (kernel descriptor enable of
2031 ========== ========================== ====== ==============================
2032 First Private Segment Buffer 4 V# that can be used, together
2033 (enable_sgpr_private with Scratch Wavefront Offset
2034 _segment_buffer) as an offset, to access the
2035 private memory space using a
2038 CP uses the value provided by
2040 then Dispatch Ptr 2 64 bit address of AQL dispatch
2041 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2043 then Queue Ptr 2 64 bit address of amd_queue_t
2044 (enable_sgpr_queue_ptr) object for AQL queue on which
2045 the dispatch packet was
2047 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2048 (enable_sgpr_kernarg segment. This is directly
2049 _segment_ptr) copied from the
2050 kernarg_address in the kernel
2053 Having CP load it once avoids
2054 loading it at the beginning of
2056 then Dispatch Id 2 64 bit Dispatch ID of the
2057 (enable_sgpr_dispatch_id) dispatch packet being
2059 then Flat Scratch Init 2 This is 2 SGPRs:
2060 (enable_sgpr_flat_scratch
2064 The first SGPR is a 32 bit
2066 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2067 to per SPI base of memory
2068 for scratch for the queue
2069 executing the kernel
2070 dispatch. CP obtains this
2071 from the runtime. (The
2072 Scratch Segment Buffer base
2074 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2075 plus this offset.) The value
2076 of Scratch Wavefront Offset must
2077 be added to this offset by
2078 the kernel machine code,
2079 right shifted by 8, and
2080 moved to the FLAT_SCRATCH_HI
2082 FLAT_SCRATCH_HI corresponds
2083 to SGPRn-4 on GFX7, and
2084 SGPRn-6 on GFX8 (where SGPRn
2085 is the highest numbered SGPR
2086 allocated to the wavefront).
2088 multiplied by 256 (as it is
2089 in units of 256 bytes) and
2091 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2092 to calculate the per wavefront
2093 FLAT SCRATCH BASE in flat
2094 memory instructions that
2098 The second SGPR is 32 bit
2099 byte size of a single
2100 work-item's scratch memory
2101 usage. CP obtains this from
2102 the runtime, and it is
2103 always a multiple of DWORD.
2104 CP checks that the value in
2105 the kernel dispatch packet
2106 Private Segment Byte Size is
2107 not larger, and requests the
2108 runtime to increase the
2109 queue's scratch size if
2110 necessary. The kernel code
2112 FLAT_SCRATCH_LO which is
2113 SGPRn-3 on GFX7 and SGPRn-5
2114 on GFX8. FLAT_SCRATCH_LO is
2115 used as the FLAT SCRATCH
2117 instructions. Having CP load
2118 it once avoids loading it at
2119 the beginning of every
2123 64 bit base address of the
2124 per SPI scratch backing
2125 memory managed by SPI for
2126 the queue executing the
2127 kernel dispatch. CP obtains
2128 this from the runtime (and
2129 divides it if there are
2130 multiple Shader Arrays each
2131 with its own SPI). The value
2132 of Scratch Wavefront Offset must
2133 be added by the kernel
2134 machine code and the result
2135 moved to the FLAT_SCRATCH
2136 SGPR which is SGPRn-6 and
2137 SGPRn-5. It is used as the
2138 FLAT SCRATCH BASE in flat
2139 memory instructions.
2140 then Private Segment Size 1 The 32 bit byte size of a
2141 (enable_sgpr_private single
2143 scratch_segment_size) memory
2144 allocation. This is the
2145 value from the kernel
2146 dispatch packet Private
2147 Segment Byte Size rounded up
2148 by CP to a multiple of
2151 Having CP load it once avoids
2152 loading it at the beginning of
2155 This is not used for
2156 GFX7-GFX8 since it is the same
2157 value as the second SGPR of
2158 Flat Scratch Init. However, it
2159 may be needed for GFX9 which
2160 changes the meaning of the
2161 Flat Scratch Init value.
2162 then Grid Work-Group Count X 1 32 bit count of the number of
2163 (enable_sgpr_grid work-groups in the X dimension
2164 _workgroup_count_X) for the grid being
2165 executed. Computed from the
2166 fields in the kernel dispatch
2167 packet as ((grid_size.x +
2168 workgroup_size.x - 1) /
2170 then Grid Work-Group Count Y 1 32 bit count of the number of
2171 (enable_sgpr_grid work-groups in the Y dimension
2172 _workgroup_count_Y && for the grid being
2173 less than 16 previous executed. Computed from the
2174 SGPRs) fields in the kernel dispatch
2175 packet as ((grid_size.y +
2176 workgroup_size.y - 1) /
2179 Only initialized if <16
2180 previous SGPRs initialized.
2181 then Grid Work-Group Count Z 1 32 bit count of the number of
2182 (enable_sgpr_grid work-groups in the Z dimension
2183 _workgroup_count_Z && for the grid being
2184 less than 16 previous executed. Computed from the
2185 SGPRs) fields in the kernel dispatch
2186 packet as ((grid_size.z +
2187 workgroup_size.z - 1) /
2190 Only initialized if <16
2191 previous SGPRs initialized.
2192 then Work-Group Id X 1 32 bit work-group id in X
2193 (enable_sgpr_workgroup_id dimension of grid for
2195 then Work-Group Id Y 1 32 bit work-group id in Y
2196 (enable_sgpr_workgroup_id dimension of grid for
2198 then Work-Group Id Z 1 32 bit work-group id in Z
2199 (enable_sgpr_workgroup_id dimension of grid for
2201 then Work-Group Info 1 {first_wavefront, 14'b0000,
2202 (enable_sgpr_workgroup ordered_append_term[10:0],
2203 _info) threadgroup_size_in_wavefronts[5:0]}
2204 then Scratch Wavefront Offset 1 32 bit byte offset from base
2205 (enable_sgpr_private of scratch base of queue
2206 _segment_wavefront_offset) executing the kernel
2207 dispatch. Must be used as an
2209 segment address when using
2210 Scratch Segment Buffer. It
2211 must be used to set up FLAT
2212 SCRATCH for flat addressing
2214 :ref:`amdgpu-amdhsa-flat-scratch`).
2215 ========== ========================== ====== ==============================
2217 The order of the VGPR registers is defined, but the compiler can specify which
2218 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2219 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2220 for enabled registers are dense starting at VGPR0: the first enabled register is
2221 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2224 VGPR register initial state is defined in
2225 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2227 .. table:: VGPR Register Set Up Order
2228 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2230 ========== ========================== ====== ==============================
2231 VGPR Order Name Number Description
2232 (kernel descriptor enable of
2234 ========== ========================== ====== ==============================
2235 First Work-Item Id X 1 32 bit work item id in X
2236 (Always initialized) dimension of work-group for
2238 then Work-Item Id Y 1 32 bit work item id in Y
2239 (enable_vgpr_workitem_id dimension of work-group for
2240 > 0) wavefront lane.
2241 then Work-Item Id Z 1 32 bit work item id in Z
2242 (enable_vgpr_workitem_id dimension of work-group for
2243 > 1) wavefront lane.
2244 ========== ========================== ====== ==============================
2246 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2248 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2250 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2251 combination including none.
2252 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2253 its value cannot included with the flat scratch init value which is per queue.
2254 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2257 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2258 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2260 The global segment can be accessed either using buffer instructions (GFX6 which
2261 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2262 instructions (GFX9).
2264 If buffer operations are used then the compiler can generate a V# with the
2265 following properties:
2269 * ATC: 1 if IOMMU present (such as APU)
2271 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2272 APU and NC for dGPU).
2274 .. _amdgpu-amdhsa-kernel-prolog:
2279 .. _amdgpu-amdhsa-m0:
2285 The M0 register must be initialized with a value at least the total LDS size
2286 if the kernel may access LDS via DS or flat operations. Total LDS size is
2287 available in dispatch packet. For M0, it is also possible to use maximum
2288 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2291 The M0 register is not used for range checking LDS accesses and so does not
2292 need to be initialized in the prolog.
2294 .. _amdgpu-amdhsa-flat-scratch:
2299 If the kernel may use flat operations to access scratch memory, the prolog code
2300 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2301 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2302 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2305 Flat scratch is not supported.
2308 1. The low word of Flat Scratch Init is 32 bit byte offset from
2309 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2310 being managed by SPI for the queue executing the kernel dispatch. This is
2311 the same value used in the Scratch Segment Buffer V# base address. The
2312 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2313 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2314 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2315 by 8 before moving into FLAT_SCRATCH_LO.
2316 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2317 work-items scratch memory usage. This is directly loaded from the kernel
2318 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2319 DWORD. Having CP load it once avoids loading it at the beginning of every
2320 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2324 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2325 memory being managed by SPI for the queue executing the kernel dispatch. The
2326 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
2327 pair for use as the flat scratch base in flat memory instructions.
2329 .. _amdgpu-amdhsa-memory-model:
2334 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2335 (see :ref:`memmodel`). *The implementation is WIP.*
2338 Update when implementation complete.
2340 The AMDGPU backend supports the memory synchronization scopes specified in
2341 :ref:`amdgpu-memory-scopes`.
2343 The code sequences used to implement the memory model are defined in table
2344 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2346 The sequences specify the order of instructions that a single thread must
2347 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2348 to other memory instructions executed by the same thread. This allows them to be
2349 moved earlier or later which can allow them to be combined with other instances
2350 of the same instruction, or hoisted/sunk out of loops to improve
2351 performance. Only the instructions related to the memory model are given;
2352 additional ``s_waitcnt`` instructions are required to ensure registers are
2353 defined before being used. These may be able to be combined with the memory
2354 model ``s_waitcnt`` instructions as described above.
2356 The AMDGPU backend supports the following memory models:
2358 HSA Memory Model [HSA]_
2359 The HSA memory model uses a single happens-before relation for all address
2360 spaces (see :ref:`amdgpu-address-spaces`).
2361 OpenCL Memory Model [OpenCL]_
2362 The OpenCL memory model which has separate happens-before relations for the
2363 global and local address spaces. Only a fence specifying both global and
2364 local address space, and seq_cst instructions join the relationships. Since
2365 the LLVM ``memfence`` instruction does not allow an address space to be
2366 specified the OpenCL fence has to convervatively assume both local and
2367 global address space was specified. However, optimizations can often be
2368 done to eliminate the additional ``s_waitcnt`` instructions when there are
2369 no intervening memory instructions which access the corresponding address
2370 space. The code sequences in the table indicate what can be omitted for the
2371 OpenCL memory. The target triple environment is used to determine if the
2372 source language is OpenCL (see :ref:`amdgpu-opencl`).
2374 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2377 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2378 termed vector memory operations.
2382 * Each agent has multiple compute units (CU).
2383 * Each CU has multiple SIMDs that execute wavefronts.
2384 * The wavefronts for a single work-group are executed in the same CU but may be
2385 executed by different SIMDs.
2386 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2388 * All LDS operations of a CU are performed as wavefront wide operations in a
2389 global order and involve no caching. Completion is reported to a wavefront in
2391 * The LDS memory has multiple request queues shared by the SIMDs of a
2392 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2393 can be reordered relative to each other, which can result in reordering the
2394 visibility of vector memory operations with respect to LDS operations of other
2395 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2396 ensure synchronization between LDS operations and vector memory operations
2397 between wavefronts of a work-group, but not between operations performed by the
2399 * The vector memory operations are performed as wavefront wide operations and
2400 completion is reported to a wavefront in execution order. The exception is
2401 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2402 vector memory order if they access LDS memory, and out of LDS operation order
2403 if they access global memory.
2404 * The vector memory operations access a single vector L1 cache shared by all
2405 SIMDs a CU. Therefore, no special action is required for coherence between the
2406 lanes of a single wavefront, or for coherence between wavefronts in the same
2407 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2408 executing in different work-groups as they may be executing on different CUs.
2409 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2410 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2411 scalar operations are used in a restricted way so do not impact the memory
2412 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2413 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2415 * The L2 cache has independent channels to service disjoint ranges of virtual
2417 * Each CU has a separate request queue per channel. Therefore, the vector and
2418 scalar memory operations performed by wavefronts executing in different work-groups
2419 (which may be executing on different CUs) of an agent can be reordered
2420 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2421 synchronization between vector memory operations of different CUs. It ensures a
2422 previous vector memory operation has completed before executing a subsequent
2423 vector memory or LDS operation and so can be used to meet the requirements of
2424 acquire and release.
2425 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2426 of virtual addresses can be set up to bypass it to ensure system coherence.
2428 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2429 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2430 memory, atomic memory orderings are not meaningful and all accesses are treated
2433 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2434 scalar memory instructions). Since the constant address space contents do not
2435 change during the execution of a kernel dispatch it is not legal to perform
2436 stores, and atomic memory orderings are not meaningful and all access are
2437 treated as non-atomic.
2439 A memory synchronization scope wider than work-group is not meaningful for the
2440 group (LDS) address space and is treated as work-group.
2442 The memory model does not support the region address space which is treated as
2445 Acquire memory ordering is not meaningful on store atomic instructions and is
2446 treated as non-atomic.
2448 Release memory ordering is not meaningful on load atomic instructions and is
2449 treated a non-atomic.
2451 Acquire-release memory ordering is not meaningful on load or store atomic
2452 instructions and is treated as acquire and release respectively.
2454 AMDGPU backend only uses scalar memory operations to access memory that is
2455 proven to not change during the execution of the kernel dispatch. This includes
2456 constant address space and global address space for program scope const
2457 variables. Therefore the kernel machine code does not have to maintain the
2458 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2459 and vector L1 caches are invalidated between kernel dispatches by CP since
2460 constant address space data may change between kernel dispatch executions. See
2461 :ref:`amdgpu-amdhsa-memory-spaces`.
2463 The one execption is if scalar writes are used to spill SGPR registers. In this
2464 case the AMDGPU backend ensures the memory location used to spill is never
2465 accessed by vector memory operations at the same time. If scalar writes are used
2466 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2467 return since the locations may be used for vector memory instructions by a
2468 future wavefront that uses the same scratch area, or a function call that creates a
2469 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2470 as all scalar writes are write-before-read in the same thread.
2472 Scratch backing memory (which is used for the private address space)
2473 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2474 address space is only accessed by a single thread, and is always
2475 write-before-read, there is never a need to invalidate these entries from the L1
2476 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2477 volatile cache lines.
2479 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2480 to invalidate the L2 cache. This also causes it to be treated as
2481 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2482 (cache coherent) and so the L2 cache will coherent with the CPU and other
2485 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2486 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2488 ============ ============ ============== ========== ===============================
2489 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2490 Ordering Sync Scope Address
2492 ============ ============ ============== ========== ===============================
2494 -----------------------------------------------------------------------------------
2495 load *none* *none* - global - !volatile & !nontemporal
2497 - private 1. buffer/global/flat_load
2499 - volatile & !nontemporal
2501 1. buffer/global/flat_load
2506 1. buffer/global/flat_load
2509 load *none* *none* - local 1. ds_load
2510 store *none* *none* - global - !nontemporal
2512 - private 1. buffer/global/flat_store
2516 1. buffer/global/flat_stote
2519 store *none* *none* - local 1. ds_store
2520 **Unordered Atomic**
2521 -----------------------------------------------------------------------------------
2522 load atomic unordered *any* *any* *Same as non-atomic*.
2523 store atomic unordered *any* *any* *Same as non-atomic*.
2524 atomicrmw unordered *any* *any* *Same as monotonic
2526 **Monotonic Atomic**
2527 -----------------------------------------------------------------------------------
2528 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2529 - wavefront - generic
2531 load atomic monotonic - singlethread - local 1. ds_load
2534 load atomic monotonic - agent - global 1. buffer/global/flat_load
2535 - system - generic glc=1
2536 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2537 - wavefront - generic
2541 store atomic monotonic - singlethread - local 1. ds_store
2544 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2545 - wavefront - generic
2549 atomicrmw monotonic - singlethread - local 1. ds_atomic
2553 -----------------------------------------------------------------------------------
2554 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2557 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2558 load atomic acquire - workgroup - local 1. ds_load
2559 2. s_waitcnt lgkmcnt(0)
2562 - Must happen before
2574 load atomic acquire - workgroup - generic 1. flat_load
2575 2. s_waitcnt lgkmcnt(0)
2578 - Must happen before
2590 load atomic acquire - agent - global 1. buffer/global/flat_load
2592 2. s_waitcnt vmcnt(0)
2594 - Must happen before
2602 3. buffer_wbinvl1_vol
2604 - Must happen before
2614 load atomic acquire - agent - generic 1. flat_load glc=1
2615 - system 2. s_waitcnt vmcnt(0) &
2620 - Must happen before
2623 - Ensures the flat_load
2628 3. buffer_wbinvl1_vol
2630 - Must happen before
2640 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2643 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2644 atomicrmw acquire - workgroup - local 1. ds_atomic
2645 2. waitcnt lgkmcnt(0)
2648 - Must happen before
2661 atomicrmw acquire - workgroup - generic 1. flat_atomic
2662 2. waitcnt lgkmcnt(0)
2665 - Must happen before
2678 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
2679 - system 2. s_waitcnt vmcnt(0)
2681 - Must happen before
2690 3. buffer_wbinvl1_vol
2692 - Must happen before
2702 atomicrmw acquire - agent - generic 1. flat_atomic
2703 - system 2. s_waitcnt vmcnt(0) &
2708 - Must happen before
2717 3. buffer_wbinvl1_vol
2719 - Must happen before
2729 fence acquire - singlethread *none* *none*
2731 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2736 - However, since LLVM
2761 fence-paired-atomic).
2762 - Must happen before
2773 fence-paired-atomic.
2775 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2782 - However, since LLVM
2790 - Could be split into
2799 - s_waitcnt vmcnt(0)
2810 fence-paired-atomic).
2811 - s_waitcnt lgkmcnt(0)
2822 fence-paired-atomic).
2823 - Must happen before
2837 fence-paired-atomic.
2839 2. buffer_wbinvl1_vol
2841 - Must happen before any
2842 following global/generic
2852 -----------------------------------------------------------------------------------
2853 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2856 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2865 - Must happen before
2876 2. buffer/global/flat_store
2877 store atomic release - workgroup - local 1. ds_store
2878 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2887 - Must happen before
2899 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2900 - system - generic vmcnt(0)
2904 - Could be split into
2913 - s_waitcnt vmcnt(0)
2920 - s_waitcnt lgkmcnt(0)
2927 - Must happen before
2938 2. buffer/global/ds/flat_store
2939 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2942 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2951 - Must happen before
2962 2. buffer/global/flat_atomic
2963 atomicrmw release - workgroup - local 1. ds_atomic
2964 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2973 - Must happen before
2985 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2986 - system - generic vmcnt(0)
2990 - Could be split into
2999 - s_waitcnt vmcnt(0)
3006 - s_waitcnt lgkmcnt(0)
3013 - Must happen before
3024 2. buffer/global/ds/flat_atomic
3025 fence release - singlethread *none* *none*
3027 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3032 - However, since LLVM
3053 - Must happen before
3062 fence-paired-atomic).
3069 fence-paired-atomic.
3071 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3082 - However, since LLVM
3097 - Could be split into
3106 - s_waitcnt vmcnt(0)
3113 - s_waitcnt lgkmcnt(0)
3120 - Must happen before
3129 fence-paired-atomic).
3136 fence-paired-atomic.
3138 **Acquire-Release Atomic**
3139 -----------------------------------------------------------------------------------
3140 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3143 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3152 - Must happen before
3163 2. buffer/global/flat_atomic
3164 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3165 2. s_waitcnt lgkmcnt(0)
3168 - Must happen before
3181 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3190 - Must happen before
3202 3. s_waitcnt lgkmcnt(0)
3205 - Must happen before
3218 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3223 - Could be split into
3232 - s_waitcnt vmcnt(0)
3239 - s_waitcnt lgkmcnt(0)
3246 - Must happen before
3257 2. buffer/global/flat_atomic
3258 3. s_waitcnt vmcnt(0)
3260 - Must happen before
3269 4. buffer_wbinvl1_vol
3271 - Must happen before
3281 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3286 - Could be split into
3295 - s_waitcnt vmcnt(0)
3302 - s_waitcnt lgkmcnt(0)
3309 - Must happen before
3321 3. s_waitcnt vmcnt(0) &
3326 - Must happen before
3335 4. buffer_wbinvl1_vol
3337 - Must happen before
3347 fence acq_rel - singlethread *none* *none*
3349 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3369 - Must happen before
3392 acquire-fence-paired-atomic
3413 release-fence-paired-atomic
3414 ). This satisfies the
3418 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3425 - However, since LLVM
3433 - Could be split into
3442 - s_waitcnt vmcnt(0)
3449 - s_waitcnt lgkmcnt(0)
3456 - Must happen before
3461 global/local/generic
3470 acquire-fence-paired-atomic
3482 global/local/generic
3491 release-fence-paired-atomic
3492 ). This satisfies the
3496 2. buffer_wbinvl1_vol
3498 - Must happen before
3512 **Sequential Consistent Atomic**
3513 -----------------------------------------------------------------------------------
3514 load atomic seq_cst - singlethread - global *Same as corresponding
3515 - wavefront - local load atomic acquire,
3516 - generic except must generated
3517 all instructions even
3519 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3534 lgkmcnt(0) and so do
3569 instructions same as
3572 except must generated
3573 all instructions even
3575 load atomic seq_cst - workgroup - local *Same as corresponding
3576 load atomic acquire,
3577 except must generated
3578 all instructions even
3580 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3581 - system - generic vmcnt(0)
3583 - Could be split into
3592 - waitcnt lgkmcnt(0)
3605 lgkmcnt(0) and so do
3656 instructions same as
3659 except must generated
3660 all instructions even
3662 store atomic seq_cst - singlethread - global *Same as corresponding
3663 - wavefront - local store atomic release,
3664 - workgroup - generic except must generated
3665 all instructions even
3667 store atomic seq_cst - agent - global *Same as corresponding
3668 - system - generic store atomic release,
3669 except must generated
3670 all instructions even
3672 atomicrmw seq_cst - singlethread - global *Same as corresponding
3673 - wavefront - local atomicrmw acq_rel,
3674 - workgroup - generic except must generated
3675 all instructions even
3677 atomicrmw seq_cst - agent - global *Same as corresponding
3678 - system - generic atomicrmw acq_rel,
3679 except must generated
3680 all instructions even
3682 fence seq_cst - singlethread *none* *Same as corresponding
3683 - wavefront fence acq_rel,
3684 - workgroup except must generated
3685 - agent all instructions even
3686 - system for OpenCL.*
3687 ============ ============ ============== ========== ===============================
3689 The memory order also adds the single thread optimization constrains defined in
3691 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3693 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3694 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3696 ============ ==============================================================
3697 LLVM Memory Optimization Constraints
3699 ============ ==============================================================
3702 acquire - If a load atomic/atomicrmw then no following load/load
3703 atomic/store/ store atomic/atomicrmw/fence instruction can
3704 be moved before the acquire.
3705 - If a fence then same as load atomic, plus no preceding
3706 associated fence-paired-atomic can be moved after the fence.
3707 release - If a store atomic/atomicrmw then no preceding load/load
3708 atomic/store/ store atomic/atomicrmw/fence instruction can
3709 be moved after the release.
3710 - If a fence then same as store atomic, plus no following
3711 associated fence-paired-atomic can be moved before the
3713 acq_rel Same constraints as both acquire and release.
3714 seq_cst - If a load atomic then same constraints as acquire, plus no
3715 preceding sequentially consistent load atomic/store
3716 atomic/atomicrmw/fence instruction can be moved after the
3718 - If a store atomic then the same constraints as release, plus
3719 no following sequentially consistent load atomic/store
3720 atomic/atomicrmw/fence instruction can be moved before the
3722 - If an atomicrmw/fence then same constraints as acq_rel.
3723 ============ ==============================================================
3728 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3729 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3730 the ``s_trap`` instruction with the following usage:
3732 .. table:: AMDGPU Trap Handler for AMDHSA OS
3733 :name: amdgpu-trap-handler-for-amdhsa-os-table
3735 =================== =============== =============== =======================
3736 Usage Code Sequence Trap Handler Description
3738 =================== =============== =============== =======================
3739 reserved ``s_trap 0x00`` Reserved by hardware.
3740 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3741 ``queue_ptr`` ``debugtrap``
3742 ``VGPR0``: intrinsic (not
3743 ``arg`` implemented).
3744 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3745 ``queue_ptr`` terminated and its
3746 associated queue put
3747 into the error state.
3748 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3749 ``queue_ptr`` installed handled
3750 same as ``llvm.trap``.
3751 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3753 debugger ``s_trap 0x08`` Reserved for debugger.
3754 debugger ``s_trap 0xfe`` Reserved for debugger.
3755 debugger ``s_trap 0xff`` Reserved for debugger.
3756 =================== =============== =============== =======================
3761 This section provides code conventions used when the target triple OS is
3762 empty (see :ref:`amdgpu-target-triples`).
3767 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3768 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3769 instructions are handled as follows:
3771 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3772 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3774 =============== =============== ===========================================
3775 Usage Code Sequence Description
3776 =============== =============== ===========================================
3777 llvm.trap s_endpgm Causes wavefront to be terminated.
3778 llvm.debugtrap *none* Compiler warning given that there is no
3779 trap handler installed.
3780 =============== =============== ===========================================
3790 When generating code for the OpenCL language the target triple environment
3791 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3793 When the language is OpenCL the following differences occur:
3795 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3796 2. The AMDGPU backend adds additional arguments to the kernel.
3797 3. Additional metadata is generated
3798 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3801 Specify what affect this has. Hidden arguments added. Additional metadata
3809 When generating code for the OpenCL language the target triple environment
3810 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3812 When the language is OpenCL the following differences occur:
3814 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3817 Specify what affect this has.
3822 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3823 It supports AMDGCN GFX6-GFX9.
3825 This section describes general syntax for instructions and operands.
3838 An instruction has the following syntax:
3840 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
3842 Note that operands are normally comma-separated while modifiers are space-separated.
3844 The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
3846 See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
3847 :doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
3849 Note that features under development are not included in this description.
3851 For more information about instructions, their semantics and supported combinations of
3852 operands, refer to one of instruction set architecture manuals
3853 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3858 The following syntax for register operands is supported:
3860 * SGPR registers: s0, ... or s[0], ...
3861 * VGPR registers: v0, ... or v[0], ...
3862 * TTMP registers: ttmp0, ... or ttmp[0], ...
3863 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3864 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3865 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3866 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3867 * Register index expressions: v[2*2], s[1-1:2-1]
3868 * 'off' indicates that an operand is not enabled
3873 Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
3875 Instruction Examples
3876 ~~~~~~~~~~~~~~~~~~~~
3881 .. code-block:: nasm
3883 ds_add_u32 v2, v4 offset:16
3884 ds_write_src2_b64 v2 offset0:4 offset1:8
3885 ds_cmpst_f32 v2, v4, v6
3886 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3889 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3894 .. code-block:: nasm
3896 flat_load_dword v1, v[3:4]
3897 flat_store_dwordx3 v[3:4], v[5:7]
3898 flat_atomic_swap v1, v[3:4], v5 glc
3899 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3900 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3902 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3907 .. code-block:: nasm
3909 buffer_load_dword v1, off, s[4:7], s1
3910 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3911 buffer_store_format_xy v[1:2], off, s[4:7], s1
3913 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3915 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3920 .. code-block:: nasm
3922 s_load_dword s1, s[2:3], 0xfc
3923 s_load_dwordx8 s[8:15], s[2:3], s4
3924 s_load_dwordx16 s[88:103], s[2:3], s4
3928 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3933 .. code-block:: nasm
3936 s_mov_b64 s[0:1], 0x80000000
3938 s_wqm_b64 s[2:3], s[4:5]
3939 s_bcnt0_i32_b64 s1, s[2:3]
3940 s_swappc_b64 s[2:3], s[4:5]
3941 s_cbranch_join s[4:5]
3943 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3948 .. code-block:: nasm
3950 s_add_u32 s1, s2, s3
3951 s_and_b64 s[2:3], s[4:5], s[6:7]
3952 s_cselect_b32 s1, s2, s3
3953 s_andn2_b32 s2, s4, s6
3954 s_lshr_b64 s[2:3], s[4:5], s6
3955 s_ashr_i32 s2, s4, s6
3956 s_bfm_b64 s[2:3], s4, s6
3957 s_bfe_i64 s[2:3], s[4:5], s6
3958 s_cbranch_g_fork s[4:5], s[6:7]
3960 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3965 .. code-block:: nasm
3968 s_bitcmp1_b32 s1, s2
3969 s_bitcmp0_b64 s[2:3], s4
3972 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3977 .. code-block:: nasm
3982 s_waitcnt 0 ; Wait for all counters to be 0
3983 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3984 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3988 s_sendmsg sendmsg(MSG_INTERRUPT)
3991 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3993 Unless otherwise mentioned, little verification is performed on the operands
3994 of SOPP Instructions, so it is up to the programmer to be familiar with the
3995 range or acceptable values.
4000 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4001 the assembler will automatically use optimal encoding based on its operands.
4002 To force specific encoding, one can add a suffix to the opcode of the instruction:
4004 * _e32 for 32-bit VOP1/VOP2/VOPC
4005 * _e64 for 64-bit VOP3
4007 * _sdwa for VOP_SDWA
4009 VOP1/VOP2/VOP3/VOPC examples:
4011 .. code-block:: nasm
4014 v_mov_b32_e32 v1, v2
4016 v_cvt_f64_i32_e32 v[1:2], v2
4017 v_floor_f32_e32 v1, v2
4018 v_bfrev_b32_e32 v1, v2
4019 v_add_f32_e32 v1, v2, v3
4020 v_mul_i32_i24_e64 v1, v2, 3
4021 v_mul_i32_i24_e32 v1, -3, v3
4022 v_mul_i32_i24_e32 v1, -100, v3
4023 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4024 v_max_f16_e32 v1, v2, v3
4028 .. code-block:: nasm
4030 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4031 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4032 v_mov_b32 v0, v0 wave_shl:1
4033 v_mov_b32 v0, v0 row_mirror
4034 v_mov_b32 v0, v0 row_bcast:31
4035 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4036 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4037 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4041 .. code-block:: nasm
4043 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4044 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4045 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4046 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4047 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4049 For full list of supported instructions, refer to "Vector ALU instructions".
4051 HSA Code Object Directives
4052 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4054 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4055 one can specify them with assembler directives.
4057 .hsa_code_object_version major, minor
4058 +++++++++++++++++++++++++++++++++++++
4060 *major* and *minor* are integers that specify the version of the HSA code
4061 object that will be generated by the assembler.
4063 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4064 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4067 *major*, *minor*, and *stepping* are all integers that describe the instruction
4068 set architecture (ISA) version of the assembly program.
4070 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
4071 "AMD" and *arch* should always be equal to "AMDGPU".
4073 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4074 from the value of the -mcpu option that is passed to the assembler.
4076 .amdgpu_hsa_kernel (name)
4077 +++++++++++++++++++++++++
4079 This directives specifies that the symbol with given name is a kernel entry point
4080 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4085 This directive marks the beginning of a list of key / value pairs that are used
4086 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4087 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4088 any amd_kernel_code_t values that are unspecified a default value will be
4089 used. The default value for all keys is 0, with the following exceptions:
4091 - *kernel_code_version_major* defaults to 1.
4092 - *machine_kind* defaults to 1.
4093 - *machine_version_major*, *machine_version_minor*, and
4094 *machine_version_stepping* are derived from the value of the -mcpu option
4095 that is passed to the assembler.
4096 - *kernel_code_entry_byte_offset* defaults to 256.
4097 - *wavefront_size* defaults to 6.
4098 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4099 *private_segment_alignment* default to 4. Note that alignments are specified
4100 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4102 The *.amd_kernel_code_t* directive must be placed immediately after the
4103 function label and before any instructions.
4105 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4106 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4108 Here is an example of a minimal amd_kernel_code_t specification:
4110 .. code-block:: none
4112 .hsa_code_object_version 1,0
4113 .hsa_code_object_isa
4118 .amdgpu_hsa_kernel hello_world
4123 enable_sgpr_kernarg_segment_ptr = 1
4125 compute_pgm_rsrc1_vgprs = 0
4126 compute_pgm_rsrc1_sgprs = 0
4127 compute_pgm_rsrc2_user_sgpr = 2
4128 kernarg_segment_byte_size = 8
4129 wavefront_sgpr_count = 2
4130 workitem_vgpr_count = 3
4131 .end_amd_kernel_code_t
4133 s_load_dwordx2 s[0:1], s[0:1] 0x0
4134 v_mov_b32 v0, 3.14159
4135 s_waitcnt lgkmcnt(0)
4138 flat_store_dword v[1:2], v0
4141 .size hello_world, .Lfunc_end0-hello_world
4143 Additional Documentation
4144 ========================
4146 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4147 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4148 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4149 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4150 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4151 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4152 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4153 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4154 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4155 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4156 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4157 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4158 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4159 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4160 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4161 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__