1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
29 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
36 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
39 ============ ==============================================================
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
46 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
49 ============== ============================================================
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
59 ============== ============================================================
61 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
64 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
70 ============ ==============================================================
72 .. _amdgpu-processors:
77 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
78 names from both the *Processor* and *Alternative Processor* can be used.
80 .. table:: AMDGPU Processors
81 :name: amdgpu-processor-table
83 =========== =============== ============ ===== ========= ======= ==================
84 Processor Alternative Target dGPU/ Target ROCm Example
85 Processor Triple APU Features Support Products
86 Architecture Supported
88 =========== =============== ============ ===== ========= ======= ==================
89 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
90 -----------------------------------------------------------------------------------
91 ``r600`` ``r600`` dGPU
92 ``r630`` ``r600`` dGPU
93 ``rs880`` ``r600`` dGPU
94 ``rv670`` ``r600`` dGPU
95 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
96 -----------------------------------------------------------------------------------
97 ``rv710`` ``r600`` dGPU
98 ``rv730`` ``r600`` dGPU
99 ``rv770`` ``r600`` dGPU
100 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
101 -----------------------------------------------------------------------------------
102 ``cedar`` ``r600`` dGPU
103 ``cypress`` ``r600`` dGPU
104 ``juniper`` ``r600`` dGPU
105 ``redwood`` ``r600`` dGPU
106 ``sumo`` ``r600`` dGPU
107 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
108 -----------------------------------------------------------------------------------
109 ``barts`` ``r600`` dGPU
110 ``caicos`` ``r600`` dGPU
111 ``cayman`` ``r600`` dGPU
112 ``turks`` ``r600`` dGPU
113 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
114 -----------------------------------------------------------------------------------
115 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
116 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
120 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
121 -----------------------------------------------------------------------------------
122 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
132 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
136 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
140 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
141 - ``mullins`` - E1-2200
149 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
153 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
154 -----------------------------------------------------------------------------------
155 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
161 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
164 \ ``amdgcn`` APU - xnack - A10-9600P
170 \ ``amdgcn`` APU - xnack - E2-9010
173 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
174 - ``tonga`` [off] - FirePro S7100
181 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
182 [off] - Radeon R9 Fury
186 - Radeon Instinct MI8
187 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
188 [off] - Radeon RX 480
189 - Radeon Instinct MI6
190 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
192 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
194 **GCN GFX9** [AMD-GCN-GFX9]_
195 -----------------------------------------------------------------------------------
196 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
197 [off] Frontier Edition
202 - Radeon Instinct MI25
203 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
208 =========== =============== ============ ===== ========= ======= ==================
210 .. _amdgpu-target-features:
215 Target features control how code is generated to support certain
216 processor specific features. Not all target features are supported by
217 all processors. The runtime must ensure that the features supported by
218 the device used to execute the code match the features enabled when
219 generating the code. A mismatch of features may result in incorrect
220 execution, or a reduction in performance.
222 The target features supported by each processor, and the default value
223 used if not specified explicitly, is listed in
224 :ref:`amdgpu-processor-table`.
226 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232 Enable the ``xnack`` feature.
234 Disable the ``xnack`` feature.
236 .. table:: AMDGPU Target Features
237 :name: amdgpu-target-feature-table
239 ============== ==================================================
240 Target Feature Description
241 ============== ==================================================
242 -m[no-]xnack Enable/disable generating code that has
243 memory clauses that are compatible with
244 having XNACK replay enabled.
246 This is used for demand paging and page
247 migration. If XNACK replay is enabled in
248 the device, then if a page fault occurs
249 the code may execute incorrectly if the
250 ``xnack`` feature is not enabled. Executing
251 code that has the feature enabled on a
252 device that does not have XNACK replay
253 enabled will execute correctly, but may
254 be less performant than code with the
256 ============== ==================================================
258 .. _amdgpu-address-spaces:
263 The AMDGPU backend uses the following address space mappings.
265 The memory space names used in the table, aside from the region memory space, is
266 from the OpenCL standard.
268 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
270 .. table:: Address Space Mapping
271 :name: amdgpu-address-space-mapping-table
273 ================== =================
274 LLVM Address Space Memory Space
275 ================== =================
283 ================== =================
285 .. _amdgpu-memory-scopes:
290 This section provides LLVM memory synchronization scopes supported by the AMDGPU
291 backend memory model when the target triple OS is ``amdhsa`` (see
292 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
294 The memory model supported is based on the HSA memory model [HSA]_ which is
295 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
296 relation is transitive over the synchonizes-with relation independent of scope,
297 and synchonizes-with allows the memory scope instances to be inclusive (see
298 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
300 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
301 inclusion and requires the memory scopes to exactly match. However, this
302 is conservatively correct for OpenCL.
304 .. table:: AMDHSA LLVM Sync Scopes
305 :name: amdgpu-amdhsa-llvm-sync-scopes-table
307 ================ ==========================================================
308 LLVM Sync Scope Description
309 ================ ==========================================================
310 *none* The default: ``system``.
312 Synchronizes with, and participates in modification and
313 seq_cst total orderings with, other operations (except
314 image operations) for all address spaces (except private,
315 or generic that accesses private) provided the other
316 operation's sync scope is:
319 - ``agent`` and executed by a thread on the same agent.
320 - ``workgroup`` and executed by a thread in the same
322 - ``wavefront`` and executed by a thread in the same
325 ``agent`` Synchronizes with, and participates in modification and
326 seq_cst total orderings with, other operations (except
327 image operations) for all address spaces (except private,
328 or generic that accesses private) provided the other
329 operation's sync scope is:
331 - ``system`` or ``agent`` and executed by a thread on the
333 - ``workgroup`` and executed by a thread in the same
335 - ``wavefront`` and executed by a thread in the same
338 ``workgroup`` Synchronizes with, and participates in modification and
339 seq_cst total orderings with, other operations (except
340 image operations) for all address spaces (except private,
341 or generic that accesses private) provided the other
342 operation's sync scope is:
344 - ``system``, ``agent`` or ``workgroup`` and executed by a
345 thread in the same workgroup.
346 - ``wavefront`` and executed by a thread in the same
349 ``wavefront`` Synchronizes with, and participates in modification and
350 seq_cst total orderings with, other operations (except
351 image operations) for all address spaces (except private,
352 or generic that accesses private) provided the other
353 operation's sync scope is:
355 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
356 and executed by a thread in the same wavefront.
358 ``singlethread`` Only synchronizes with, and participates in modification
359 and seq_cst total orderings with, other operations (except
360 image operations) running in the same thread for all
361 address spaces (for example, in signal handlers).
362 ================ ==========================================================
367 The AMDGPU backend implements the following intrinsics.
369 *This section is WIP.*
372 List AMDGPU intrinsics
377 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
378 can be linked by ``lld`` to produce a standard ELF shared code object which can
379 be loaded and executed on an AMDGPU target.
384 The AMDGPU backend uses the following ELF header:
386 .. table:: AMDGPU ELF Header
387 :name: amdgpu-elf-header-table
389 ========================== ===============================
391 ========================== ===============================
392 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
393 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
394 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
395 - ``ELFOSABI_AMDGPU_HSA``
396 - ``ELFOSABI_AMDGPU_PAL``
397 - ``ELFOSABI_AMDGPU_MESA3D``
398 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
399 - ``ELFABIVERSION_AMDGPU_PAL``
400 - ``ELFABIVERSION_AMDGPU_MESA3D``
401 ``e_type`` - ``ET_REL``
403 ``e_machine`` ``EM_AMDGPU``
405 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
406 ========================== ===============================
410 .. table:: AMDGPU ELF Header Enumeration Values
411 :name: amdgpu-elf-header-enumeration-values-table
413 =============================== =====
415 =============================== =====
418 ``ELFOSABI_AMDGPU_HSA`` 64
419 ``ELFOSABI_AMDGPU_PAL`` 65
420 ``ELFOSABI_AMDGPU_MESA3D`` 66
421 ``ELFABIVERSION_AMDGPU_HSA`` 1
422 ``ELFABIVERSION_AMDGPU_PAL`` 0
423 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
424 =============================== =====
426 ``e_ident[EI_CLASS]``
429 * ``ELFCLASS32`` for ``r600`` architecture.
431 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
435 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
437 ``e_ident[EI_OSABI]``
438 One of the following AMD GPU architecture specific OS ABIs
439 (see :ref:`amdgpu-os-table`):
441 * ``ELFOSABI_NONE`` for *unknown* OS.
443 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
445 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
447 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
449 ``e_ident[EI_ABIVERSION]``
450 The ABI version of the AMD GPU architecture specific OS ABI to which the code
453 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
456 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
459 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
463 Can be one of the following values:
467 The type produced by the AMD GPU backend compiler as it is relocatable code
471 The type produced by the linker as it is a shared code object.
473 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
476 The value ``EM_AMDGPU`` is used for the machine for all processors supported
477 by the ``r600`` and ``amdgcn`` architectures (see
478 :ref:`amdgpu-processor-table`). The specific processor is specified in the
479 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
480 :ref:`amdgpu-elf-header-e_flags-table`).
483 The entry point is 0 as the entry points for individual kernels must be
484 selected in order to invoke them through AQL packets.
487 The AMDGPU backend uses the following ELF header flags:
489 .. table:: AMDGPU ELF Header ``e_flags``
490 :name: amdgpu-elf-header-e_flags-table
492 ================================= ========== =============================
493 Name Value Description
494 ================================= ========== =============================
495 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
496 -------------------------------------------- -----------------------------
497 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
499 ``EF_AMDGPU_MACH_xxx`` values
501 :ref:`amdgpu-ef-amdgpu-mach-table`.
502 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
505 contained in the code object.
507 :ref:`amdgpu-target-features`.
508 ================================= ========== =============================
510 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
511 :name: amdgpu-ef-amdgpu-mach-table
513 ================================= ========== =============================
514 Name Value Description (see
515 :ref:`amdgpu-processor-table`)
516 ================================= ========== =============================
517 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
518 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
519 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
520 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
521 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
522 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
523 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
524 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
525 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
526 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
527 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
528 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
529 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
530 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
531 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
532 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
533 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
534 *reserved* 0x011 - Reserved for ``r600``
535 0x01f architecture processors.
536 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
537 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
538 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
539 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
540 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
541 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
542 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
543 *reserved* 0x027 Reserved.
544 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
545 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
546 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
548 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
550 *reserved* 0x02e Reserved.
551 *reserved* 0x02f Reserved.
552 *reserved* 0x030 Reserved.
553 ================================= ========== =============================
558 An AMDGPU target ELF code object has the standard ELF sections which include:
560 .. table:: AMDGPU ELF Sections
561 :name: amdgpu-elf-sections-table
563 ================== ================ =================================
565 ================== ================ =================================
566 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
567 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
568 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
569 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
570 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
571 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
572 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
573 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
574 ``.note`` ``SHT_NOTE`` *none*
575 ``.rela``\ *name* ``SHT_RELA`` *none*
576 ``.rela.dyn`` ``SHT_RELA`` *none*
577 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
578 ``.shstrtab`` ``SHT_STRTAB`` *none*
579 ``.strtab`` ``SHT_STRTAB`` *none*
580 ``.symtab`` ``SHT_SYMTAB`` *none*
581 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
582 ================== ================ =================================
584 These sections have their standard meanings (see [ELF]_) and are only generated
588 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
589 DWARF produced by the AMDGPU backend.
591 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
592 The standard sections used by a dynamic loader.
595 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
598 ``.rela``\ *name*, ``.rela.dyn``
599 For relocatable code objects, *name* is the name of the section that the
600 relocation records apply. For example, ``.rela.text`` is the section name for
601 relocation records associated with the ``.text`` section.
603 For linked shared code objects, ``.rela.dyn`` contains all the relocation
604 records from each of the relocatable code object's ``.rela``\ *name* sections.
606 See :ref:`amdgpu-relocation-records` for the relocation records supported by
610 The executable machine code for the kernels and functions they call. Generated
611 as position independent code. See :ref:`amdgpu-code-conventions` for
612 information on conventions used in the isa generation.
614 .. _amdgpu-note-records:
619 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
620 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
621 aligned. In addition, minimal zero byte padding must be generated to ensure the
622 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
623 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
625 The AMDGPU backend code object uses the following ELF note records in the
626 ``.note`` section. The *Description* column specifies the layout of the note
627 record's ``desc`` field. All fields are consecutive bytes. Note records with
628 variable size strings have a corresponding ``*_size`` field that specifies the
629 number of bytes, including the terminating null character, in the string. The
630 string(s) come immediately after the preceding fields.
632 Additional note records can be present.
634 .. table:: AMDGPU ELF Note Records
635 :name: amdgpu-elf-note-records-table
637 ===== ============================== ======================================
638 Name Type Description
639 ===== ============================== ======================================
640 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
641 ===== ============================== ======================================
645 .. table:: AMDGPU ELF Note Record Enumeration Values
646 :name: amdgpu-elf-note-record-enumeration-values-table
648 ============================== =====
650 ============================== =====
652 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
654 ============================== =====
656 ``NT_AMD_AMDGPU_HSA_METADATA``
657 Specifies extensible metadata associated with the code objects executed on HSA
658 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
659 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
660 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
661 object metadata string.
668 Symbols include the following:
670 .. table:: AMDGPU ELF Symbols
671 :name: amdgpu-elf-symbols-table
673 ===================== ============== ============= ==================
674 Name Type Section Description
675 ===================== ============== ============= ==================
676 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
679 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
680 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
681 ===================== ============== ============= ==================
684 Global variables both used and defined by the compilation unit.
686 If the symbol is defined in the compilation unit then it is allocated in the
687 appropriate section according to if it has initialized data or is readonly.
689 If the symbol is external then its section is ``STN_UNDEF`` and the loader
690 will resolve relocations using the definition provided by another code object
691 or explicitly defined by the runtime.
693 All global symbols, whether defined in the compilation unit or external, are
694 accessed by the machine code indirectly through a GOT table entry. This
695 allows them to be preemptable. The GOT table is only supported when the target
696 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
699 Add description of linked shared object symbols. Seems undefined symbols
700 are marked as STT_NOTYPE.
703 Every HSA kernel has an associated kernel descriptor. It is the address of the
704 kernel descriptor that is used in the AQL dispatch packet used to invoke the
705 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
706 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
709 Every HSA kernel also has a symbol for its machine code entry point.
711 .. _amdgpu-relocation-records:
716 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
717 relocatable fields are:
720 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
721 alignment. These values use the same byte order as other word values in the
722 AMD GPU architecture.
725 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
726 alignment. These values use the same byte order as other word values in the
727 AMD GPU architecture.
729 Following notations are used for specifying relocation calculations:
732 Represents the addend used to compute the value of the relocatable field.
735 Represents the offset into the global offset table at which the relocation
736 entry's symbol will reside during execution.
739 Represents the address of the global offset table.
742 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
743 of the storage unit being relocated (computed using ``r_offset``).
746 Represents the value of the symbol whose index resides in the relocation
747 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
750 Represents the base address of a loaded executable or shared object which is
751 the difference between the ELF address and the actual load address. Relocations
752 using this are only valid in executable or shared objects.
754 The following relocation types are supported:
756 .. table:: AMDGPU ELF Relocation Records
757 :name: amdgpu-elf-relocation-records-table
759 ========================== ======= ===== ========== ==============================
760 Relocation Type Kind Value Field Calculation
761 ========================== ======= ===== ========== ==============================
762 ``R_AMDGPU_NONE`` 0 *none* *none*
763 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
764 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
765 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
766 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
767 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
768 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
769 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
770 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
771 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
772 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
773 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
775 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
776 ========================== ======= ===== ========== ==============================
783 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
784 information that maps the code object executable code and data to the source
785 language constructs. It can be used by tools such as debuggers and profilers.
787 Address Space Mapping
788 ~~~~~~~~~~~~~~~~~~~~~
790 The following address space mapping is used:
792 .. table:: AMDGPU DWARF Address Space Mapping
793 :name: amdgpu-dwarf-address-space-mapping-table
795 =================== =================
796 DWARF Address Space Memory Space
797 =================== =================
802 *omitted* Generic (Flat)
803 *not supported* Region (GDS)
804 =================== =================
806 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
809 An ``address_class`` attribute is generated on pointer type DIEs to specify the
810 DWARF address space of the value of the pointer when it is in the *private* or
811 *local* address space. Otherwise the attribute is omitted.
813 An ``XDEREF`` operation is generated in location list expressions for variables
814 that are allocated in the *private* and *local* address space. Otherwise no
815 ``XDREF`` is omitted.
820 *This section is WIP.*
823 Define DWARF register enumeration.
825 If want to present a wavefront state then should expose vector registers as
826 64 wide (rather than per work-item view that LLVM uses). Either as separate
827 registers, or a 64x4 byte single register. In either case use a new LANE op
828 (akin to XDREF) to select the current lane usage in a location
829 expression. This would also allow scalar register spilling to vector register
830 lanes to be expressed (currently no debug information is being generated for
831 spilling). If choose a wide single register approach then use LANE in
832 conjunction with PIECE operation to select the dword part of the register for
833 the current lane. If the separate register approach then use LANE to select
839 Source text for online-compiled programs (e.g. those compiled by the OpenCL
840 runtime) may be embedded into the DWARF v5 line table using the ``clang
841 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
846 Enable the embedded source DWARF v5 extension.
847 ``-gno-embed-source``
848 Disable the embedded source DWARF v5 extension.
850 .. table:: AMDGPU Debug Options
851 :name: amdgpu-debug-options
853 ==================== ==================================================
854 Debug Flag Description
855 ==================== ==================================================
856 -g[no-]embed-source Enable/disable embedding source text in DWARF
857 debug sections. Useful for environments where
858 source cannot be written to disk, such as
859 when performing online compilation.
860 ==================== ==================================================
862 This option enables one extended content types in the DWARF v5 Line Number
863 Program Header, which is used to encode embedded source.
865 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
866 :name: amdgpu-dwarf-extended-content-types
868 ============================ ======================
870 ============================ ======================
871 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
872 ============================ ======================
874 The source field will contain the UTF-8 encoded, null-terminated source text
875 with ``'\n'`` line endings. When the source field is present, consumers can use
876 the embedded source instead of attempting to discover the source on disk. When
877 the source field is absent, consumers can access the file to get the source
880 The above content type appears in the ``file_name_entry_format`` field of the
881 line table prologue, and its corresponding value appear in the ``file_names``
882 field. The current encoding of the content type is documented in table
883 :ref:`amdgpu-dwarf-extended-content-types-encoding`
885 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
886 :name: amdgpu-dwarf-extended-content-types-encoding
888 ============================ ====================
890 ============================ ====================
891 ``DW_LNCT_LLVM_source`` 0x2001
892 ============================ ====================
894 .. _amdgpu-code-conventions:
899 This section provides code conventions used for each supported target triple OS
900 (see :ref:`amdgpu-target-triples`).
905 This section provides code conventions used when the target triple OS is
906 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
908 .. _amdgpu-amdhsa-hsa-code-object-metadata:
913 The code object metadata specifies extensible metadata associated with the code
914 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
915 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
916 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
917 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
918 information necessary to support the ROCM kernel queries. For example, the
919 segment sizes needed in a dispatch packet. In addition, a high level language
920 runtime may require other information to be included. For example, the AMD
921 OpenCL runtime records kernel argument information.
923 The metadata is specified as a YAML formatted string (see [YAML]_ and
927 Is the string null terminated? It probably should not if YAML allows it to
928 contain null characters, otherwise it should be.
930 The metadata is represented as a single YAML document comprised of the mapping
931 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
934 For boolean values, the string values of ``false`` and ``true`` are used for
935 false and true respectively.
937 Additional information can be added to the mappings. To avoid conflicts, any
938 non-AMD key names should be prefixed by "*vendor-name*.".
940 .. table:: AMDHSA Code Object Metadata Mapping
941 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
943 ========== ============== ========= =======================================
944 String Key Value Type Required? Description
945 ========== ============== ========= =======================================
946 "Version" sequence of Required - The first integer is the major
947 2 integers version. Currently 1.
948 - The second integer is the minor
949 version. Currently 0.
950 "Printf" sequence of Each string is encoded information
951 strings about a printf function call. The
952 encoded information is organized as
953 fields separated by colon (':'):
955 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
960 A 32 bit integer as a unique id for
961 each printf function call
964 A 32 bit integer equal to the number
965 of arguments of printf function call
968 ``S[i]`` (where i = 0, 1, ... , N-1)
969 32 bit integers for the size in bytes
970 of the i-th FormatString argument of
971 the printf function call
974 The format string passed to the
975 printf function call.
976 "Kernels" sequence of Required Sequence of the mappings for each
977 mapping kernel in the code object. See
978 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
979 for the definition of the mapping.
980 ========== ============== ========= =======================================
984 .. table:: AMDHSA Code Object Kernel Metadata Mapping
985 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
987 ================= ============== ========= ================================
988 String Key Value Type Required? Description
989 ================= ============== ========= ================================
990 "Name" string Required Source name of the kernel.
991 "SymbolName" string Required Name of the kernel
992 descriptor ELF symbol.
993 "Language" string Source language of the kernel.
1001 "LanguageVersion" sequence of - The first integer is the major
1003 - The second integer is the
1005 "Attrs" mapping Mapping of kernel attributes.
1007 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1008 for the mapping definition.
1009 "Args" sequence of Sequence of mappings of the
1010 mapping kernel arguments. See
1011 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1012 for the definition of the mapping.
1013 "CodeProps" mapping Mapping of properties related to
1014 the kernel code. See
1015 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1016 for the mapping definition.
1017 ================= ============== ========= ================================
1021 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1022 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1024 =================== ============== ========= ==============================
1025 String Key Value Type Required? Description
1026 =================== ============== ========= ==============================
1027 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1028 3 integers must be >=1 and the dispatch
1029 work-group size X, Y, Z must
1030 correspond to the specified
1031 values. Defaults to 0, 0, 0.
1033 Corresponds to the OpenCL
1034 ``reqd_work_group_size``
1036 "WorkGroupSizeHint" sequence of The dispatch work-group size
1037 3 integers X, Y, Z is likely to be the
1040 Corresponds to the OpenCL
1041 ``work_group_size_hint``
1043 "VecTypeHint" string The name of a scalar or vector
1046 Corresponds to the OpenCL
1047 ``vec_type_hint`` attribute.
1049 "RuntimeHandle" string The external symbol name
1050 associated with a kernel.
1051 OpenCL runtime allocates a
1052 global buffer for the symbol
1053 and saves the kernel's address
1054 to it, which is used for
1055 device side enqueueing. Only
1056 available for device side
1058 =================== ============== ========= ==============================
1062 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1063 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1065 ================= ============== ========= ================================
1066 String Key Value Type Required? Description
1067 ================= ============== ========= ================================
1068 "Name" string Kernel argument name.
1069 "TypeName" string Kernel argument type name.
1070 "Size" integer Required Kernel argument size in bytes.
1071 "Align" integer Required Kernel argument alignment in
1072 bytes. Must be a power of two.
1073 "ValueKind" string Required Kernel argument kind that
1074 specifies how to set up the
1075 corresponding argument.
1079 The argument is copied
1080 directly into the kernarg.
1083 A global address space pointer
1084 to the buffer data is passed
1087 "DynamicSharedPointer"
1088 A group address space pointer
1089 to dynamically allocated LDS
1090 is passed in the kernarg.
1093 A global address space
1094 pointer to a S# is passed in
1098 A global address space
1099 pointer to a T# is passed in
1103 A global address space pointer
1104 to an OpenCL pipe is passed in
1108 A global address space pointer
1109 to an OpenCL device enqueue
1110 queue is passed in the
1113 "HiddenGlobalOffsetX"
1114 The OpenCL grid dispatch
1115 global offset for the X
1116 dimension is passed in the
1119 "HiddenGlobalOffsetY"
1120 The OpenCL grid dispatch
1121 global offset for the Y
1122 dimension is passed in the
1125 "HiddenGlobalOffsetZ"
1126 The OpenCL grid dispatch
1127 global offset for the Z
1128 dimension is passed in the
1132 An argument that is not used
1133 by the kernel. Space needs to
1134 be left for it, but it does
1135 not need to be set up.
1137 "HiddenPrintfBuffer"
1138 A global address space pointer
1139 to the runtime printf buffer
1140 is passed in kernarg.
1142 "HiddenDefaultQueue"
1143 A global address space pointer
1144 to the OpenCL device enqueue
1145 queue that should be used by
1146 the kernel by default is
1147 passed in the kernarg.
1149 "HiddenCompletionAction"
1150 A global address space pointer
1151 to help link enqueued kernels into
1152 the ancestor tree for determining
1153 when the parent kernel has finished.
1155 "ValueType" string Required Kernel argument value type. Only
1156 present if "ValueKind" is
1157 "ByValue". For vector data
1158 types, the value is for the
1159 element type. Values include:
1175 How can it be determined if a
1176 vector type, and what size
1178 "PointeeAlign" integer Alignment in bytes of pointee
1179 type for pointer type kernel
1180 argument. Must be a power
1181 of 2. Only present if
1183 "DynamicSharedPointer".
1184 "AddrSpaceQual" string Kernel argument address space
1185 qualifier. Only present if
1186 "ValueKind" is "GlobalBuffer" or
1187 "DynamicSharedPointer". Values
1198 Is GlobalBuffer only Global
1200 DynamicSharedPointer always
1201 Local? Can HCC allow Generic?
1202 How can Private or Region
1204 "AccQual" string Kernel argument access
1205 qualifier. Only present if
1206 "ValueKind" is "Image" or
1217 "ActualAccQual" string The actual memory accesses
1218 performed by the kernel on the
1219 kernel argument. Only present if
1220 "ValueKind" is "GlobalBuffer",
1221 "Image", or "Pipe". This may be
1222 more restrictive than indicated
1223 by "AccQual" to reflect what the
1224 kernel actual does. If not
1225 present then the runtime must
1226 assume what is implied by
1227 "AccQual" and "IsConst". Values
1234 "IsConst" boolean Indicates if the kernel argument
1235 is const qualified. Only present
1239 "IsRestrict" boolean Indicates if the kernel argument
1240 is restrict qualified. Only
1241 present if "ValueKind" is
1244 "IsVolatile" boolean Indicates if the kernel argument
1245 is volatile qualified. Only
1246 present if "ValueKind" is
1249 "IsPipe" boolean Indicates if the kernel argument
1250 is pipe qualified. Only present
1251 if "ValueKind" is "Pipe".
1254 Can GlobalBuffer be pipe
1256 ================= ============== ========= ================================
1260 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1261 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1263 ============================ ============== ========= =====================
1264 String Key Value Type Required? Description
1265 ============================ ============== ========= =====================
1266 "KernargSegmentSize" integer Required The size in bytes of
1268 that holds the values
1271 "GroupSegmentFixedSize" integer Required The amount of group
1275 bytes. This does not
1277 dynamically allocated
1278 group segment memory
1282 "PrivateSegmentFixedSize" integer Required The amount of fixed
1283 private address space
1284 memory required for a
1286 bytes. If the kernel
1288 stack then additional
1290 to this value for the
1292 "KernargSegmentAlign" integer Required The maximum byte
1295 kernarg segment. Must
1297 "WavefrontSize" integer Required Wavefront size. Must
1299 "NumSGPRs" integer Required Number of scalar
1303 includes the special
1309 SGPR added if a trap
1315 "NumVGPRs" integer Required Number of vector
1319 "MaxFlatWorkGroupSize" integer Required Maximum flat
1322 kernel in work-items.
1325 ReqdWorkGroupSize if
1327 "NumSpilledSGPRs" integer Number of stores from
1328 a scalar register to
1329 a register allocator
1332 "NumSpilledVGPRs" integer Number of stores from
1333 a vector register to
1334 a register allocator
1337 ============================ ============== ========= =====================
1344 The HSA architected queuing language (AQL) defines a user space memory interface
1345 that can be used to control the dispatch of kernels, in an agent independent
1346 way. An agent can have zero or more AQL queues created for it using the ROCm
1347 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1348 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1349 mechanics and packet layouts.
1351 The packet processor of a kernel agent is responsible for detecting and
1352 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1353 packet processor is implemented by the hardware command processor (CP),
1354 asynchronous dispatch controller (ADC) and shader processor input controller
1357 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1358 mode driver to initialize and register the AQL queue with CP.
1360 To dispatch a kernel the following actions are performed. This can occur in the
1361 CPU host program, or from an HSA kernel executing on a GPU.
1363 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1364 executed is obtained.
1365 2. A pointer to the kernel descriptor (see
1366 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1367 obtained. It must be for a kernel that is contained in a code object that that
1368 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1370 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1371 for a memory region with the kernarg property for the kernel agent that will
1372 execute the kernel. It must be at least 16 byte aligned.
1373 4. Kernel argument values are assigned to the kernel argument memory
1374 allocation. The layout is defined in the *HSA Programmer's Language Reference*
1375 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1376 memory in the same way constant memory is accessed. (Note that the HSA
1377 specification allows an implementation to copy the kernel argument contents to
1378 another location that is accessed by the kernel.)
1379 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1380 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1381 packet. The packet must be set up, and the final write must use an atomic
1382 store release to set the packet kind to ensure the packet contents are
1383 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1384 notify the kernel agent that the AQL queue has been updated. These rules, and
1385 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1386 System Architecture Specification* [HSA]_.
1387 6. A kernel dispatch packet includes information about the actual dispatch,
1388 such as grid and work-group size, together with information from the code
1389 object about the kernel, such as segment sizes. The ROCm runtime queries on
1390 the kernel symbol can be used to obtain the code object values which are
1391 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1392 7. CP executes micro-code and is responsible for detecting and setting up the
1393 GPU to execute the wavefronts of a kernel dispatch.
1394 8. CP ensures that when the a wavefront starts executing the kernel machine
1395 code, the scalar general purpose registers (SGPR) and vector general purpose
1396 registers (VGPR) are set up as required by the machine code. The required
1397 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1398 register state is defined in
1399 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1400 9. The prolog of the kernel machine code (see
1401 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1402 before continuing executing the machine code that corresponds to the kernel.
1403 10. When the kernel dispatch has completed execution, CP signals the completion
1404 signal specified in the kernel dispatch packet if not 0.
1406 .. _amdgpu-amdhsa-memory-spaces:
1411 The memory space properties are:
1413 .. table:: AMDHSA Memory Spaces
1414 :name: amdgpu-amdhsa-memory-spaces-table
1416 ================= =========== ======== ======= ==================
1417 Memory Space Name HSA Segment Hardware Address NULL Value
1419 ================= =========== ======== ======= ==================
1420 Private private scratch 32 0x00000000
1421 Local group LDS 32 0xFFFFFFFF
1422 Global global global 64 0x0000000000000000
1423 Constant constant *same as 64 0x0000000000000000
1425 Generic flat flat 64 0x0000000000000000
1426 Region N/A GDS 32 *not implemented
1428 ================= =========== ======== ======= ==================
1430 The global and constant memory spaces both use global virtual addresses, which
1431 are the same virtual address space used by the CPU. However, some virtual
1432 addresses may only be accessible to the CPU, some only accessible by the GPU,
1435 Using the constant memory space indicates that the data will not change during
1436 the execution of the kernel. This allows scalar read instructions to be
1437 used. The vector and scalar L1 caches are invalidated of volatile data before
1438 each kernel dispatch execution to allow constant memory to change values between
1441 The local memory space uses the hardware Local Data Store (LDS) which is
1442 automatically allocated when the hardware creates work-groups of wavefronts, and
1443 freed when all the wavefronts of a work-group have terminated. The data store
1444 (DS) instructions can be used to access it.
1446 The private memory space uses the hardware scratch memory support. If the kernel
1447 uses scratch, then the hardware allocates memory that is accessed using
1448 wavefront lane dword (4 byte) interleaving. The mapping used from private
1449 address to physical address is:
1451 ``wavefront-scratch-base +
1452 (private-address * wavefront-size * 4) +
1453 (wavefront-lane-id * 4)``
1455 There are different ways that the wavefront scratch base address is determined
1456 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1457 memory can be accessed in an interleaved manner using buffer instruction with
1458 the scratch buffer descriptor and per wave scratch offset, by the scratch
1459 instructions, or by flat instructions. If each lane of a wavefront accesses the
1460 same private address, the interleaving results in adjacent dwords being accessed
1461 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1462 supported except by flat and scratch instructions in GFX9.
1464 The generic address space uses the hardware flat address support available in
1465 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1466 local appertures), that are outside the range of addressible global memory, to
1467 map from a flat address to a private or local address.
1469 FLAT instructions can take a flat address and access global, private (scratch)
1470 and group (LDS) memory depending in if the address is within one of the
1471 apperture ranges. Flat access to scratch requires hardware aperture setup and
1472 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1473 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1474 (see :ref:`amdgpu-amdhsa-m0`).
1476 To convert between a segment address and a flat address the base address of the
1477 appertures address can be used. For GFX7-GFX8 these are available in the
1478 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1479 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1480 GFX9 the appature base addresses are directly available as inline constant
1481 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1482 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1483 which makes it easier to convert from flat to segment or segment to flat.
1488 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1489 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1490 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1491 enumeration values for the queries that are not trivially deducible from the S#
1497 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1498 structure allocated in memory accessible from both the CPU and GPU. The
1499 structure is defined by the ROCm runtime and subject to change between releases
1500 (see [AMD-ROCm-github]_).
1502 .. _amdgpu-amdhsa-hsa-aql-queue:
1507 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1508 between releases (see [AMD-ROCm-github]_). For some processors it contains
1509 fields needed to implement certain language features such as the flat address
1510 aperture bases. It also contains fields used by CP such as managing the
1511 allocation of scratch memory.
1513 .. _amdgpu-amdhsa-kernel-descriptor:
1518 A kernel descriptor consists of the information needed by CP to initiate the
1519 execution of a kernel, including the entry point address of the machine code
1520 that implements the kernel.
1522 Kernel Descriptor for GFX6-GFX9
1523 +++++++++++++++++++++++++++++++
1525 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1527 .. table:: Kernel Descriptor for GFX6-GFX9
1528 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1530 ======= ======= =============================== ============================
1531 Bits Size Field Name Description
1532 ======= ======= =============================== ============================
1533 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
1534 address space memory
1535 required for a work-group
1536 in bytes. This does not
1537 include any dynamically
1538 allocated local address
1539 space memory that may be
1540 added when the kernel is
1542 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
1543 private address space
1544 memory required for a
1545 work-item in bytes. If
1546 is_dynamic_callstack is 1
1547 then additional space must
1548 be added to this value for
1550 127:64 8 bytes Reserved, must be 0.
1551 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
1554 descriptor to kernel's
1555 entry point instruction
1556 which must be 256 byte
1558 383:192 24 Reserved, must be 0.
1560 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
1561 program settings used by
1563 ``COMPUTE_PGM_RSRC1``
1566 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1567 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
1568 program settings used by
1570 ``COMPUTE_PGM_RSRC2``
1573 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1574 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1575 SGPR user data registers
1577 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1579 The total number of SGPR
1581 requested must not exceed
1582 16 and match value in
1583 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1584 Any requests beyond 16
1586 449 1 bit EnableSGPRDispatchPtr *see above*
1587 450 1 bit EnableSGPRQueuePtr *see above*
1588 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1589 452 1 bit EnableSGPRDispatchID *see above*
1590 453 1 bit EnableSGPRFlatScratchInit *see above*
1591 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1592 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1594 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1596 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1598 463:458 6 bits Reserved, must be 0.
1599 511:464 6 Reserved, must be 0.
1601 512 **Total size 64 bytes.**
1602 ======= ====================================================================
1606 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1607 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1609 ======= ======= =============================== ===========================================================================
1610 Bits Size Field Name Description
1611 ======= ======= =============================== ===========================================================================
1612 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
1613 used by each work-item,
1614 granularity is device
1619 - roundup((max_vgpg + 1)
1622 Used by CP to set up
1623 ``COMPUTE_PGM_RSRC1.VGPRS``.
1624 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1625 used by a wavefront,
1626 granularity is device
1631 - roundup((max_sgpg + 1)
1635 - roundup((max_sgpg + 1)
1638 Includes the special SGPRs
1639 for VCC, Flat Scratch (for
1640 GFX7 onwards) and XNACK
1641 (for GFX8 onwards). It does
1642 not include the 16 SGPR
1643 added if a trap handler is
1646 Used by CP to set up
1647 ``COMPUTE_PGM_RSRC1.SGPRS``.
1648 11:10 2 bits PRIORITY Must be 0.
1650 Start executing wavefront
1651 at the specified priority.
1653 CP is responsible for
1655 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1656 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1657 with specified rounding
1660 precision floating point
1663 Floating point rounding
1664 mode values are defined in
1665 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1667 Used by CP to set up
1668 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1669 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1670 with specified rounding
1671 denorm mode for half/double (16
1672 and 64 bit) floating point
1673 precision floating point
1676 Floating point rounding
1677 mode values are defined in
1678 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1680 Used by CP to set up
1681 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1682 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1683 with specified denorm mode
1686 precision floating point
1689 Floating point denorm mode
1690 values are defined in
1691 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1693 Used by CP to set up
1694 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1695 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1696 with specified denorm mode
1698 and 64 bit) floating point
1699 precision floating point
1702 Floating point denorm mode
1703 values are defined in
1704 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1706 Used by CP to set up
1707 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1708 20 1 bit PRIV Must be 0.
1710 Start executing wavefront
1711 in privilege trap handler
1714 CP is responsible for
1716 ``COMPUTE_PGM_RSRC1.PRIV``.
1717 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1718 with DX10 clamp mode
1719 enabled. Used by the vector
1720 ALU to force DX10 style
1721 treatment of NaN's (when
1722 set, clamp NaN to zero,
1726 Used by CP to set up
1727 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1728 22 1 bit DEBUG_MODE Must be 0.
1730 Start executing wavefront
1731 in single step mode.
1733 CP is responsible for
1735 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1736 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1738 enabled. Floating point
1739 opcodes that support
1740 exception flag gathering
1741 will quiet and propagate
1742 signaling-NaN inputs per
1743 IEEE 754-2008. Min_dx10 and
1744 max_dx10 become IEEE
1745 754-2008 compliant due to
1746 signaling-NaN propagation
1749 Used by CP to set up
1750 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1751 24 1 bit BULKY Must be 0.
1753 Only one work-group allowed
1754 to execute on a compute
1757 CP is responsible for
1759 ``COMPUTE_PGM_RSRC1.BULKY``.
1760 25 1 bit CDBG_USER Must be 0.
1762 Flag that can be used to
1763 control debugging code.
1765 CP is responsible for
1767 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1768 26 1 bit FP16_OVFL GFX6-GFX8
1769 Reserved, must be 0.
1771 Wavefront starts execution
1772 with specified fp16 overflow
1775 - If 0, fp16 overflow generates
1777 - If 1, fp16 overflow that is the
1778 result of an +/-INF input value
1779 or divide by 0 produces a +/-INF,
1780 otherwise clamps computed
1781 overflow to +/-MAX_FP16 as
1784 Used by CP to set up
1785 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1786 31:27 5 bits Reserved, must be 0.
1787 32 **Total size 4 bytes**
1788 ======= ===================================================================================================================
1792 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1793 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1795 ======= ======= =============================== ===========================================================================
1796 Bits Size Field Name Description
1797 ======= ======= =============================== ===========================================================================
1798 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1799 _WAVE_OFFSET SGPR wave scratch offset
1800 system register (see
1801 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1803 Used by CP to set up
1804 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1805 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1807 requested. This number must
1808 match the number of user
1809 data registers enabled.
1811 Used by CP to set up
1812 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1813 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
1814 TRAP instruction which
1815 requires a trap handler to
1819 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1821 installed a trap handler
1822 regardless of the setting
1824 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1825 system SGPR register for
1826 the work-group id in the X
1828 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1830 Used by CP to set up
1831 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1832 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1833 system SGPR register for
1834 the work-group id in the Y
1836 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1838 Used by CP to set up
1839 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1840 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1841 system SGPR register for
1842 the work-group id in the Z
1844 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1846 Used by CP to set up
1847 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1848 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1849 system SGPR register for
1850 work-group information (see
1851 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1853 Used by CP to set up
1854 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1855 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1856 VGPR system registers used
1857 for the work-item ID.
1858 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1861 Used by CP to set up
1862 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1863 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1865 Wavefront starts execution
1867 exceptions enabled which
1868 are generated when L1 has
1869 witnessed a thread access
1873 CP is responsible for
1874 filling in the address
1876 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1877 according to what the
1879 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1881 Wavefront starts execution
1882 with memory violation
1883 exceptions exceptions
1884 enabled which are generated
1885 when a memory violation has
1886 occurred for this wave from
1888 (write-to-read-only-memory,
1889 mis-aligned atomic, LDS
1890 address out of range,
1891 illegal address, etc.).
1895 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1896 according to what the
1898 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
1900 CP uses the rounded value
1901 from the dispatch packet,
1902 not this value, as the
1903 dispatch may contain
1904 dynamically allocated group
1905 segment memory. CP writes
1907 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1909 Amount of group segment
1910 (LDS) to allocate for each
1911 work-group. Granularity is
1915 roundup(lds-size / (64 * 4))
1917 roundup(lds-size / (128 * 4))
1919 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1920 _INVALID_OPERATION with specified exceptions
1923 Used by CP to set up
1924 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1925 (set from bits 0..6).
1929 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1930 _SOURCE input operands is a
1932 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1933 _DIVISION_BY_ZERO Zero
1934 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1936 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1938 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1940 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1941 _ZERO (rcp_iflag_f32 instruction
1943 31 1 bit Reserved, must be 0.
1944 32 **Total size 4 bytes.**
1945 ======= ===================================================================================================================
1949 .. table:: Floating Point Rounding Mode Enumeration Values
1950 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1952 ====================================== ===== ==============================
1953 Enumeration Name Value Description
1954 ====================================== ===== ==============================
1955 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1956 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1957 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1958 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1959 ====================================== ===== ==============================
1963 .. table:: Floating Point Denorm Mode Enumeration Values
1964 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1966 ====================================== ===== ==============================
1967 Enumeration Name Value Description
1968 ====================================== ===== ==============================
1969 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1971 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1972 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1973 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1974 ====================================== ===== ==============================
1978 .. table:: System VGPR Work-Item ID Enumeration Values
1979 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1981 ======================================== ===== ============================
1982 Enumeration Name Value Description
1983 ======================================== ===== ============================
1984 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1986 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1988 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1990 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1991 ======================================== ===== ============================
1993 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1995 Initial Kernel Execution State
1996 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1998 This section defines the register state that will be set up by the packet
1999 processor prior to the start of execution of every wavefront. This is limited by
2000 the constraints of the hardware controllers of CP/ADC/SPI.
2002 The order of the SGPR registers is defined, but the compiler can specify which
2003 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2004 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2005 for enabled registers are dense starting at SGPR0: the first enabled register is
2006 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2009 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2010 all waves of the grid. It is possible to specify more than 16 User SGPRs using
2011 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2012 initialized. These are then immediately followed by the System SGPRs that are
2013 set up by ADC/SPI and can have different values for each wave of the grid
2016 SGPR register initial state is defined in
2017 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2019 .. table:: SGPR Register Set Up Order
2020 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2022 ========== ========================== ====== ==============================
2023 SGPR Order Name Number Description
2024 (kernel descriptor enable of
2026 ========== ========================== ====== ==============================
2027 First Private Segment Buffer 4 V# that can be used, together
2028 (enable_sgpr_private with Scratch Wave Offset as an
2029 _segment_buffer) offset, to access the private
2030 memory space using a segment
2033 CP uses the value provided by
2035 then Dispatch Ptr 2 64 bit address of AQL dispatch
2036 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2038 then Queue Ptr 2 64 bit address of amd_queue_t
2039 (enable_sgpr_queue_ptr) object for AQL queue on which
2040 the dispatch packet was
2042 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2043 (enable_sgpr_kernarg segment. This is directly
2044 _segment_ptr) copied from the
2045 kernarg_address in the kernel
2048 Having CP load it once avoids
2049 loading it at the beginning of
2051 then Dispatch Id 2 64 bit Dispatch ID of the
2052 (enable_sgpr_dispatch_id) dispatch packet being
2054 then Flat Scratch Init 2 This is 2 SGPRs:
2055 (enable_sgpr_flat_scratch
2059 The first SGPR is a 32 bit
2061 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2062 to per SPI base of memory
2063 for scratch for the queue
2064 executing the kernel
2065 dispatch. CP obtains this
2066 from the runtime. (The
2067 Scratch Segment Buffer base
2069 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2070 plus this offset.) The value
2071 of Scratch Wave Offset must
2072 be added to this offset by
2073 the kernel machine code,
2074 right shifted by 8, and
2075 moved to the FLAT_SCRATCH_HI
2077 FLAT_SCRATCH_HI corresponds
2078 to SGPRn-4 on GFX7, and
2079 SGPRn-6 on GFX8 (where SGPRn
2080 is the highest numbered SGPR
2081 allocated to the wave).
2083 multiplied by 256 (as it is
2084 in units of 256 bytes) and
2086 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2087 to calculate the per wave
2088 FLAT SCRATCH BASE in flat
2089 memory instructions that
2093 The second SGPR is 32 bit
2094 byte size of a single
2095 work-item's scratch memory
2096 usage. CP obtains this from
2097 the runtime, and it is
2098 always a multiple of DWORD.
2099 CP checks that the value in
2100 the kernel dispatch packet
2101 Private Segment Byte Size is
2102 not larger, and requests the
2103 runtime to increase the
2104 queue's scratch size if
2105 necessary. The kernel code
2107 FLAT_SCRATCH_LO which is
2108 SGPRn-3 on GFX7 and SGPRn-5
2109 on GFX8. FLAT_SCRATCH_LO is
2110 used as the FLAT SCRATCH
2112 instructions. Having CP load
2113 it once avoids loading it at
2114 the beginning of every
2118 64 bit base address of the
2119 per SPI scratch backing
2120 memory managed by SPI for
2121 the queue executing the
2122 kernel dispatch. CP obtains
2123 this from the runtime (and
2124 divides it if there are
2125 multiple Shader Arrays each
2126 with its own SPI). The value
2127 of Scratch Wave Offset must
2128 be added by the kernel
2129 machine code and the result
2130 moved to the FLAT_SCRATCH
2131 SGPR which is SGPRn-6 and
2132 SGPRn-5. It is used as the
2133 FLAT SCRATCH BASE in flat
2134 memory instructions.
2135 then Private Segment Size 1 The 32 bit byte size of a
2136 (enable_sgpr_private single
2138 scratch_segment_size) memory
2139 allocation. This is the
2140 value from the kernel
2141 dispatch packet Private
2142 Segment Byte Size rounded up
2143 by CP to a multiple of
2146 Having CP load it once avoids
2147 loading it at the beginning of
2150 This is not used for
2151 GFX7-GFX8 since it is the same
2152 value as the second SGPR of
2153 Flat Scratch Init. However, it
2154 may be needed for GFX9 which
2155 changes the meaning of the
2156 Flat Scratch Init value.
2157 then Grid Work-Group Count X 1 32 bit count of the number of
2158 (enable_sgpr_grid work-groups in the X dimension
2159 _workgroup_count_X) for the grid being
2160 executed. Computed from the
2161 fields in the kernel dispatch
2162 packet as ((grid_size.x +
2163 workgroup_size.x - 1) /
2165 then Grid Work-Group Count Y 1 32 bit count of the number of
2166 (enable_sgpr_grid work-groups in the Y dimension
2167 _workgroup_count_Y && for the grid being
2168 less than 16 previous executed. Computed from the
2169 SGPRs) fields in the kernel dispatch
2170 packet as ((grid_size.y +
2171 workgroup_size.y - 1) /
2174 Only initialized if <16
2175 previous SGPRs initialized.
2176 then Grid Work-Group Count Z 1 32 bit count of the number of
2177 (enable_sgpr_grid work-groups in the Z dimension
2178 _workgroup_count_Z && for the grid being
2179 less than 16 previous executed. Computed from the
2180 SGPRs) fields in the kernel dispatch
2181 packet as ((grid_size.z +
2182 workgroup_size.z - 1) /
2185 Only initialized if <16
2186 previous SGPRs initialized.
2187 then Work-Group Id X 1 32 bit work-group id in X
2188 (enable_sgpr_workgroup_id dimension of grid for
2190 then Work-Group Id Y 1 32 bit work-group id in Y
2191 (enable_sgpr_workgroup_id dimension of grid for
2193 then Work-Group Id Z 1 32 bit work-group id in Z
2194 (enable_sgpr_workgroup_id dimension of grid for
2196 then Work-Group Info 1 {first_wave, 14'b0000,
2197 (enable_sgpr_workgroup ordered_append_term[10:0],
2198 _info) threadgroup_size_in_waves[5:0]}
2199 then Scratch Wave Offset 1 32 bit byte offset from base
2200 (enable_sgpr_private of scratch base of queue
2201 _segment_wave_offset) executing the kernel
2202 dispatch. Must be used as an
2204 segment address when using
2205 Scratch Segment Buffer. It
2206 must be used to set up FLAT
2207 SCRATCH for flat addressing
2209 :ref:`amdgpu-amdhsa-flat-scratch`).
2210 ========== ========================== ====== ==============================
2212 The order of the VGPR registers is defined, but the compiler can specify which
2213 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2214 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2215 for enabled registers are dense starting at VGPR0: the first enabled register is
2216 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2219 VGPR register initial state is defined in
2220 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2222 .. table:: VGPR Register Set Up Order
2223 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2225 ========== ========================== ====== ==============================
2226 VGPR Order Name Number Description
2227 (kernel descriptor enable of
2229 ========== ========================== ====== ==============================
2230 First Work-Item Id X 1 32 bit work item id in X
2231 (Always initialized) dimension of work-group for
2233 then Work-Item Id Y 1 32 bit work item id in Y
2234 (enable_vgpr_workitem_id dimension of work-group for
2235 > 0) wavefront lane.
2236 then Work-Item Id Z 1 32 bit work item id in Z
2237 (enable_vgpr_workitem_id dimension of work-group for
2238 > 1) wavefront lane.
2239 ========== ========================== ====== ==============================
2241 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2243 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2245 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2246 combination including none.
2247 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2248 cannot included with the flat scratch init value which is per queue.
2249 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2252 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2253 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2255 The global segment can be accessed either using buffer instructions (GFX6 which
2256 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2257 instructions (GFX9).
2259 If buffer operations are used then the compiler can generate a V# with the
2260 following properties:
2264 * ATC: 1 if IOMMU present (such as APU)
2266 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2267 APU and NC for dGPU).
2269 .. _amdgpu-amdhsa-kernel-prolog:
2274 .. _amdgpu-amdhsa-m0:
2280 The M0 register must be initialized with a value at least the total LDS size
2281 if the kernel may access LDS via DS or flat operations. Total LDS size is
2282 available in dispatch packet. For M0, it is also possible to use maximum
2283 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2286 The M0 register is not used for range checking LDS accesses and so does not
2287 need to be initialized in the prolog.
2289 .. _amdgpu-amdhsa-flat-scratch:
2294 If the kernel may use flat operations to access scratch memory, the prolog code
2295 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2296 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2297 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2300 Flat scratch is not supported.
2303 1. The low word of Flat Scratch Init is 32 bit byte offset from
2304 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2305 being managed by SPI for the queue executing the kernel dispatch. This is
2306 the same value used in the Scratch Segment Buffer V# base address. The
2307 prolog must add the value of Scratch Wave Offset to get the wave's byte
2308 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2309 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2310 by 8 before moving into FLAT_SCRATCH_LO.
2311 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2312 work-items scratch memory usage. This is directly loaded from the kernel
2313 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2314 DWORD. Having CP load it once avoids loading it at the beginning of every
2315 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2319 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2320 memory being managed by SPI for the queue executing the kernel dispatch. The
2321 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2322 pair for use as the flat scratch base in flat memory instructions.
2324 .. _amdgpu-amdhsa-memory-model:
2329 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2330 (see :ref:`memmodel`). *The implementation is WIP.*
2333 Update when implementation complete.
2335 The AMDGPU backend supports the memory synchronization scopes specified in
2336 :ref:`amdgpu-memory-scopes`.
2338 The code sequences used to implement the memory model are defined in table
2339 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2341 The sequences specify the order of instructions that a single thread must
2342 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2343 to other memory instructions executed by the same thread. This allows them to be
2344 moved earlier or later which can allow them to be combined with other instances
2345 of the same instruction, or hoisted/sunk out of loops to improve
2346 performance. Only the instructions related to the memory model are given;
2347 additional ``s_waitcnt`` instructions are required to ensure registers are
2348 defined before being used. These may be able to be combined with the memory
2349 model ``s_waitcnt`` instructions as described above.
2351 The AMDGPU backend supports the following memory models:
2353 HSA Memory Model [HSA]_
2354 The HSA memory model uses a single happens-before relation for all address
2355 spaces (see :ref:`amdgpu-address-spaces`).
2356 OpenCL Memory Model [OpenCL]_
2357 The OpenCL memory model which has separate happens-before relations for the
2358 global and local address spaces. Only a fence specifying both global and
2359 local address space, and seq_cst instructions join the relationships. Since
2360 the LLVM ``memfence`` instruction does not allow an address space to be
2361 specified the OpenCL fence has to convervatively assume both local and
2362 global address space was specified. However, optimizations can often be
2363 done to eliminate the additional ``s_waitcnt`` instructions when there are
2364 no intervening memory instructions which access the corresponding address
2365 space. The code sequences in the table indicate what can be omitted for the
2366 OpenCL memory. The target triple environment is used to determine if the
2367 source language is OpenCL (see :ref:`amdgpu-opencl`).
2369 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2372 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2373 termed vector memory operations.
2377 * Each agent has multiple compute units (CU).
2378 * Each CU has multiple SIMDs that execute wavefronts.
2379 * The wavefronts for a single work-group are executed in the same CU but may be
2380 executed by different SIMDs.
2381 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2383 * All LDS operations of a CU are performed as wavefront wide operations in a
2384 global order and involve no caching. Completion is reported to a wavefront in
2386 * The LDS memory has multiple request queues shared by the SIMDs of a
2387 CU. Therefore, the LDS operations performed by different waves of a work-group
2388 can be reordered relative to each other, which can result in reordering the
2389 visibility of vector memory operations with respect to LDS operations of other
2390 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2391 ensure synchronization between LDS operations and vector memory operations
2392 between waves of a work-group, but not between operations performed by the
2394 * The vector memory operations are performed as wavefront wide operations and
2395 completion is reported to a wavefront in execution order. The exception is
2396 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2397 vector memory order if they access LDS memory, and out of LDS operation order
2398 if they access global memory.
2399 * The vector memory operations access a single vector L1 cache shared by all
2400 SIMDs a CU. Therefore, no special action is required for coherence between the
2401 lanes of a single wavefront, or for coherence between wavefronts in the same
2402 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2403 executing in different work-groups as they may be executing on different CUs.
2404 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2405 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2406 scalar operations are used in a restricted way so do not impact the memory
2407 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2408 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2410 * The L2 cache has independent channels to service disjoint ranges of virtual
2412 * Each CU has a separate request queue per channel. Therefore, the vector and
2413 scalar memory operations performed by waves executing in different work-groups
2414 (which may be executing on different CUs) of an agent can be reordered
2415 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2416 synchronization between vector memory operations of different CUs. It ensures a
2417 previous vector memory operation has completed before executing a subsequent
2418 vector memory or LDS operation and so can be used to meet the requirements of
2419 acquire and release.
2420 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2421 of virtual addresses can be set up to bypass it to ensure system coherence.
2423 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2424 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2425 memory, atomic memory orderings are not meaningful and all accesses are treated
2428 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2429 scalar memory instructions). Since the constant address space contents do not
2430 change during the execution of a kernel dispatch it is not legal to perform
2431 stores, and atomic memory orderings are not meaningful and all access are
2432 treated as non-atomic.
2434 A memory synchronization scope wider than work-group is not meaningful for the
2435 group (LDS) address space and is treated as work-group.
2437 The memory model does not support the region address space which is treated as
2440 Acquire memory ordering is not meaningful on store atomic instructions and is
2441 treated as non-atomic.
2443 Release memory ordering is not meaningful on load atomic instructions and is
2444 treated a non-atomic.
2446 Acquire-release memory ordering is not meaningful on load or store atomic
2447 instructions and is treated as acquire and release respectively.
2449 AMDGPU backend only uses scalar memory operations to access memory that is
2450 proven to not change during the execution of the kernel dispatch. This includes
2451 constant address space and global address space for program scope const
2452 variables. Therefore the kernel machine code does not have to maintain the
2453 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2454 and vector L1 caches are invalidated between kernel dispatches by CP since
2455 constant address space data may change between kernel dispatch executions. See
2456 :ref:`amdgpu-amdhsa-memory-spaces`.
2458 The one execption is if scalar writes are used to spill SGPR registers. In this
2459 case the AMDGPU backend ensures the memory location used to spill is never
2460 accessed by vector memory operations at the same time. If scalar writes are used
2461 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2462 return since the locations may be used for vector memory instructions by a
2463 future wave that uses the same scratch area, or a function call that creates a
2464 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2465 as all scalar writes are write-before-read in the same thread.
2467 Scratch backing memory (which is used for the private address space)
2468 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2469 address space is only accessed by a single thread, and is always
2470 write-before-read, there is never a need to invalidate these entries from the L1
2471 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2472 volatile cache lines.
2474 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2475 to invalidate the L2 cache. This also causes it to be treated as
2476 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2477 (cache coherent) and so the L2 cache will coherent with the CPU and other
2480 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2481 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2483 ============ ============ ============== ========== ===============================
2484 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2485 Ordering Sync Scope Address
2487 ============ ============ ============== ========== ===============================
2489 -----------------------------------------------------------------------------------
2490 load *none* *none* - global - !volatile & !nontemporal
2492 - private 1. buffer/global/flat_load
2494 - volatile & !nontemporal
2496 1. buffer/global/flat_load
2501 1. buffer/global/flat_load
2504 load *none* *none* - local 1. ds_load
2505 store *none* *none* - global - !nontemporal
2507 - private 1. buffer/global/flat_store
2511 1. buffer/global/flat_stote
2514 store *none* *none* - local 1. ds_store
2515 **Unordered Atomic**
2516 -----------------------------------------------------------------------------------
2517 load atomic unordered *any* *any* *Same as non-atomic*.
2518 store atomic unordered *any* *any* *Same as non-atomic*.
2519 atomicrmw unordered *any* *any* *Same as monotonic
2521 **Monotonic Atomic**
2522 -----------------------------------------------------------------------------------
2523 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2524 - wavefront - generic
2526 load atomic monotonic - singlethread - local 1. ds_load
2529 load atomic monotonic - agent - global 1. buffer/global/flat_load
2530 - system - generic glc=1
2531 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2532 - wavefront - generic
2536 store atomic monotonic - singlethread - local 1. ds_store
2539 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2540 - wavefront - generic
2544 atomicrmw monotonic - singlethread - local 1. ds_atomic
2548 -----------------------------------------------------------------------------------
2549 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2552 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2553 load atomic acquire - workgroup - local 1. ds_load
2554 2. s_waitcnt lgkmcnt(0)
2557 - Must happen before
2569 load atomic acquire - workgroup - generic 1. flat_load
2570 2. s_waitcnt lgkmcnt(0)
2573 - Must happen before
2585 load atomic acquire - agent - global 1. buffer/global/flat_load
2587 2. s_waitcnt vmcnt(0)
2589 - Must happen before
2597 3. buffer_wbinvl1_vol
2599 - Must happen before
2609 load atomic acquire - agent - generic 1. flat_load glc=1
2610 - system 2. s_waitcnt vmcnt(0) &
2615 - Must happen before
2618 - Ensures the flat_load
2623 3. buffer_wbinvl1_vol
2625 - Must happen before
2635 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2638 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2639 atomicrmw acquire - workgroup - local 1. ds_atomic
2640 2. waitcnt lgkmcnt(0)
2643 - Must happen before
2656 atomicrmw acquire - workgroup - generic 1. flat_atomic
2657 2. waitcnt lgkmcnt(0)
2660 - Must happen before
2673 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
2674 - system 2. s_waitcnt vmcnt(0)
2676 - Must happen before
2685 3. buffer_wbinvl1_vol
2687 - Must happen before
2697 atomicrmw acquire - agent - generic 1. flat_atomic
2698 - system 2. s_waitcnt vmcnt(0) &
2703 - Must happen before
2712 3. buffer_wbinvl1_vol
2714 - Must happen before
2724 fence acquire - singlethread *none* *none*
2726 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2731 - However, since LLVM
2756 fence-paired-atomic).
2757 - Must happen before
2768 fence-paired-atomic.
2770 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2777 - However, since LLVM
2785 - Could be split into
2794 - s_waitcnt vmcnt(0)
2805 fence-paired-atomic).
2806 - s_waitcnt lgkmcnt(0)
2817 fence-paired-atomic).
2818 - Must happen before
2832 fence-paired-atomic.
2834 2. buffer_wbinvl1_vol
2836 - Must happen before any
2837 following global/generic
2847 -----------------------------------------------------------------------------------
2848 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2851 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2860 - Must happen before
2871 2. buffer/global/flat_store
2872 store atomic release - workgroup - local 1. ds_store
2873 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2882 - Must happen before
2894 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2895 - system - generic vmcnt(0)
2899 - Could be split into
2908 - s_waitcnt vmcnt(0)
2915 - s_waitcnt lgkmcnt(0)
2922 - Must happen before
2933 2. buffer/global/ds/flat_store
2934 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2937 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2946 - Must happen before
2957 2. buffer/global/flat_atomic
2958 atomicrmw release - workgroup - local 1. ds_atomic
2959 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2968 - Must happen before
2980 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2981 - system - generic vmcnt(0)
2985 - Could be split into
2994 - s_waitcnt vmcnt(0)
3001 - s_waitcnt lgkmcnt(0)
3008 - Must happen before
3019 2. buffer/global/ds/flat_atomic
3020 fence release - singlethread *none* *none*
3022 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3027 - However, since LLVM
3048 - Must happen before
3057 fence-paired-atomic).
3064 fence-paired-atomic.
3066 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3077 - However, since LLVM
3092 - Could be split into
3101 - s_waitcnt vmcnt(0)
3108 - s_waitcnt lgkmcnt(0)
3115 - Must happen before
3124 fence-paired-atomic).
3131 fence-paired-atomic.
3133 **Acquire-Release Atomic**
3134 -----------------------------------------------------------------------------------
3135 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3138 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3147 - Must happen before
3158 2. buffer/global/flat_atomic
3159 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3160 2. s_waitcnt lgkmcnt(0)
3163 - Must happen before
3176 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3185 - Must happen before
3197 3. s_waitcnt lgkmcnt(0)
3200 - Must happen before
3213 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3218 - Could be split into
3227 - s_waitcnt vmcnt(0)
3234 - s_waitcnt lgkmcnt(0)
3241 - Must happen before
3252 2. buffer/global/flat_atomic
3253 3. s_waitcnt vmcnt(0)
3255 - Must happen before
3264 4. buffer_wbinvl1_vol
3266 - Must happen before
3276 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3281 - Could be split into
3290 - s_waitcnt vmcnt(0)
3297 - s_waitcnt lgkmcnt(0)
3304 - Must happen before
3316 3. s_waitcnt vmcnt(0) &
3321 - Must happen before
3330 4. buffer_wbinvl1_vol
3332 - Must happen before
3342 fence acq_rel - singlethread *none* *none*
3344 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3364 - Must happen before
3387 acquire-fence-paired-atomic
3408 release-fence-paired-atomic
3409 ). This satisfies the
3413 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3420 - However, since LLVM
3428 - Could be split into
3437 - s_waitcnt vmcnt(0)
3444 - s_waitcnt lgkmcnt(0)
3451 - Must happen before
3456 global/local/generic
3465 acquire-fence-paired-atomic
3477 global/local/generic
3486 release-fence-paired-atomic
3487 ). This satisfies the
3491 2. buffer_wbinvl1_vol
3493 - Must happen before
3507 **Sequential Consistent Atomic**
3508 -----------------------------------------------------------------------------------
3509 load atomic seq_cst - singlethread - global *Same as corresponding
3510 - wavefront - local load atomic acquire,
3511 - generic except must generated
3512 all instructions even
3514 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3529 lgkmcnt(0) and so do
3564 instructions same as
3567 except must generated
3568 all instructions even
3570 load atomic seq_cst - workgroup - local *Same as corresponding
3571 load atomic acquire,
3572 except must generated
3573 all instructions even
3575 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3576 - system - generic vmcnt(0)
3578 - Could be split into
3587 - waitcnt lgkmcnt(0)
3600 lgkmcnt(0) and so do
3651 instructions same as
3654 except must generated
3655 all instructions even
3657 store atomic seq_cst - singlethread - global *Same as corresponding
3658 - wavefront - local store atomic release,
3659 - workgroup - generic except must generated
3660 all instructions even
3662 store atomic seq_cst - agent - global *Same as corresponding
3663 - system - generic store atomic release,
3664 except must generated
3665 all instructions even
3667 atomicrmw seq_cst - singlethread - global *Same as corresponding
3668 - wavefront - local atomicrmw acq_rel,
3669 - workgroup - generic except must generated
3670 all instructions even
3672 atomicrmw seq_cst - agent - global *Same as corresponding
3673 - system - generic atomicrmw acq_rel,
3674 except must generated
3675 all instructions even
3677 fence seq_cst - singlethread *none* *Same as corresponding
3678 - wavefront fence acq_rel,
3679 - workgroup except must generated
3680 - agent all instructions even
3681 - system for OpenCL.*
3682 ============ ============ ============== ========== ===============================
3684 The memory order also adds the single thread optimization constrains defined in
3686 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3688 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3689 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3691 ============ ==============================================================
3692 LLVM Memory Optimization Constraints
3694 ============ ==============================================================
3697 acquire - If a load atomic/atomicrmw then no following load/load
3698 atomic/store/ store atomic/atomicrmw/fence instruction can
3699 be moved before the acquire.
3700 - If a fence then same as load atomic, plus no preceding
3701 associated fence-paired-atomic can be moved after the fence.
3702 release - If a store atomic/atomicrmw then no preceding load/load
3703 atomic/store/ store atomic/atomicrmw/fence instruction can
3704 be moved after the release.
3705 - If a fence then same as store atomic, plus no following
3706 associated fence-paired-atomic can be moved before the
3708 acq_rel Same constraints as both acquire and release.
3709 seq_cst - If a load atomic then same constraints as acquire, plus no
3710 preceding sequentially consistent load atomic/store
3711 atomic/atomicrmw/fence instruction can be moved after the
3713 - If a store atomic then the same constraints as release, plus
3714 no following sequentially consistent load atomic/store
3715 atomic/atomicrmw/fence instruction can be moved before the
3717 - If an atomicrmw/fence then same constraints as acq_rel.
3718 ============ ==============================================================
3723 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3724 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3725 the ``s_trap`` instruction with the following usage:
3727 .. table:: AMDGPU Trap Handler for AMDHSA OS
3728 :name: amdgpu-trap-handler-for-amdhsa-os-table
3730 =================== =============== =============== =======================
3731 Usage Code Sequence Trap Handler Description
3733 =================== =============== =============== =======================
3734 reserved ``s_trap 0x00`` Reserved by hardware.
3735 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3736 ``queue_ptr`` ``debugtrap``
3737 ``VGPR0``: intrinsic (not
3738 ``arg`` implemented).
3739 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3740 ``queue_ptr`` terminated and its
3741 associated queue put
3742 into the error state.
3743 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3744 ``queue_ptr`` installed handled
3745 same as ``llvm.trap``.
3746 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3748 debugger ``s_trap 0x08`` Reserved for debugger.
3749 debugger ``s_trap 0xfe`` Reserved for debugger.
3750 debugger ``s_trap 0xff`` Reserved for debugger.
3751 =================== =============== =============== =======================
3756 This section provides code conventions used when the target triple OS is
3757 empty (see :ref:`amdgpu-target-triples`).
3762 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3763 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3764 instructions are handled as follows:
3766 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3767 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3769 =============== =============== ===========================================
3770 Usage Code Sequence Description
3771 =============== =============== ===========================================
3772 llvm.trap s_endpgm Causes wavefront to be terminated.
3773 llvm.debugtrap *none* Compiler warning given that there is no
3774 trap handler installed.
3775 =============== =============== ===========================================
3785 When generating code for the OpenCL language the target triple environment
3786 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3788 When the language is OpenCL the following differences occur:
3790 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3791 2. The AMDGPU backend adds additional arguments to the kernel.
3792 3. Additional metadata is generated
3793 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3796 Specify what affect this has. Hidden arguments added. Additional metadata
3804 When generating code for the OpenCL language the target triple environment
3805 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3807 When the language is OpenCL the following differences occur:
3809 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3812 Specify what affect this has.
3817 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3818 It supports AMDGCN GFX6-GFX9.
3820 This section describes general syntax for instructions and operands. For more
3821 information about instructions, their semantics and supported combinations of
3822 operands, refer to one of instruction set architecture manuals
3823 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3825 An instruction has the following syntax (register operands are normally
3826 comma-separated while extra operands are space-separated):
3828 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3833 The following syntax for register operands is supported:
3835 * SGPR registers: s0, ... or s[0], ...
3836 * VGPR registers: v0, ... or v[0], ...
3837 * TTMP registers: ttmp0, ... or ttmp[0], ...
3838 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3839 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3840 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3841 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3842 * Register index expressions: v[2*2], s[1-1:2-1]
3843 * 'off' indicates that an operand is not enabled
3845 The following extra operands are supported:
3847 * offset, offset0, offset1
3849 * glc, slc, tfe bits
3850 * waitcnt: integer or combination of counter values
3853 - abs (\| \|), neg (\-)
3857 - row_shl, row_shr, row_ror, row_rol
3858 - row_mirror, row_half_mirror, row_bcast
3859 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3860 - row_mask, bank_mask, bound_ctrl
3864 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3865 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3868 Instruction Examples
3869 ~~~~~~~~~~~~~~~~~~~~
3874 .. code-block:: nasm
3876 ds_add_u32 v2, v4 offset:16
3877 ds_write_src2_b64 v2 offset0:4 offset1:8
3878 ds_cmpst_f32 v2, v4, v6
3879 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3882 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3887 .. code-block:: nasm
3889 flat_load_dword v1, v[3:4]
3890 flat_store_dwordx3 v[3:4], v[5:7]
3891 flat_atomic_swap v1, v[3:4], v5 glc
3892 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3893 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3895 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3900 .. code-block:: nasm
3902 buffer_load_dword v1, off, s[4:7], s1
3903 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3904 buffer_store_format_xy v[1:2], off, s[4:7], s1
3906 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3908 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3913 .. code-block:: nasm
3915 s_load_dword s1, s[2:3], 0xfc
3916 s_load_dwordx8 s[8:15], s[2:3], s4
3917 s_load_dwordx16 s[88:103], s[2:3], s4
3921 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3926 .. code-block:: nasm
3929 s_mov_b64 s[0:1], 0x80000000
3931 s_wqm_b64 s[2:3], s[4:5]
3932 s_bcnt0_i32_b64 s1, s[2:3]
3933 s_swappc_b64 s[2:3], s[4:5]
3934 s_cbranch_join s[4:5]
3936 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3941 .. code-block:: nasm
3943 s_add_u32 s1, s2, s3
3944 s_and_b64 s[2:3], s[4:5], s[6:7]
3945 s_cselect_b32 s1, s2, s3
3946 s_andn2_b32 s2, s4, s6
3947 s_lshr_b64 s[2:3], s[4:5], s6
3948 s_ashr_i32 s2, s4, s6
3949 s_bfm_b64 s[2:3], s4, s6
3950 s_bfe_i64 s[2:3], s[4:5], s6
3951 s_cbranch_g_fork s[4:5], s[6:7]
3953 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3958 .. code-block:: nasm
3961 s_bitcmp1_b32 s1, s2
3962 s_bitcmp0_b64 s[2:3], s4
3965 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3970 .. code-block:: nasm
3975 s_waitcnt 0 ; Wait for all counters to be 0
3976 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3977 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3981 s_sendmsg sendmsg(MSG_INTERRUPT)
3984 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3986 Unless otherwise mentioned, little verification is performed on the operands
3987 of SOPP Instructions, so it is up to the programmer to be familiar with the
3988 range or acceptable values.
3993 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3994 the assembler will automatically use optimal encoding based on its operands.
3995 To force specific encoding, one can add a suffix to the opcode of the instruction:
3997 * _e32 for 32-bit VOP1/VOP2/VOPC
3998 * _e64 for 64-bit VOP3
4000 * _sdwa for VOP_SDWA
4002 VOP1/VOP2/VOP3/VOPC examples:
4004 .. code-block:: nasm
4007 v_mov_b32_e32 v1, v2
4009 v_cvt_f64_i32_e32 v[1:2], v2
4010 v_floor_f32_e32 v1, v2
4011 v_bfrev_b32_e32 v1, v2
4012 v_add_f32_e32 v1, v2, v3
4013 v_mul_i32_i24_e64 v1, v2, 3
4014 v_mul_i32_i24_e32 v1, -3, v3
4015 v_mul_i32_i24_e32 v1, -100, v3
4016 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4017 v_max_f16_e32 v1, v2, v3
4021 .. code-block:: nasm
4023 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4024 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4025 v_mov_b32 v0, v0 wave_shl:1
4026 v_mov_b32 v0, v0 row_mirror
4027 v_mov_b32 v0, v0 row_bcast:31
4028 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4029 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4030 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4034 .. code-block:: nasm
4036 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4037 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4038 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4039 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4040 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4042 For full list of supported instructions, refer to "Vector ALU instructions".
4044 HSA Code Object Directives
4045 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4047 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4048 one can specify them with assembler directives.
4050 .hsa_code_object_version major, minor
4051 +++++++++++++++++++++++++++++++++++++
4053 *major* and *minor* are integers that specify the version of the HSA code
4054 object that will be generated by the assembler.
4056 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4057 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4060 *major*, *minor*, and *stepping* are all integers that describe the instruction
4061 set architecture (ISA) version of the assembly program.
4063 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
4064 "AMD" and *arch* should always be equal to "AMDGPU".
4066 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4067 from the value of the -mcpu option that is passed to the assembler.
4069 .amdgpu_hsa_kernel (name)
4070 +++++++++++++++++++++++++
4072 This directives specifies that the symbol with given name is a kernel entry point
4073 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4078 This directive marks the beginning of a list of key / value pairs that are used
4079 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4080 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4081 any amd_kernel_code_t values that are unspecified a default value will be
4082 used. The default value for all keys is 0, with the following exceptions:
4084 - *kernel_code_version_major* defaults to 1.
4085 - *machine_kind* defaults to 1.
4086 - *machine_version_major*, *machine_version_minor*, and
4087 *machine_version_stepping* are derived from the value of the -mcpu option
4088 that is passed to the assembler.
4089 - *kernel_code_entry_byte_offset* defaults to 256.
4090 - *wavefront_size* defaults to 6.
4091 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4092 *private_segment_alignment* default to 4. Note that alignments are specified
4093 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4095 The *.amd_kernel_code_t* directive must be placed immediately after the
4096 function label and before any instructions.
4098 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4099 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4101 Here is an example of a minimal amd_kernel_code_t specification:
4103 .. code-block:: none
4105 .hsa_code_object_version 1,0
4106 .hsa_code_object_isa
4111 .amdgpu_hsa_kernel hello_world
4116 enable_sgpr_kernarg_segment_ptr = 1
4118 compute_pgm_rsrc1_vgprs = 0
4119 compute_pgm_rsrc1_sgprs = 0
4120 compute_pgm_rsrc2_user_sgpr = 2
4121 kernarg_segment_byte_size = 8
4122 wavefront_sgpr_count = 2
4123 workitem_vgpr_count = 3
4124 .end_amd_kernel_code_t
4126 s_load_dwordx2 s[0:1], s[0:1] 0x0
4127 v_mov_b32 v0, 3.14159
4128 s_waitcnt lgkmcnt(0)
4131 flat_store_dword v[1:2], v0
4134 .size hello_world, .Lfunc_end0-hello_world
4136 Additional Documentation
4137 ========================
4139 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4140 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4141 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4142 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4143 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4144 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4145 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4146 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4147 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4148 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4149 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4150 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4151 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4152 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4153 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4154 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__