1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
29 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
36 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
39 ============ ==============================================================
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
46 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
49 ============== ============================================================
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
59 ============== ============================================================
61 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
64 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
76 .. _amdgpu-processors:
81 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82 names from both the *Processor* and *Alternative Processor* can be used.
84 .. table:: AMDGPU Processors
85 :name: amdgpu-processor-table
87 =========== =============== ============ ===== ========= ======= ==================
88 Processor Alternative Target dGPU/ Target ROCm Example
89 Processor Triple APU Features Support Products
90 Architecture Supported
92 =========== =============== ============ ===== ========= ======= ==================
93 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
94 -----------------------------------------------------------------------------------
95 ``r600`` ``r600`` dGPU
96 ``r630`` ``r600`` dGPU
97 ``rs880`` ``r600`` dGPU
98 ``rv670`` ``r600`` dGPU
99 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
100 -----------------------------------------------------------------------------------
101 ``rv710`` ``r600`` dGPU
102 ``rv730`` ``r600`` dGPU
103 ``rv770`` ``r600`` dGPU
104 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
105 -----------------------------------------------------------------------------------
106 ``cedar`` ``r600`` dGPU
107 ``redwood`` ``r600`` dGPU
108 ``sumo`` ``r600`` dGPU
109 ``juniper`` ``r600`` dGPU
110 ``cypress`` ``r600`` dGPU
111 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
112 -----------------------------------------------------------------------------------
113 ``barts`` ``r600`` dGPU
114 ``turks`` ``r600`` dGPU
115 ``caicos`` ``r600`` dGPU
116 ``cayman`` ``r600`` dGPU
117 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
118 -----------------------------------------------------------------------------------
119 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
120 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
124 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
125 -----------------------------------------------------------------------------------
126 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
136 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
140 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
144 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
145 - ``mullins`` - E1-2200
153 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
157 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
158 -----------------------------------------------------------------------------------
159 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
165 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
168 \ ``amdgcn`` APU - xnack - A10-9600P
174 \ ``amdgcn`` APU - xnack - E2-9010
177 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
178 - ``iceland`` [off] - FirePro S7100
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
186 [off] - Radeon R9 Fury
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
192 [off] - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
196 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
198 **GCN GFX9** [AMD-GCN-GFX9]_
199 -----------------------------------------------------------------------------------
200 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
201 [off] Frontier Edition
206 - Radeon Instinct MI25
207 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
212 =========== =============== ============ ===== ========= ======= ==================
214 .. _amdgpu-target-features:
219 Target features control how code is generated to support certain
220 processor specific features. Not all target features are supported by
221 all processors. The runtime must ensure that the features supported by
222 the device used to execute the code match the features enabled when
223 generating the code. A mismatch of features may result in incorrect
224 execution, or a reduction in performance.
226 The target features supported by each processor, and the default value
227 used if not specified explicitly, is listed in
228 :ref:`amdgpu-processor-table`.
230 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
236 Enable the ``xnack`` feature.
238 Disable the ``xnack`` feature.
240 .. table:: AMDGPU Target Features
241 :name: amdgpu-target-feature-table
243 ============== ==================================================
244 Target Feature Description
245 ============== ==================================================
246 -m[no-]xnack Enable/disable generating code that has
247 memory clauses that are compatible with
248 having XNACK replay enabled.
250 This is used for demand paging and page
251 migration. If XNACK replay is enabled in
252 the device, then if a page fault occurs
253 the code may execute incorrectly if the
254 ``xnack`` feature is not enabled. Executing
255 code that has the feature enabled on a
256 device that does not have XNACK replay
257 enabled will execute correctly, but may
258 be less performant than code with the
260 ============== ==================================================
262 .. _amdgpu-address-spaces:
267 The AMDGPU backend uses the following address space mappings.
269 The memory space names used in the table, aside from the region memory space, is
270 from the OpenCL standard.
272 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
274 .. table:: Address Space Mapping
275 :name: amdgpu-address-space-mapping-table
277 ================== ================= ================= ================= =================
278 LLVM Address Space Memory Space
279 ------------------ -----------------------------------------------------------------------
280 \ Current Default amdgiz/amdgizcl hcc Future Default
281 ================== ================= ================= ================= =================
282 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
283 1 Global Global Global Global
284 2 Constant Constant Constant Region (GDS)
285 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286 4 Generic (Flat) Region (GDS) Region (GDS) Constant
287 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
288 ================== ================= ================= ================= =================
291 This is the current default address space mapping used for all languages
292 except hcc. This will shortly be deprecated.
295 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296 is specified as the target triple environment value.
299 This is the current address space mapping used when ``hcc`` is specified as
300 the target triple environment value.This will shortly be deprecated.
303 This will shortly be the only address space mapping for all languages using
306 .. _amdgpu-memory-scopes:
311 This section provides LLVM memory synchronization scopes supported by the AMDGPU
312 backend memory model when the target triple OS is ``amdhsa`` (see
313 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
315 The memory model supported is based on the HSA memory model [HSA]_ which is
316 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317 relation is transitive over the synchonizes-with relation independent of scope,
318 and synchonizes-with allows the memory scope instances to be inclusive (see
319 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
321 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322 inclusion and requires the memory scopes to exactly match. However, this
323 is conservatively correct for OpenCL.
325 .. table:: AMDHSA LLVM Sync Scopes
326 :name: amdgpu-amdhsa-llvm-sync-scopes-table
328 ================ ==========================================================
329 LLVM Sync Scope Description
330 ================ ==========================================================
331 *none* The default: ``system``.
333 Synchronizes with, and participates in modification and
334 seq_cst total orderings with, other operations (except
335 image operations) for all address spaces (except private,
336 or generic that accesses private) provided the other
337 operation's sync scope is:
340 - ``agent`` and executed by a thread on the same agent.
341 - ``workgroup`` and executed by a thread in the same
343 - ``wavefront`` and executed by a thread in the same
346 ``agent`` Synchronizes with, and participates in modification and
347 seq_cst total orderings with, other operations (except
348 image operations) for all address spaces (except private,
349 or generic that accesses private) provided the other
350 operation's sync scope is:
352 - ``system`` or ``agent`` and executed by a thread on the
354 - ``workgroup`` and executed by a thread in the same
356 - ``wavefront`` and executed by a thread in the same
359 ``workgroup`` Synchronizes with, and participates in modification and
360 seq_cst total orderings with, other operations (except
361 image operations) for all address spaces (except private,
362 or generic that accesses private) provided the other
363 operation's sync scope is:
365 - ``system``, ``agent`` or ``workgroup`` and executed by a
366 thread in the same workgroup.
367 - ``wavefront`` and executed by a thread in the same
370 ``wavefront`` Synchronizes with, and participates in modification and
371 seq_cst total orderings with, other operations (except
372 image operations) for all address spaces (except private,
373 or generic that accesses private) provided the other
374 operation's sync scope is:
376 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377 and executed by a thread in the same wavefront.
379 ``singlethread`` Only synchronizes with, and participates in modification
380 and seq_cst total orderings with, other operations (except
381 image operations) running in the same thread for all
382 address spaces (for example, in signal handlers).
383 ================ ==========================================================
388 The AMDGPU backend implements the following intrinsics.
390 *This section is WIP.*
393 List AMDGPU intrinsics
398 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399 can be linked by ``lld`` to produce a standard ELF shared code object which can
400 be loaded and executed on an AMDGPU target.
405 The AMDGPU backend uses the following ELF header:
407 .. table:: AMDGPU ELF Header
408 :name: amdgpu-elf-header-table
410 ========================== ===============================
412 ========================== ===============================
413 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
414 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
415 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
416 - ``ELFOSABI_AMDGPU_HSA``
417 - ``ELFOSABI_AMDGPU_PAL``
418 - ``ELFOSABI_AMDGPU_MESA3D``
419 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420 - ``ELFABIVERSION_AMDGPU_PAL``
421 - ``ELFABIVERSION_AMDGPU_MESA3D``
422 ``e_type`` - ``ET_REL``
424 ``e_machine`` ``EM_AMDGPU``
426 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
427 ========================== ===============================
431 .. table:: AMDGPU ELF Header Enumeration Values
432 :name: amdgpu-elf-header-enumeration-values-table
434 =============================== =====
436 =============================== =====
439 ``ELFOSABI_AMDGPU_HSA`` 64
440 ``ELFOSABI_AMDGPU_PAL`` 65
441 ``ELFOSABI_AMDGPU_MESA3D`` 66
442 ``ELFABIVERSION_AMDGPU_HSA`` 1
443 ``ELFABIVERSION_AMDGPU_PAL`` 0
444 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445 =============================== =====
447 ``e_ident[EI_CLASS]``
450 * ``ELFCLASS32`` for ``r600`` architecture.
452 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
456 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
458 ``e_ident[EI_OSABI]``
459 One of the following AMD GPU architecture specific OS ABIs
460 (see :ref:`amdgpu-os-table`):
462 * ``ELFOSABI_NONE`` for *unknown* OS.
464 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
466 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
468 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
470 ``e_ident[EI_ABIVERSION]``
471 The ABI version of the AMD GPU architecture specific OS ABI to which the code
474 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
477 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
480 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
484 Can be one of the following values:
488 The type produced by the AMD GPU backend compiler as it is relocatable code
492 The type produced by the linker as it is a shared code object.
494 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
497 The value ``EM_AMDGPU`` is used for the machine for all processors supported
498 by the ``r600`` and ``amdgcn`` architectures (see
499 :ref:`amdgpu-processor-table`). The specific processor is specified in the
500 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501 :ref:`amdgpu-elf-header-e_flags-table`).
504 The entry point is 0 as the entry points for individual kernels must be
505 selected in order to invoke them through AQL packets.
508 The AMDGPU backend uses the following ELF header flags:
510 .. table:: AMDGPU ELF Header ``e_flags``
511 :name: amdgpu-elf-header-e_flags-table
513 ================================= ========== =============================
514 Name Value Description
515 ================================= ========== =============================
516 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
517 -------------------------------------------- -----------------------------
518 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
520 ``EF_AMDGPU_MACH_xxx`` values
522 :ref:`amdgpu-ef-amdgpu-mach-table`.
523 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
526 contained in the code object.
528 :ref:`amdgpu-target-features`.
529 ================================= ========== =============================
531 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532 :name: amdgpu-ef-amdgpu-mach-table
534 ================================= ========== =============================
535 Name Value Description (see
536 :ref:`amdgpu-processor-table`)
537 ================================= ========== =============================
538 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
539 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
540 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
541 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
542 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
543 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
544 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
545 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
546 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
547 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
548 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
549 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
550 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
551 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
552 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
553 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
554 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
555 *reserved* 17-31 Reserved for ``r600``
556 architecture processors.
557 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
562 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
563 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
564 *reserved* 39 Reserved.
565 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 40 ``gfx801``
566 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 41 ``gfx802``
567 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 42 ``gfx803``
568 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 43 ``gfx810``
569 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 44 ``gfx900``
570 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 45 ``gfx902``
571 ================================= ========== =============================
576 An AMDGPU target ELF code object has the standard ELF sections which include:
578 .. table:: AMDGPU ELF Sections
579 :name: amdgpu-elf-sections-table
581 ================== ================ =================================
583 ================== ================ =================================
584 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
585 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
586 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
587 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
588 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
589 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
590 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
591 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
592 ``.note`` ``SHT_NOTE`` *none*
593 ``.rela``\ *name* ``SHT_RELA`` *none*
594 ``.rela.dyn`` ``SHT_RELA`` *none*
595 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
596 ``.shstrtab`` ``SHT_STRTAB`` *none*
597 ``.strtab`` ``SHT_STRTAB`` *none*
598 ``.symtab`` ``SHT_SYMTAB`` *none*
599 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
600 ================== ================ =================================
602 These sections have their standard meanings (see [ELF]_) and are only generated
606 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
607 DWARF produced by the AMDGPU backend.
609 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
610 The standard sections used by a dynamic loader.
613 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
616 ``.rela``\ *name*, ``.rela.dyn``
617 For relocatable code objects, *name* is the name of the section that the
618 relocation records apply. For example, ``.rela.text`` is the section name for
619 relocation records associated with the ``.text`` section.
621 For linked shared code objects, ``.rela.dyn`` contains all the relocation
622 records from each of the relocatable code object's ``.rela``\ *name* sections.
624 See :ref:`amdgpu-relocation-records` for the relocation records supported by
628 The executable machine code for the kernels and functions they call. Generated
629 as position independent code. See :ref:`amdgpu-code-conventions` for
630 information on conventions used in the isa generation.
632 .. _amdgpu-note-records:
637 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
638 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
639 aligned. In addition, minimal zero byte padding must be generated to ensure the
640 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
641 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
643 The AMDGPU backend code object uses the following ELF note records in the
644 ``.note`` section. The *Description* column specifies the layout of the note
645 record's ``desc`` field. All fields are consecutive bytes. Note records with
646 variable size strings have a corresponding ``*_size`` field that specifies the
647 number of bytes, including the terminating null character, in the string. The
648 string(s) come immediately after the preceding fields.
650 Additional note records can be present.
652 .. table:: AMDGPU ELF Note Records
653 :name: amdgpu-elf-note-records-table
655 ===== ============================== ======================================
656 Name Type Description
657 ===== ============================== ======================================
658 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
659 ===== ============================== ======================================
663 .. table:: AMDGPU ELF Note Record Enumeration Values
664 :name: amdgpu-elf-note-record-enumeration-values-table
666 ============================== =====
668 ============================== =====
670 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
672 ============================== =====
674 ``NT_AMD_AMDGPU_HSA_METADATA``
675 Specifies extensible metadata associated with the code objects executed on HSA
676 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
677 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
678 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
679 object metadata string.
686 Symbols include the following:
688 .. table:: AMDGPU ELF Symbols
689 :name: amdgpu-elf-symbols-table
691 ===================== ============== ============= ==================
692 Name Type Section Description
693 ===================== ============== ============= ==================
694 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
697 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
698 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
699 ===================== ============== ============= ==================
702 Global variables both used and defined by the compilation unit.
704 If the symbol is defined in the compilation unit then it is allocated in the
705 appropriate section according to if it has initialized data or is readonly.
707 If the symbol is external then its section is ``STN_UNDEF`` and the loader
708 will resolve relocations using the definition provided by another code object
709 or explicitly defined by the runtime.
711 All global symbols, whether defined in the compilation unit or external, are
712 accessed by the machine code indirectly through a GOT table entry. This
713 allows them to be preemptable. The GOT table is only supported when the target
714 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
717 Add description of linked shared object symbols. Seems undefined symbols
718 are marked as STT_NOTYPE.
721 Every HSA kernel has an associated kernel descriptor. It is the address of the
722 kernel descriptor that is used in the AQL dispatch packet used to invoke the
723 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
724 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
727 Every HSA kernel also has a symbol for its machine code entry point.
729 .. _amdgpu-relocation-records:
734 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
735 relocatable fields are:
738 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
739 alignment. These values use the same byte order as other word values in the
740 AMD GPU architecture.
743 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
744 alignment. These values use the same byte order as other word values in the
745 AMD GPU architecture.
747 Following notations are used for specifying relocation calculations:
750 Represents the addend used to compute the value of the relocatable field.
753 Represents the offset into the global offset table at which the relocation
754 entry's symbol will reside during execution.
757 Represents the address of the global offset table.
760 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
761 of the storage unit being relocated (computed using ``r_offset``).
764 Represents the value of the symbol whose index resides in the relocation
765 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
768 Represents the base address of a loaded executable or shared object which is
769 the difference between the ELF address and the actual load address. Relocations
770 using this are only valid in executable or shared objects.
772 The following relocation types are supported:
774 .. table:: AMDGPU ELF Relocation Records
775 :name: amdgpu-elf-relocation-records-table
777 ========================== ======= ===== ========== ==============================
778 Relocation Type Kind Value Field Calculation
779 ========================== ======= ===== ========== ==============================
780 ``R_AMDGPU_NONE`` 0 *none* *none*
781 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
782 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
783 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
784 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
785 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
786 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
787 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
788 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
789 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
790 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
791 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
793 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
794 ========================== ======= ===== ========== ==============================
801 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
802 information that maps the code object executable code and data to the source
803 language constructs. It can be used by tools such as debuggers and profilers.
805 Address Space Mapping
806 ~~~~~~~~~~~~~~~~~~~~~
808 The following address space mapping is used:
810 .. table:: AMDGPU DWARF Address Space Mapping
811 :name: amdgpu-dwarf-address-space-mapping-table
813 =================== =================
814 DWARF Address Space Memory Space
815 =================== =================
820 *omitted* Generic (Flat)
821 *not supported* Region (GDS)
822 =================== =================
824 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
827 An ``address_class`` attribute is generated on pointer type DIEs to specify the
828 DWARF address space of the value of the pointer when it is in the *private* or
829 *local* address space. Otherwise the attribute is omitted.
831 An ``XDEREF`` operation is generated in location list expressions for variables
832 that are allocated in the *private* and *local* address space. Otherwise no
833 ``XDREF`` is omitted.
838 *This section is WIP.*
841 Define DWARF register enumeration.
843 If want to present a wavefront state then should expose vector registers as
844 64 wide (rather than per work-item view that LLVM uses). Either as separate
845 registers, or a 64x4 byte single register. In either case use a new LANE op
846 (akin to XDREF) to select the current lane usage in a location
847 expression. This would also allow scalar register spilling to vector register
848 lanes to be expressed (currently no debug information is being generated for
849 spilling). If choose a wide single register approach then use LANE in
850 conjunction with PIECE operation to select the dword part of the register for
851 the current lane. If the separate register approach then use LANE to select
857 *This section is WIP.*
860 DWARF extension to include runtime generated source text.
862 .. _amdgpu-code-conventions:
867 This section provides code conventions used for each supported target triple OS
868 (see :ref:`amdgpu-target-triples`).
873 This section provides code conventions used when the target triple OS is
874 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
876 .. _amdgpu-amdhsa-hsa-code-object-metadata:
881 The code object metadata specifies extensible metadata associated with the code
882 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
883 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
884 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
885 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
886 information necessary to support the ROCM kernel queries. For example, the
887 segment sizes needed in a dispatch packet. In addition, a high level language
888 runtime may require other information to be included. For example, the AMD
889 OpenCL runtime records kernel argument information.
891 The metadata is specified as a YAML formatted string (see [YAML]_ and
895 Is the string null terminated? It probably should not if YAML allows it to
896 contain null characters, otherwise it should be.
898 The metadata is represented as a single YAML document comprised of the mapping
899 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
902 For boolean values, the string values of ``false`` and ``true`` are used for
903 false and true respectively.
905 Additional information can be added to the mappings. To avoid conflicts, any
906 non-AMD key names should be prefixed by "*vendor-name*.".
908 .. table:: AMDHSA Code Object Metadata Mapping
909 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
911 ========== ============== ========= =======================================
912 String Key Value Type Required? Description
913 ========== ============== ========= =======================================
914 "Version" sequence of Required - The first integer is the major
915 2 integers version. Currently 1.
916 - The second integer is the minor
917 version. Currently 0.
918 "Printf" sequence of Each string is encoded information
919 strings about a printf function call. The
920 encoded information is organized as
921 fields separated by colon (':'):
923 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
928 A 32 bit integer as a unique id for
929 each printf function call
932 A 32 bit integer equal to the number
933 of arguments of printf function call
936 ``S[i]`` (where i = 0, 1, ... , N-1)
937 32 bit integers for the size in bytes
938 of the i-th FormatString argument of
939 the printf function call
942 The format string passed to the
943 printf function call.
944 "Kernels" sequence of Required Sequence of the mappings for each
945 mapping kernel in the code object. See
946 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
947 for the definition of the mapping.
948 ========== ============== ========= =======================================
952 .. table:: AMDHSA Code Object Kernel Metadata Mapping
953 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
955 ================= ============== ========= ================================
956 String Key Value Type Required? Description
957 ================= ============== ========= ================================
958 "Name" string Required Source name of the kernel.
959 "SymbolName" string Required Name of the kernel
960 descriptor ELF symbol.
961 "Language" string Source language of the kernel.
969 "LanguageVersion" sequence of - The first integer is the major
971 - The second integer is the
973 "Attrs" mapping Mapping of kernel attributes.
975 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
976 for the mapping definition.
977 "Args" sequence of Sequence of mappings of the
978 mapping kernel arguments. See
979 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
980 for the definition of the mapping.
981 "CodeProps" mapping Mapping of properties related to
983 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
984 for the mapping definition.
985 ================= ============== ========= ================================
989 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
990 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
992 =================== ============== ========= ==============================
993 String Key Value Type Required? Description
994 =================== ============== ========= ==============================
995 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
996 3 integers must be >=1 and the dispatch
997 work-group size X, Y, Z must
998 correspond to the specified
999 values. Defaults to 0, 0, 0.
1001 Corresponds to the OpenCL
1002 ``reqd_work_group_size``
1004 "WorkGroupSizeHint" sequence of The dispatch work-group size
1005 3 integers X, Y, Z is likely to be the
1008 Corresponds to the OpenCL
1009 ``work_group_size_hint``
1011 "VecTypeHint" string The name of a scalar or vector
1014 Corresponds to the OpenCL
1015 ``vec_type_hint`` attribute.
1017 "RuntimeHandle" string The external symbol name
1018 associated with a kernel.
1019 OpenCL runtime allocates a
1020 global buffer for the symbol
1021 and saves the kernel's address
1022 to it, which is used for
1023 device side enqueueing. Only
1024 available for device side
1026 =================== ============== ========= ==============================
1030 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1031 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1033 ================= ============== ========= ================================
1034 String Key Value Type Required? Description
1035 ================= ============== ========= ================================
1036 "Name" string Kernel argument name.
1037 "TypeName" string Kernel argument type name.
1038 "Size" integer Required Kernel argument size in bytes.
1039 "Align" integer Required Kernel argument alignment in
1040 bytes. Must be a power of two.
1041 "ValueKind" string Required Kernel argument kind that
1042 specifies how to set up the
1043 corresponding argument.
1047 The argument is copied
1048 directly into the kernarg.
1051 A global address space pointer
1052 to the buffer data is passed
1055 "DynamicSharedPointer"
1056 A group address space pointer
1057 to dynamically allocated LDS
1058 is passed in the kernarg.
1061 A global address space
1062 pointer to a S# is passed in
1066 A global address space
1067 pointer to a T# is passed in
1071 A global address space pointer
1072 to an OpenCL pipe is passed in
1076 A global address space pointer
1077 to an OpenCL device enqueue
1078 queue is passed in the
1081 "HiddenGlobalOffsetX"
1082 The OpenCL grid dispatch
1083 global offset for the X
1084 dimension is passed in the
1087 "HiddenGlobalOffsetY"
1088 The OpenCL grid dispatch
1089 global offset for the Y
1090 dimension is passed in the
1093 "HiddenGlobalOffsetZ"
1094 The OpenCL grid dispatch
1095 global offset for the Z
1096 dimension is passed in the
1100 An argument that is not used
1101 by the kernel. Space needs to
1102 be left for it, but it does
1103 not need to be set up.
1105 "HiddenPrintfBuffer"
1106 A global address space pointer
1107 to the runtime printf buffer
1108 is passed in kernarg.
1110 "HiddenDefaultQueue"
1111 A global address space pointer
1112 to the OpenCL device enqueue
1113 queue that should be used by
1114 the kernel by default is
1115 passed in the kernarg.
1117 "HiddenCompletionAction"
1118 A global address space pointer
1119 to help link enqueued kernels into
1120 the ancestor tree for determining
1121 when the parent kernel has finished.
1123 "ValueType" string Required Kernel argument value type. Only
1124 present if "ValueKind" is
1125 "ByValue". For vector data
1126 types, the value is for the
1127 element type. Values include:
1143 How can it be determined if a
1144 vector type, and what size
1146 "PointeeAlign" integer Alignment in bytes of pointee
1147 type for pointer type kernel
1148 argument. Must be a power
1149 of 2. Only present if
1151 "DynamicSharedPointer".
1152 "AddrSpaceQual" string Kernel argument address space
1153 qualifier. Only present if
1154 "ValueKind" is "GlobalBuffer" or
1155 "DynamicSharedPointer". Values
1166 Is GlobalBuffer only Global
1168 DynamicSharedPointer always
1169 Local? Can HCC allow Generic?
1170 How can Private or Region
1172 "AccQual" string Kernel argument access
1173 qualifier. Only present if
1174 "ValueKind" is "Image" or
1185 "ActualAccQual" string The actual memory accesses
1186 performed by the kernel on the
1187 kernel argument. Only present if
1188 "ValueKind" is "GlobalBuffer",
1189 "Image", or "Pipe". This may be
1190 more restrictive than indicated
1191 by "AccQual" to reflect what the
1192 kernel actual does. If not
1193 present then the runtime must
1194 assume what is implied by
1195 "AccQual" and "IsConst". Values
1202 "IsConst" boolean Indicates if the kernel argument
1203 is const qualified. Only present
1207 "IsRestrict" boolean Indicates if the kernel argument
1208 is restrict qualified. Only
1209 present if "ValueKind" is
1212 "IsVolatile" boolean Indicates if the kernel argument
1213 is volatile qualified. Only
1214 present if "ValueKind" is
1217 "IsPipe" boolean Indicates if the kernel argument
1218 is pipe qualified. Only present
1219 if "ValueKind" is "Pipe".
1222 Can GlobalBuffer be pipe
1224 ================= ============== ========= ================================
1228 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1231 ============================ ============== ========= =====================
1232 String Key Value Type Required? Description
1233 ============================ ============== ========= =====================
1234 "KernargSegmentSize" integer Required The size in bytes of
1236 that holds the values
1239 "GroupSegmentFixedSize" integer Required The amount of group
1243 bytes. This does not
1245 dynamically allocated
1246 group segment memory
1250 "PrivateSegmentFixedSize" integer Required The amount of fixed
1251 private address space
1252 memory required for a
1254 bytes. If the kernel
1256 stack then additional
1258 to this value for the
1260 "KernargSegmentAlign" integer Required The maximum byte
1263 kernarg segment. Must
1265 "WavefrontSize" integer Required Wavefront size. Must
1267 "NumSGPRs" integer Required Number of scalar
1271 includes the special
1277 SGPR added if a trap
1283 "NumVGPRs" integer Required Number of vector
1287 "MaxFlatWorkGroupSize" integer Required Maximum flat
1290 kernel in work-items.
1293 ReqdWorkGroupSize if
1295 "NumSpilledSGPRs" integer Number of stores from
1296 a scalar register to
1297 a register allocator
1300 "NumSpilledVGPRs" integer Number of stores from
1301 a vector register to
1302 a register allocator
1305 ============================ ============== ========= =====================
1312 The HSA architected queuing language (AQL) defines a user space memory interface
1313 that can be used to control the dispatch of kernels, in an agent independent
1314 way. An agent can have zero or more AQL queues created for it using the ROCm
1315 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1316 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1317 mechanics and packet layouts.
1319 The packet processor of a kernel agent is responsible for detecting and
1320 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1321 packet processor is implemented by the hardware command processor (CP),
1322 asynchronous dispatch controller (ADC) and shader processor input controller
1325 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1326 mode driver to initialize and register the AQL queue with CP.
1328 To dispatch a kernel the following actions are performed. This can occur in the
1329 CPU host program, or from an HSA kernel executing on a GPU.
1331 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1332 executed is obtained.
1333 2. A pointer to the kernel descriptor (see
1334 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1335 obtained. It must be for a kernel that is contained in a code object that that
1336 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1338 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1339 for a memory region with the kernarg property for the kernel agent that will
1340 execute the kernel. It must be at least 16 byte aligned.
1341 4. Kernel argument values are assigned to the kernel argument memory
1342 allocation. The layout is defined in the *HSA Programmer's Language Reference*
1343 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1344 memory in the same way constant memory is accessed. (Note that the HSA
1345 specification allows an implementation to copy the kernel argument contents to
1346 another location that is accessed by the kernel.)
1347 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1348 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1349 packet. The packet must be set up, and the final write must use an atomic
1350 store release to set the packet kind to ensure the packet contents are
1351 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1352 notify the kernel agent that the AQL queue has been updated. These rules, and
1353 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1354 System Architecture Specification* [HSA]_.
1355 6. A kernel dispatch packet includes information about the actual dispatch,
1356 such as grid and work-group size, together with information from the code
1357 object about the kernel, such as segment sizes. The ROCm runtime queries on
1358 the kernel symbol can be used to obtain the code object values which are
1359 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1360 7. CP executes micro-code and is responsible for detecting and setting up the
1361 GPU to execute the wavefronts of a kernel dispatch.
1362 8. CP ensures that when the a wavefront starts executing the kernel machine
1363 code, the scalar general purpose registers (SGPR) and vector general purpose
1364 registers (VGPR) are set up as required by the machine code. The required
1365 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1366 register state is defined in
1367 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1368 9. The prolog of the kernel machine code (see
1369 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1370 before continuing executing the machine code that corresponds to the kernel.
1371 10. When the kernel dispatch has completed execution, CP signals the completion
1372 signal specified in the kernel dispatch packet if not 0.
1374 .. _amdgpu-amdhsa-memory-spaces:
1379 The memory space properties are:
1381 .. table:: AMDHSA Memory Spaces
1382 :name: amdgpu-amdhsa-memory-spaces-table
1384 ================= =========== ======== ======= ==================
1385 Memory Space Name HSA Segment Hardware Address NULL Value
1387 ================= =========== ======== ======= ==================
1388 Private private scratch 32 0x00000000
1389 Local group LDS 32 0xFFFFFFFF
1390 Global global global 64 0x0000000000000000
1391 Constant constant *same as 64 0x0000000000000000
1393 Generic flat flat 64 0x0000000000000000
1394 Region N/A GDS 32 *not implemented
1396 ================= =========== ======== ======= ==================
1398 The global and constant memory spaces both use global virtual addresses, which
1399 are the same virtual address space used by the CPU. However, some virtual
1400 addresses may only be accessible to the CPU, some only accessible by the GPU,
1403 Using the constant memory space indicates that the data will not change during
1404 the execution of the kernel. This allows scalar read instructions to be
1405 used. The vector and scalar L1 caches are invalidated of volatile data before
1406 each kernel dispatch execution to allow constant memory to change values between
1409 The local memory space uses the hardware Local Data Store (LDS) which is
1410 automatically allocated when the hardware creates work-groups of wavefronts, and
1411 freed when all the wavefronts of a work-group have terminated. The data store
1412 (DS) instructions can be used to access it.
1414 The private memory space uses the hardware scratch memory support. If the kernel
1415 uses scratch, then the hardware allocates memory that is accessed using
1416 wavefront lane dword (4 byte) interleaving. The mapping used from private
1417 address to physical address is:
1419 ``wavefront-scratch-base +
1420 (private-address * wavefront-size * 4) +
1421 (wavefront-lane-id * 4)``
1423 There are different ways that the wavefront scratch base address is determined
1424 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1425 memory can be accessed in an interleaved manner using buffer instruction with
1426 the scratch buffer descriptor and per wave scratch offset, by the scratch
1427 instructions, or by flat instructions. If each lane of a wavefront accesses the
1428 same private address, the interleaving results in adjacent dwords being accessed
1429 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1430 supported except by flat and scratch instructions in GFX9.
1432 The generic address space uses the hardware flat address support available in
1433 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1434 local appertures), that are outside the range of addressible global memory, to
1435 map from a flat address to a private or local address.
1437 FLAT instructions can take a flat address and access global, private (scratch)
1438 and group (LDS) memory depending in if the address is within one of the
1439 apperture ranges. Flat access to scratch requires hardware aperture setup and
1440 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1441 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1442 (see :ref:`amdgpu-amdhsa-m0`).
1444 To convert between a segment address and a flat address the base address of the
1445 appertures address can be used. For GFX7-GFX8 these are available in the
1446 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1447 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1448 GFX9 the appature base addresses are directly available as inline constant
1449 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1450 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1451 which makes it easier to convert from flat to segment or segment to flat.
1456 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1457 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1458 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1459 enumeration values for the queries that are not trivially deducible from the S#
1465 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1466 structure allocated in memory accessible from both the CPU and GPU. The
1467 structure is defined by the ROCm runtime and subject to change between releases
1468 (see [AMD-ROCm-github]_).
1470 .. _amdgpu-amdhsa-hsa-aql-queue:
1475 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1476 between releases (see [AMD-ROCm-github]_). For some processors it contains
1477 fields needed to implement certain language features such as the flat address
1478 aperture bases. It also contains fields used by CP such as managing the
1479 allocation of scratch memory.
1481 .. _amdgpu-amdhsa-kernel-descriptor:
1486 A kernel descriptor consists of the information needed by CP to initiate the
1487 execution of a kernel, including the entry point address of the machine code
1488 that implements the kernel.
1490 Kernel Descriptor for GFX6-GFX9
1491 +++++++++++++++++++++++++++++++
1493 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1495 .. table:: Kernel Descriptor for GFX6-GFX9
1496 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1498 ======= ======= =============================== ============================
1499 Bits Size Field Name Description
1500 ======= ======= =============================== ============================
1501 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
1502 address space memory
1503 required for a work-group
1504 in bytes. This does not
1505 include any dynamically
1506 allocated local address
1507 space memory that may be
1508 added when the kernel is
1510 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
1511 private address space
1512 memory required for a
1513 work-item in bytes. If
1514 is_dynamic_callstack is 1
1515 then additional space must
1516 be added to this value for
1518 127:64 8 bytes Reserved, must be 0.
1519 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
1522 descriptor to kernel's
1523 entry point instruction
1524 which must be 256 byte
1526 383:192 24 Reserved, must be 0.
1528 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
1529 program settings used by
1531 ``COMPUTE_PGM_RSRC1``
1534 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1535 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
1536 program settings used by
1538 ``COMPUTE_PGM_RSRC2``
1541 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1542 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1543 SGPR user data registers
1545 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1547 The total number of SGPR
1549 requested must not exceed
1550 16 and match value in
1551 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1552 Any requests beyond 16
1554 449 1 bit EnableSGPRDispatchPtr *see above*
1555 450 1 bit EnableSGPRQueuePtr *see above*
1556 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1557 452 1 bit EnableSGPRDispatchID *see above*
1558 453 1 bit EnableSGPRFlatScratchInit *see above*
1559 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1560 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1562 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1564 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1566 463:458 6 bits Reserved, must be 0.
1567 511:464 6 Reserved, must be 0.
1569 512 **Total size 64 bytes.**
1570 ======= ====================================================================
1574 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1575 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1577 ======= ======= =============================== ===========================================================================
1578 Bits Size Field Name Description
1579 ======= ======= =============================== ===========================================================================
1580 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
1581 used by each work-item,
1582 granularity is device
1587 - roundup((max_vgpg + 1)
1590 Used by CP to set up
1591 ``COMPUTE_PGM_RSRC1.VGPRS``.
1592 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1593 used by a wavefront,
1594 granularity is device
1599 - roundup((max_sgpg + 1)
1603 - roundup((max_sgpg + 1)
1606 Includes the special SGPRs
1607 for VCC, Flat Scratch (for
1608 GFX7 onwards) and XNACK
1609 (for GFX8 onwards). It does
1610 not include the 16 SGPR
1611 added if a trap handler is
1614 Used by CP to set up
1615 ``COMPUTE_PGM_RSRC1.SGPRS``.
1616 11:10 2 bits PRIORITY Must be 0.
1618 Start executing wavefront
1619 at the specified priority.
1621 CP is responsible for
1623 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1624 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1625 with specified rounding
1628 precision floating point
1631 Floating point rounding
1632 mode values are defined in
1633 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1635 Used by CP to set up
1636 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1637 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1638 with specified rounding
1639 denorm mode for half/double (16
1640 and 64 bit) floating point
1641 precision floating point
1644 Floating point rounding
1645 mode values are defined in
1646 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1648 Used by CP to set up
1649 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1650 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1651 with specified denorm mode
1654 precision floating point
1657 Floating point denorm mode
1658 values are defined in
1659 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1661 Used by CP to set up
1662 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1663 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1664 with specified denorm mode
1666 and 64 bit) floating point
1667 precision floating point
1670 Floating point denorm mode
1671 values are defined in
1672 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1674 Used by CP to set up
1675 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1676 20 1 bit PRIV Must be 0.
1678 Start executing wavefront
1679 in privilege trap handler
1682 CP is responsible for
1684 ``COMPUTE_PGM_RSRC1.PRIV``.
1685 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1686 with DX10 clamp mode
1687 enabled. Used by the vector
1688 ALU to force DX10 style
1689 treatment of NaN's (when
1690 set, clamp NaN to zero,
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1696 22 1 bit DEBUG_MODE Must be 0.
1698 Start executing wavefront
1699 in single step mode.
1701 CP is responsible for
1703 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1704 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1706 enabled. Floating point
1707 opcodes that support
1708 exception flag gathering
1709 will quiet and propagate
1710 signaling-NaN inputs per
1711 IEEE 754-2008. Min_dx10 and
1712 max_dx10 become IEEE
1713 754-2008 compliant due to
1714 signaling-NaN propagation
1717 Used by CP to set up
1718 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1719 24 1 bit BULKY Must be 0.
1721 Only one work-group allowed
1722 to execute on a compute
1725 CP is responsible for
1727 ``COMPUTE_PGM_RSRC1.BULKY``.
1728 25 1 bit CDBG_USER Must be 0.
1730 Flag that can be used to
1731 control debugging code.
1733 CP is responsible for
1735 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1736 26 1 bit FP16_OVFL GFX6-GFX8
1737 Reserved, must be 0.
1739 Wavefront starts execution
1740 with specified fp16 overflow
1743 - If 0, fp16 overflow generates
1745 - If 1, fp16 overflow that is the
1746 result of an +/-INF input value
1747 or divide by 0 produces a +/-INF,
1748 otherwise clamps computed
1749 overflow to +/-MAX_FP16 as
1752 Used by CP to set up
1753 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1754 31:27 5 bits Reserved, must be 0.
1755 32 **Total size 4 bytes**
1756 ======= ===================================================================================================================
1760 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1761 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1763 ======= ======= =============================== ===========================================================================
1764 Bits Size Field Name Description
1765 ======= ======= =============================== ===========================================================================
1766 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1767 _WAVE_OFFSET SGPR wave scratch offset
1768 system register (see
1769 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1771 Used by CP to set up
1772 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1773 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1775 requested. This number must
1776 match the number of user
1777 data registers enabled.
1779 Used by CP to set up
1780 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1781 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
1782 TRAP instruction which
1783 requires a trap handler to
1787 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1789 installed a trap handler
1790 regardless of the setting
1792 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1793 system SGPR register for
1794 the work-group id in the X
1796 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1800 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1801 system SGPR register for
1802 the work-group id in the Y
1804 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1806 Used by CP to set up
1807 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1808 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1809 system SGPR register for
1810 the work-group id in the Z
1812 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1814 Used by CP to set up
1815 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1816 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1817 system SGPR register for
1818 work-group information (see
1819 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1821 Used by CP to set up
1822 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1823 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1824 VGPR system registers used
1825 for the work-item ID.
1826 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1829 Used by CP to set up
1830 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1831 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1833 Wavefront starts execution
1835 exceptions enabled which
1836 are generated when L1 has
1837 witnessed a thread access
1841 CP is responsible for
1842 filling in the address
1844 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1845 according to what the
1847 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1849 Wavefront starts execution
1850 with memory violation
1851 exceptions exceptions
1852 enabled which are generated
1853 when a memory violation has
1854 occurred for this wave from
1856 (write-to-read-only-memory,
1857 mis-aligned atomic, LDS
1858 address out of range,
1859 illegal address, etc.).
1863 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1864 according to what the
1866 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
1868 CP uses the rounded value
1869 from the dispatch packet,
1870 not this value, as the
1871 dispatch may contain
1872 dynamically allocated group
1873 segment memory. CP writes
1875 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1877 Amount of group segment
1878 (LDS) to allocate for each
1879 work-group. Granularity is
1883 roundup(lds-size / (64 * 4))
1885 roundup(lds-size / (128 * 4))
1887 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1888 _INVALID_OPERATION with specified exceptions
1891 Used by CP to set up
1892 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1893 (set from bits 0..6).
1897 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1898 _SOURCE input operands is a
1900 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1901 _DIVISION_BY_ZERO Zero
1902 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1904 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1906 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1908 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1909 _ZERO (rcp_iflag_f32 instruction
1911 31 1 bit Reserved, must be 0.
1912 32 **Total size 4 bytes.**
1913 ======= ===================================================================================================================
1917 .. table:: Floating Point Rounding Mode Enumeration Values
1918 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1920 ====================================== ===== ==============================
1921 Enumeration Name Value Description
1922 ====================================== ===== ==============================
1923 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1924 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1925 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1926 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1927 ====================================== ===== ==============================
1931 .. table:: Floating Point Denorm Mode Enumeration Values
1932 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1934 ====================================== ===== ==============================
1935 Enumeration Name Value Description
1936 ====================================== ===== ==============================
1937 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1939 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1940 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1941 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1942 ====================================== ===== ==============================
1946 .. table:: System VGPR Work-Item ID Enumeration Values
1947 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1949 ======================================== ===== ============================
1950 Enumeration Name Value Description
1951 ======================================== ===== ============================
1952 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1954 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1956 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1958 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1959 ======================================== ===== ============================
1961 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1963 Initial Kernel Execution State
1964 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1966 This section defines the register state that will be set up by the packet
1967 processor prior to the start of execution of every wavefront. This is limited by
1968 the constraints of the hardware controllers of CP/ADC/SPI.
1970 The order of the SGPR registers is defined, but the compiler can specify which
1971 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1972 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1973 for enabled registers are dense starting at SGPR0: the first enabled register is
1974 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1977 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1978 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1979 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1980 initialized. These are then immediately followed by the System SGPRs that are
1981 set up by ADC/SPI and can have different values for each wave of the grid
1984 SGPR register initial state is defined in
1985 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1987 .. table:: SGPR Register Set Up Order
1988 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1990 ========== ========================== ====== ==============================
1991 SGPR Order Name Number Description
1992 (kernel descriptor enable of
1994 ========== ========================== ====== ==============================
1995 First Private Segment Buffer 4 V# that can be used, together
1996 (enable_sgpr_private with Scratch Wave Offset as an
1997 _segment_buffer) offset, to access the private
1998 memory space using a segment
2001 CP uses the value provided by
2003 then Dispatch Ptr 2 64 bit address of AQL dispatch
2004 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2006 then Queue Ptr 2 64 bit address of amd_queue_t
2007 (enable_sgpr_queue_ptr) object for AQL queue on which
2008 the dispatch packet was
2010 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2011 (enable_sgpr_kernarg segment. This is directly
2012 _segment_ptr) copied from the
2013 kernarg_address in the kernel
2016 Having CP load it once avoids
2017 loading it at the beginning of
2019 then Dispatch Id 2 64 bit Dispatch ID of the
2020 (enable_sgpr_dispatch_id) dispatch packet being
2022 then Flat Scratch Init 2 This is 2 SGPRs:
2023 (enable_sgpr_flat_scratch
2027 The first SGPR is a 32 bit
2029 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2030 to per SPI base of memory
2031 for scratch for the queue
2032 executing the kernel
2033 dispatch. CP obtains this
2034 from the runtime. (The
2035 Scratch Segment Buffer base
2037 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2038 plus this offset.) The value
2039 of Scratch Wave Offset must
2040 be added to this offset by
2041 the kernel machine code,
2042 right shifted by 8, and
2043 moved to the FLAT_SCRATCH_HI
2045 FLAT_SCRATCH_HI corresponds
2046 to SGPRn-4 on GFX7, and
2047 SGPRn-6 on GFX8 (where SGPRn
2048 is the highest numbered SGPR
2049 allocated to the wave).
2051 multiplied by 256 (as it is
2052 in units of 256 bytes) and
2054 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2055 to calculate the per wave
2056 FLAT SCRATCH BASE in flat
2057 memory instructions that
2061 The second SGPR is 32 bit
2062 byte size of a single
2063 work-item's scratch memory
2064 usage. CP obtains this from
2065 the runtime, and it is
2066 always a multiple of DWORD.
2067 CP checks that the value in
2068 the kernel dispatch packet
2069 Private Segment Byte Size is
2070 not larger, and requests the
2071 runtime to increase the
2072 queue's scratch size if
2073 necessary. The kernel code
2075 FLAT_SCRATCH_LO which is
2076 SGPRn-3 on GFX7 and SGPRn-5
2077 on GFX8. FLAT_SCRATCH_LO is
2078 used as the FLAT SCRATCH
2080 instructions. Having CP load
2081 it once avoids loading it at
2082 the beginning of every
2086 64 bit base address of the
2087 per SPI scratch backing
2088 memory managed by SPI for
2089 the queue executing the
2090 kernel dispatch. CP obtains
2091 this from the runtime (and
2092 divides it if there are
2093 multiple Shader Arrays each
2094 with its own SPI). The value
2095 of Scratch Wave Offset must
2096 be added by the kernel
2097 machine code and the result
2098 moved to the FLAT_SCRATCH
2099 SGPR which is SGPRn-6 and
2100 SGPRn-5. It is used as the
2101 FLAT SCRATCH BASE in flat
2102 memory instructions.
2103 then Private Segment Size 1 The 32 bit byte size of a
2104 (enable_sgpr_private single
2106 scratch_segment_size) memory
2107 allocation. This is the
2108 value from the kernel
2109 dispatch packet Private
2110 Segment Byte Size rounded up
2111 by CP to a multiple of
2114 Having CP load it once avoids
2115 loading it at the beginning of
2118 This is not used for
2119 GFX7-GFX8 since it is the same
2120 value as the second SGPR of
2121 Flat Scratch Init. However, it
2122 may be needed for GFX9 which
2123 changes the meaning of the
2124 Flat Scratch Init value.
2125 then Grid Work-Group Count X 1 32 bit count of the number of
2126 (enable_sgpr_grid work-groups in the X dimension
2127 _workgroup_count_X) for the grid being
2128 executed. Computed from the
2129 fields in the kernel dispatch
2130 packet as ((grid_size.x +
2131 workgroup_size.x - 1) /
2133 then Grid Work-Group Count Y 1 32 bit count of the number of
2134 (enable_sgpr_grid work-groups in the Y dimension
2135 _workgroup_count_Y && for the grid being
2136 less than 16 previous executed. Computed from the
2137 SGPRs) fields in the kernel dispatch
2138 packet as ((grid_size.y +
2139 workgroup_size.y - 1) /
2142 Only initialized if <16
2143 previous SGPRs initialized.
2144 then Grid Work-Group Count Z 1 32 bit count of the number of
2145 (enable_sgpr_grid work-groups in the Z dimension
2146 _workgroup_count_Z && for the grid being
2147 less than 16 previous executed. Computed from the
2148 SGPRs) fields in the kernel dispatch
2149 packet as ((grid_size.z +
2150 workgroup_size.z - 1) /
2153 Only initialized if <16
2154 previous SGPRs initialized.
2155 then Work-Group Id X 1 32 bit work-group id in X
2156 (enable_sgpr_workgroup_id dimension of grid for
2158 then Work-Group Id Y 1 32 bit work-group id in Y
2159 (enable_sgpr_workgroup_id dimension of grid for
2161 then Work-Group Id Z 1 32 bit work-group id in Z
2162 (enable_sgpr_workgroup_id dimension of grid for
2164 then Work-Group Info 1 {first_wave, 14'b0000,
2165 (enable_sgpr_workgroup ordered_append_term[10:0],
2166 _info) threadgroup_size_in_waves[5:0]}
2167 then Scratch Wave Offset 1 32 bit byte offset from base
2168 (enable_sgpr_private of scratch base of queue
2169 _segment_wave_offset) executing the kernel
2170 dispatch. Must be used as an
2172 segment address when using
2173 Scratch Segment Buffer. It
2174 must be used to set up FLAT
2175 SCRATCH for flat addressing
2177 :ref:`amdgpu-amdhsa-flat-scratch`).
2178 ========== ========================== ====== ==============================
2180 The order of the VGPR registers is defined, but the compiler can specify which
2181 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2182 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2183 for enabled registers are dense starting at VGPR0: the first enabled register is
2184 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2187 VGPR register initial state is defined in
2188 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2190 .. table:: VGPR Register Set Up Order
2191 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2193 ========== ========================== ====== ==============================
2194 VGPR Order Name Number Description
2195 (kernel descriptor enable of
2197 ========== ========================== ====== ==============================
2198 First Work-Item Id X 1 32 bit work item id in X
2199 (Always initialized) dimension of work-group for
2201 then Work-Item Id Y 1 32 bit work item id in Y
2202 (enable_vgpr_workitem_id dimension of work-group for
2203 > 0) wavefront lane.
2204 then Work-Item Id Z 1 32 bit work item id in Z
2205 (enable_vgpr_workitem_id dimension of work-group for
2206 > 1) wavefront lane.
2207 ========== ========================== ====== ==============================
2209 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2211 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2213 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2214 combination including none.
2215 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2216 cannot included with the flat scratch init value which is per queue.
2217 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2220 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2221 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2223 The global segment can be accessed either using buffer instructions (GFX6 which
2224 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2225 instructions (GFX9).
2227 If buffer operations are used then the compiler can generate a V# with the
2228 following properties:
2232 * ATC: 1 if IOMMU present (such as APU)
2234 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2235 APU and NC for dGPU).
2237 .. _amdgpu-amdhsa-kernel-prolog:
2242 .. _amdgpu-amdhsa-m0:
2248 The M0 register must be initialized with a value at least the total LDS size
2249 if the kernel may access LDS via DS or flat operations. Total LDS size is
2250 available in dispatch packet. For M0, it is also possible to use maximum
2251 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2254 The M0 register is not used for range checking LDS accesses and so does not
2255 need to be initialized in the prolog.
2257 .. _amdgpu-amdhsa-flat-scratch:
2262 If the kernel may use flat operations to access scratch memory, the prolog code
2263 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2264 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2265 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2268 Flat scratch is not supported.
2271 1. The low word of Flat Scratch Init is 32 bit byte offset from
2272 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2273 being managed by SPI for the queue executing the kernel dispatch. This is
2274 the same value used in the Scratch Segment Buffer V# base address. The
2275 prolog must add the value of Scratch Wave Offset to get the wave's byte
2276 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2277 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2278 by 8 before moving into FLAT_SCRATCH_LO.
2279 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2280 work-items scratch memory usage. This is directly loaded from the kernel
2281 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2282 DWORD. Having CP load it once avoids loading it at the beginning of every
2283 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2287 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2288 memory being managed by SPI for the queue executing the kernel dispatch. The
2289 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2290 pair for use as the flat scratch base in flat memory instructions.
2292 .. _amdgpu-amdhsa-memory-model:
2297 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2298 (see :ref:`memmodel`). *The implementation is WIP.*
2301 Update when implementation complete.
2303 The AMDGPU backend supports the memory synchronization scopes specified in
2304 :ref:`amdgpu-memory-scopes`.
2306 The code sequences used to implement the memory model are defined in table
2307 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2309 The sequences specify the order of instructions that a single thread must
2310 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2311 to other memory instructions executed by the same thread. This allows them to be
2312 moved earlier or later which can allow them to be combined with other instances
2313 of the same instruction, or hoisted/sunk out of loops to improve
2314 performance. Only the instructions related to the memory model are given;
2315 additional ``s_waitcnt`` instructions are required to ensure registers are
2316 defined before being used. These may be able to be combined with the memory
2317 model ``s_waitcnt`` instructions as described above.
2319 The AMDGPU backend supports the following memory models:
2321 HSA Memory Model [HSA]_
2322 The HSA memory model uses a single happens-before relation for all address
2323 spaces (see :ref:`amdgpu-address-spaces`).
2324 OpenCL Memory Model [OpenCL]_
2325 The OpenCL memory model which has separate happens-before relations for the
2326 global and local address spaces. Only a fence specifying both global and
2327 local address space, and seq_cst instructions join the relationships. Since
2328 the LLVM ``memfence`` instruction does not allow an address space to be
2329 specified the OpenCL fence has to convervatively assume both local and
2330 global address space was specified. However, optimizations can often be
2331 done to eliminate the additional ``s_waitcnt`` instructions when there are
2332 no intervening memory instructions which access the corresponding address
2333 space. The code sequences in the table indicate what can be omitted for the
2334 OpenCL memory. The target triple environment is used to determine if the
2335 source language is OpenCL (see :ref:`amdgpu-opencl`).
2337 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2340 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2341 termed vector memory operations.
2345 * Each agent has multiple compute units (CU).
2346 * Each CU has multiple SIMDs that execute wavefronts.
2347 * The wavefronts for a single work-group are executed in the same CU but may be
2348 executed by different SIMDs.
2349 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2351 * All LDS operations of a CU are performed as wavefront wide operations in a
2352 global order and involve no caching. Completion is reported to a wavefront in
2354 * The LDS memory has multiple request queues shared by the SIMDs of a
2355 CU. Therefore, the LDS operations performed by different waves of a work-group
2356 can be reordered relative to each other, which can result in reordering the
2357 visibility of vector memory operations with respect to LDS operations of other
2358 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2359 ensure synchronization between LDS operations and vector memory operations
2360 between waves of a work-group, but not between operations performed by the
2362 * The vector memory operations are performed as wavefront wide operations and
2363 completion is reported to a wavefront in execution order. The exception is
2364 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2365 vector memory order if they access LDS memory, and out of LDS operation order
2366 if they access global memory.
2367 * The vector memory operations access a single vector L1 cache shared by all
2368 SIMDs a CU. Therefore, no special action is required for coherence between the
2369 lanes of a single wavefront, or for coherence between wavefronts in the same
2370 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2371 executing in different work-groups as they may be executing on different CUs.
2372 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2373 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2374 scalar operations are used in a restricted way so do not impact the memory
2375 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2376 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2378 * The L2 cache has independent channels to service disjoint ranges of virtual
2380 * Each CU has a separate request queue per channel. Therefore, the vector and
2381 scalar memory operations performed by waves executing in different work-groups
2382 (which may be executing on different CUs) of an agent can be reordered
2383 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2384 synchronization between vector memory operations of different CUs. It ensures a
2385 previous vector memory operation has completed before executing a subsequent
2386 vector memory or LDS operation and so can be used to meet the requirements of
2387 acquire and release.
2388 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2389 of virtual addresses can be set up to bypass it to ensure system coherence.
2391 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2392 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2393 memory, atomic memory orderings are not meaningful and all accesses are treated
2396 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2397 scalar memory instructions). Since the constant address space contents do not
2398 change during the execution of a kernel dispatch it is not legal to perform
2399 stores, and atomic memory orderings are not meaningful and all access are
2400 treated as non-atomic.
2402 A memory synchronization scope wider than work-group is not meaningful for the
2403 group (LDS) address space and is treated as work-group.
2405 The memory model does not support the region address space which is treated as
2408 Acquire memory ordering is not meaningful on store atomic instructions and is
2409 treated as non-atomic.
2411 Release memory ordering is not meaningful on load atomic instructions and is
2412 treated a non-atomic.
2414 Acquire-release memory ordering is not meaningful on load or store atomic
2415 instructions and is treated as acquire and release respectively.
2417 AMDGPU backend only uses scalar memory operations to access memory that is
2418 proven to not change during the execution of the kernel dispatch. This includes
2419 constant address space and global address space for program scope const
2420 variables. Therefore the kernel machine code does not have to maintain the
2421 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2422 and vector L1 caches are invalidated between kernel dispatches by CP since
2423 constant address space data may change between kernel dispatch executions. See
2424 :ref:`amdgpu-amdhsa-memory-spaces`.
2426 The one execption is if scalar writes are used to spill SGPR registers. In this
2427 case the AMDGPU backend ensures the memory location used to spill is never
2428 accessed by vector memory operations at the same time. If scalar writes are used
2429 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2430 return since the locations may be used for vector memory instructions by a
2431 future wave that uses the same scratch area, or a function call that creates a
2432 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2433 as all scalar writes are write-before-read in the same thread.
2435 Scratch backing memory (which is used for the private address space)
2436 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2437 address space is only accessed by a single thread, and is always
2438 write-before-read, there is never a need to invalidate these entries from the L1
2439 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2440 volatile cache lines.
2442 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2443 to invalidate the L2 cache. This also causes it to be treated as
2444 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2445 (cache coherent) and so the L2 cache will coherent with the CPU and other
2448 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2449 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2451 ============ ============ ============== ========== ===============================
2452 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2453 Ordering Sync Scope Address
2455 ============ ============ ============== ========== ===============================
2457 -----------------------------------------------------------------------------------
2458 load *none* *none* - global - !volatile & !nontemporal
2460 - private 1. buffer/global/flat_load
2462 - volatile & !nontemporal
2464 1. buffer/global/flat_load
2469 1. buffer/global/flat_load
2472 load *none* *none* - local 1. ds_load
2473 store *none* *none* - global - !nontemporal
2475 - private 1. buffer/global/flat_store
2479 1. buffer/global/flat_stote
2482 store *none* *none* - local 1. ds_store
2483 **Unordered Atomic**
2484 -----------------------------------------------------------------------------------
2485 load atomic unordered *any* *any* *Same as non-atomic*.
2486 store atomic unordered *any* *any* *Same as non-atomic*.
2487 atomicrmw unordered *any* *any* *Same as monotonic
2489 **Monotonic Atomic**
2490 -----------------------------------------------------------------------------------
2491 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2492 - wavefront - generic
2494 load atomic monotonic - singlethread - local 1. ds_load
2497 load atomic monotonic - agent - global 1. buffer/global/flat_load
2498 - system - generic glc=1
2499 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2500 - wavefront - generic
2504 store atomic monotonic - singlethread - local 1. ds_store
2507 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2508 - wavefront - generic
2512 atomicrmw monotonic - singlethread - local 1. ds_atomic
2516 -----------------------------------------------------------------------------------
2517 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2520 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2521 load atomic acquire - workgroup - local 1. ds_load
2522 2. s_waitcnt lgkmcnt(0)
2525 - Must happen before
2537 load atomic acquire - workgroup - generic 1. flat_load
2538 2. s_waitcnt lgkmcnt(0)
2541 - Must happen before
2553 load atomic acquire - agent - global 1. buffer/global/flat_load
2555 2. s_waitcnt vmcnt(0)
2557 - Must happen before
2565 3. buffer_wbinvl1_vol
2567 - Must happen before
2577 load atomic acquire - agent - generic 1. flat_load glc=1
2578 - system 2. s_waitcnt vmcnt(0) &
2583 - Must happen before
2586 - Ensures the flat_load
2591 3. buffer_wbinvl1_vol
2593 - Must happen before
2603 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2606 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2607 atomicrmw acquire - workgroup - local 1. ds_atomic
2608 2. waitcnt lgkmcnt(0)
2611 - Must happen before
2624 atomicrmw acquire - workgroup - generic 1. flat_atomic
2625 2. waitcnt lgkmcnt(0)
2628 - Must happen before
2641 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
2642 - system 2. s_waitcnt vmcnt(0)
2644 - Must happen before
2653 3. buffer_wbinvl1_vol
2655 - Must happen before
2665 atomicrmw acquire - agent - generic 1. flat_atomic
2666 - system 2. s_waitcnt vmcnt(0) &
2671 - Must happen before
2680 3. buffer_wbinvl1_vol
2682 - Must happen before
2692 fence acquire - singlethread *none* *none*
2694 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2699 - However, since LLVM
2724 fence-paired-atomic).
2725 - Must happen before
2736 fence-paired-atomic.
2738 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2745 - However, since LLVM
2753 - Could be split into
2762 - s_waitcnt vmcnt(0)
2773 fence-paired-atomic).
2774 - s_waitcnt lgkmcnt(0)
2785 fence-paired-atomic).
2786 - Must happen before
2800 fence-paired-atomic.
2802 2. buffer_wbinvl1_vol
2804 - Must happen before any
2805 following global/generic
2815 -----------------------------------------------------------------------------------
2816 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2819 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2828 - Must happen before
2839 2. buffer/global/flat_store
2840 store atomic release - workgroup - local 1. ds_store
2841 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2850 - Must happen before
2862 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2863 - system - generic vmcnt(0)
2867 - Could be split into
2876 - s_waitcnt vmcnt(0)
2883 - s_waitcnt lgkmcnt(0)
2890 - Must happen before
2901 2. buffer/global/ds/flat_store
2902 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2905 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2914 - Must happen before
2925 2. buffer/global/flat_atomic
2926 atomicrmw release - workgroup - local 1. ds_atomic
2927 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2936 - Must happen before
2948 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2949 - system - generic vmcnt(0)
2953 - Could be split into
2962 - s_waitcnt vmcnt(0)
2969 - s_waitcnt lgkmcnt(0)
2976 - Must happen before
2987 2. buffer/global/ds/flat_atomic
2988 fence release - singlethread *none* *none*
2990 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2995 - However, since LLVM
3016 - Must happen before
3025 fence-paired-atomic).
3032 fence-paired-atomic.
3034 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3045 - However, since LLVM
3060 - Could be split into
3069 - s_waitcnt vmcnt(0)
3076 - s_waitcnt lgkmcnt(0)
3083 - Must happen before
3092 fence-paired-atomic).
3099 fence-paired-atomic.
3101 **Acquire-Release Atomic**
3102 -----------------------------------------------------------------------------------
3103 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3106 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3115 - Must happen before
3126 2. buffer/global/flat_atomic
3127 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3128 2. s_waitcnt lgkmcnt(0)
3131 - Must happen before
3144 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3153 - Must happen before
3165 3. s_waitcnt lgkmcnt(0)
3168 - Must happen before
3181 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3186 - Could be split into
3195 - s_waitcnt vmcnt(0)
3202 - s_waitcnt lgkmcnt(0)
3209 - Must happen before
3220 2. buffer/global/flat_atomic
3221 3. s_waitcnt vmcnt(0)
3223 - Must happen before
3232 4. buffer_wbinvl1_vol
3234 - Must happen before
3244 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3249 - Could be split into
3258 - s_waitcnt vmcnt(0)
3265 - s_waitcnt lgkmcnt(0)
3272 - Must happen before
3284 3. s_waitcnt vmcnt(0) &
3289 - Must happen before
3298 4. buffer_wbinvl1_vol
3300 - Must happen before
3310 fence acq_rel - singlethread *none* *none*
3312 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3332 - Must happen before
3355 acquire-fence-paired-atomic
3376 release-fence-paired-atomic
3377 ). This satisfies the
3381 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3388 - However, since LLVM
3396 - Could be split into
3405 - s_waitcnt vmcnt(0)
3412 - s_waitcnt lgkmcnt(0)
3419 - Must happen before
3424 global/local/generic
3433 acquire-fence-paired-atomic
3445 global/local/generic
3454 release-fence-paired-atomic
3455 ). This satisfies the
3459 2. buffer_wbinvl1_vol
3461 - Must happen before
3475 **Sequential Consistent Atomic**
3476 -----------------------------------------------------------------------------------
3477 load atomic seq_cst - singlethread - global *Same as corresponding
3478 - wavefront - local load atomic acquire,
3479 - generic except must generated
3480 all instructions even
3482 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3497 lgkmcnt(0) and so do
3532 instructions same as
3535 except must generated
3536 all instructions even
3538 load atomic seq_cst - workgroup - local *Same as corresponding
3539 load atomic acquire,
3540 except must generated
3541 all instructions even
3543 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3544 - system - generic vmcnt(0)
3546 - Could be split into
3555 - waitcnt lgkmcnt(0)
3568 lgkmcnt(0) and so do
3619 instructions same as
3622 except must generated
3623 all instructions even
3625 store atomic seq_cst - singlethread - global *Same as corresponding
3626 - wavefront - local store atomic release,
3627 - workgroup - generic except must generated
3628 all instructions even
3630 store atomic seq_cst - agent - global *Same as corresponding
3631 - system - generic store atomic release,
3632 except must generated
3633 all instructions even
3635 atomicrmw seq_cst - singlethread - global *Same as corresponding
3636 - wavefront - local atomicrmw acq_rel,
3637 - workgroup - generic except must generated
3638 all instructions even
3640 atomicrmw seq_cst - agent - global *Same as corresponding
3641 - system - generic atomicrmw acq_rel,
3642 except must generated
3643 all instructions even
3645 fence seq_cst - singlethread *none* *Same as corresponding
3646 - wavefront fence acq_rel,
3647 - workgroup except must generated
3648 - agent all instructions even
3649 - system for OpenCL.*
3650 ============ ============ ============== ========== ===============================
3652 The memory order also adds the single thread optimization constrains defined in
3654 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3656 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3657 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3659 ============ ==============================================================
3660 LLVM Memory Optimization Constraints
3662 ============ ==============================================================
3665 acquire - If a load atomic/atomicrmw then no following load/load
3666 atomic/store/ store atomic/atomicrmw/fence instruction can
3667 be moved before the acquire.
3668 - If a fence then same as load atomic, plus no preceding
3669 associated fence-paired-atomic can be moved after the fence.
3670 release - If a store atomic/atomicrmw then no preceding load/load
3671 atomic/store/ store atomic/atomicrmw/fence instruction can
3672 be moved after the release.
3673 - If a fence then same as store atomic, plus no following
3674 associated fence-paired-atomic can be moved before the
3676 acq_rel Same constraints as both acquire and release.
3677 seq_cst - If a load atomic then same constraints as acquire, plus no
3678 preceding sequentially consistent load atomic/store
3679 atomic/atomicrmw/fence instruction can be moved after the
3681 - If a store atomic then the same constraints as release, plus
3682 no following sequentially consistent load atomic/store
3683 atomic/atomicrmw/fence instruction can be moved before the
3685 - If an atomicrmw/fence then same constraints as acq_rel.
3686 ============ ==============================================================
3691 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3692 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3693 the ``s_trap`` instruction with the following usage:
3695 .. table:: AMDGPU Trap Handler for AMDHSA OS
3696 :name: amdgpu-trap-handler-for-amdhsa-os-table
3698 =================== =============== =============== =======================
3699 Usage Code Sequence Trap Handler Description
3701 =================== =============== =============== =======================
3702 reserved ``s_trap 0x00`` Reserved by hardware.
3703 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3704 ``queue_ptr`` ``debugtrap``
3705 ``VGPR0``: intrinsic (not
3706 ``arg`` implemented).
3707 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3708 ``queue_ptr`` terminated and its
3709 associated queue put
3710 into the error state.
3711 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3712 ``queue_ptr`` installed handled
3713 same as ``llvm.trap``.
3714 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3716 debugger ``s_trap 0x08`` Reserved for debugger.
3717 debugger ``s_trap 0xfe`` Reserved for debugger.
3718 debugger ``s_trap 0xff`` Reserved for debugger.
3719 =================== =============== =============== =======================
3724 This section provides code conventions used when the target triple OS is
3725 empty (see :ref:`amdgpu-target-triples`).
3730 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3731 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3732 instructions are handled as follows:
3734 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3735 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3737 =============== =============== ===========================================
3738 Usage Code Sequence Description
3739 =============== =============== ===========================================
3740 llvm.trap s_endpgm Causes wavefront to be terminated.
3741 llvm.debugtrap *none* Compiler warning given that there is no
3742 trap handler installed.
3743 =============== =============== ===========================================
3753 When generating code for the OpenCL language the target triple environment
3754 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3756 When the language is OpenCL the following differences occur:
3758 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3759 2. The AMDGPU backend adds additional arguments to the kernel.
3760 3. Additional metadata is generated
3761 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3764 Specify what affect this has. Hidden arguments added. Additional metadata
3772 When generating code for the OpenCL language the target triple environment
3773 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3775 When the language is OpenCL the following differences occur:
3777 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3780 Specify what affect this has.
3785 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3786 It supports AMDGCN GFX6-GFX9.
3788 This section describes general syntax for instructions and operands. For more
3789 information about instructions, their semantics and supported combinations of
3790 operands, refer to one of instruction set architecture manuals
3791 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3793 An instruction has the following syntax (register operands are normally
3794 comma-separated while extra operands are space-separated):
3796 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3801 The following syntax for register operands is supported:
3803 * SGPR registers: s0, ... or s[0], ...
3804 * VGPR registers: v0, ... or v[0], ...
3805 * TTMP registers: ttmp0, ... or ttmp[0], ...
3806 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3807 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3808 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3809 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3810 * Register index expressions: v[2*2], s[1-1:2-1]
3811 * 'off' indicates that an operand is not enabled
3813 The following extra operands are supported:
3815 * offset, offset0, offset1
3817 * glc, slc, tfe bits
3818 * waitcnt: integer or combination of counter values
3821 - abs (\| \|), neg (\-)
3825 - row_shl, row_shr, row_ror, row_rol
3826 - row_mirror, row_half_mirror, row_bcast
3827 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3828 - row_mask, bank_mask, bound_ctrl
3832 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3833 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3836 Instruction Examples
3837 ~~~~~~~~~~~~~~~~~~~~
3842 .. code-block:: nasm
3844 ds_add_u32 v2, v4 offset:16
3845 ds_write_src2_b64 v2 offset0:4 offset1:8
3846 ds_cmpst_f32 v2, v4, v6
3847 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3850 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3855 .. code-block:: nasm
3857 flat_load_dword v1, v[3:4]
3858 flat_store_dwordx3 v[3:4], v[5:7]
3859 flat_atomic_swap v1, v[3:4], v5 glc
3860 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3861 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3863 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3868 .. code-block:: nasm
3870 buffer_load_dword v1, off, s[4:7], s1
3871 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3872 buffer_store_format_xy v[1:2], off, s[4:7], s1
3874 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3876 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3881 .. code-block:: nasm
3883 s_load_dword s1, s[2:3], 0xfc
3884 s_load_dwordx8 s[8:15], s[2:3], s4
3885 s_load_dwordx16 s[88:103], s[2:3], s4
3889 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3894 .. code-block:: nasm
3897 s_mov_b64 s[0:1], 0x80000000
3899 s_wqm_b64 s[2:3], s[4:5]
3900 s_bcnt0_i32_b64 s1, s[2:3]
3901 s_swappc_b64 s[2:3], s[4:5]
3902 s_cbranch_join s[4:5]
3904 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3909 .. code-block:: nasm
3911 s_add_u32 s1, s2, s3
3912 s_and_b64 s[2:3], s[4:5], s[6:7]
3913 s_cselect_b32 s1, s2, s3
3914 s_andn2_b32 s2, s4, s6
3915 s_lshr_b64 s[2:3], s[4:5], s6
3916 s_ashr_i32 s2, s4, s6
3917 s_bfm_b64 s[2:3], s4, s6
3918 s_bfe_i64 s[2:3], s[4:5], s6
3919 s_cbranch_g_fork s[4:5], s[6:7]
3921 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3926 .. code-block:: nasm
3929 s_bitcmp1_b32 s1, s2
3930 s_bitcmp0_b64 s[2:3], s4
3933 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3938 .. code-block:: nasm
3943 s_waitcnt 0 ; Wait for all counters to be 0
3944 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3945 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3949 s_sendmsg sendmsg(MSG_INTERRUPT)
3952 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3954 Unless otherwise mentioned, little verification is performed on the operands
3955 of SOPP Instructions, so it is up to the programmer to be familiar with the
3956 range or acceptable values.
3961 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3962 the assembler will automatically use optimal encoding based on its operands.
3963 To force specific encoding, one can add a suffix to the opcode of the instruction:
3965 * _e32 for 32-bit VOP1/VOP2/VOPC
3966 * _e64 for 64-bit VOP3
3968 * _sdwa for VOP_SDWA
3970 VOP1/VOP2/VOP3/VOPC examples:
3972 .. code-block:: nasm
3975 v_mov_b32_e32 v1, v2
3977 v_cvt_f64_i32_e32 v[1:2], v2
3978 v_floor_f32_e32 v1, v2
3979 v_bfrev_b32_e32 v1, v2
3980 v_add_f32_e32 v1, v2, v3
3981 v_mul_i32_i24_e64 v1, v2, 3
3982 v_mul_i32_i24_e32 v1, -3, v3
3983 v_mul_i32_i24_e32 v1, -100, v3
3984 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3985 v_max_f16_e32 v1, v2, v3
3989 .. code-block:: nasm
3991 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3992 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3993 v_mov_b32 v0, v0 wave_shl:1
3994 v_mov_b32 v0, v0 row_mirror
3995 v_mov_b32 v0, v0 row_bcast:31
3996 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3997 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3998 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4002 .. code-block:: nasm
4004 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4005 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4006 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4007 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4008 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4010 For full list of supported instructions, refer to "Vector ALU instructions".
4012 HSA Code Object Directives
4013 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4015 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4016 one can specify them with assembler directives.
4018 .hsa_code_object_version major, minor
4019 +++++++++++++++++++++++++++++++++++++
4021 *major* and *minor* are integers that specify the version of the HSA code
4022 object that will be generated by the assembler.
4024 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4025 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4028 *major*, *minor*, and *stepping* are all integers that describe the instruction
4029 set architecture (ISA) version of the assembly program.
4031 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
4032 "AMD" and *arch* should always be equal to "AMDGPU".
4034 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4035 from the value of the -mcpu option that is passed to the assembler.
4037 .amdgpu_hsa_kernel (name)
4038 +++++++++++++++++++++++++
4040 This directives specifies that the symbol with given name is a kernel entry point
4041 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4046 This directive marks the beginning of a list of key / value pairs that are used
4047 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4048 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4049 any amd_kernel_code_t values that are unspecified a default value will be
4050 used. The default value for all keys is 0, with the following exceptions:
4052 - *kernel_code_version_major* defaults to 1.
4053 - *machine_kind* defaults to 1.
4054 - *machine_version_major*, *machine_version_minor*, and
4055 *machine_version_stepping* are derived from the value of the -mcpu option
4056 that is passed to the assembler.
4057 - *kernel_code_entry_byte_offset* defaults to 256.
4058 - *wavefront_size* defaults to 6.
4059 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4060 *private_segment_alignment* default to 4. Note that alignments are specified
4061 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4063 The *.amd_kernel_code_t* directive must be placed immediately after the
4064 function label and before any instructions.
4066 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4067 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4069 Here is an example of a minimal amd_kernel_code_t specification:
4071 .. code-block:: none
4073 .hsa_code_object_version 1,0
4074 .hsa_code_object_isa
4079 .amdgpu_hsa_kernel hello_world
4084 enable_sgpr_kernarg_segment_ptr = 1
4086 compute_pgm_rsrc1_vgprs = 0
4087 compute_pgm_rsrc1_sgprs = 0
4088 compute_pgm_rsrc2_user_sgpr = 2
4089 kernarg_segment_byte_size = 8
4090 wavefront_sgpr_count = 2
4091 workitem_vgpr_count = 3
4092 .end_amd_kernel_code_t
4094 s_load_dwordx2 s[0:1], s[0:1] 0x0
4095 v_mov_b32 v0, 3.14159
4096 s_waitcnt lgkmcnt(0)
4099 flat_store_dword v[1:2], v0
4102 .size hello_world, .Lfunc_end0-hello_world
4104 Additional Documentation
4105 ========================
4107 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4108 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4109 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4110 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4111 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4112 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4113 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4114 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4115 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4116 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4117 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4118 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4119 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4120 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4121 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4122 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__