1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
39 ============ ======== ========= ===========
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
46 Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
49 ``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
53 ``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
58 ``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
62 ``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
66 ``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
71 .. _amdgpu-processors:
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
86 ========== =========== ============ ===== ======= ==================
88 --------------------------------------------------------------------
94 --------------------------------------------------------------------
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
113 gfx600 - SI amdgcn dGPU
115 gfx601 - pitcairn amdgcn dGPU
119 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
120 --------------------------------------------------------------------
121 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
125 \ - kaveri amdgcn APU - A6-7000
135 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
139 gfx702 dGPU ROCm - Radeon R9 290
143 gfx703 - kabini amdgcn APU - E1-2100
152 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
153 --------------------------------------------------------------------
154 gfx800 - iceland amdgcn dGPU - FirePro S7150
162 gfx801 - carrizo amdgcn APU - A6-8500P
168 \ amdgcn APU ROCm - A10-8700P
171 \ amdgcn APU - A10-9600P
177 \ amdgcn APU - E2-9010
180 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
181 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
188 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
189 gfx804 amdgcn dGPU Same as gfx803
190 gfx810 - stoney amdgcn APU
192 --------------------------------------------------------------------
193 gfx900 amdgcn dGPU - FirePro W9500
196 gfx901 amdgcn dGPU ROCm Same as gfx900
199 gfx902 amdgcn APU *TBA*
204 gfx903 amdgcn APU Same as gfx902
207 ========== =========== ============ ===== ======= ==================
209 .. _amdgpu-address-spaces:
214 The AMDGPU backend uses the following address space mappings.
216 The memory space names used in the table, aside from the region memory space, is
217 from the OpenCL standard.
219 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
221 .. table:: Address Space Mapping
222 :name: amdgpu-address-space-mapping-table
224 ================== ================= ================= ================= =================
225 LLVM Address Space Memory Space
226 ------------------ -----------------------------------------------------------------------
227 \ Current Default amdgiz/amdgizcl hcc Future Default
228 ================== ================= ================= ================= =================
229 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
230 1 Global Global Global Global
231 2 Constant Constant Constant Region (GDS)
232 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
233 4 Generic (Flat) Region (GDS) Region (GDS) Constant
234 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
235 ================== ================= ================= ================= =================
238 This is the current default address space mapping used for all languages
239 except hcc. This will shortly be deprecated.
242 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
243 is specified as the target triple environment value.
246 This is the current address space mapping used when ``hcc`` is specified as
247 the target triple environment value.This will shortly be deprecated.
250 This will shortly be the only address space mapping for all languages using
253 .. _amdgpu-memory-scopes:
258 This section provides LLVM memory synchronization scopes supported by the AMDGPU
259 backend memory model when the target triple OS is ``amdhsa`` (see
260 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
262 The memory model supported is based on the HSA memory model [HSA]_ which is
263 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
264 relation is transitive over the synchonizes-with relation independent of scope,
265 and synchonizes-with allows the memory scope instances to be inclusive (see
266 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
268 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
269 inclusion and requires the memory scopes to exactly match. However, this
270 is conservatively correct for OpenCL.
272 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
273 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
275 ================ ==========================================================
276 LLVM Sync Scope Description
277 ================ ==========================================================
278 *none* The default: ``system``.
280 Synchronizes with, and participates in modification and
281 seq_cst total orderings with, other operations (except
282 image operations) for all address spaces (except private,
283 or generic that accesses private) provided the other
284 operation's sync scope is:
287 - ``agent`` and executed by a thread on the same agent.
288 - ``workgroup`` and executed by a thread in the same
290 - ``wavefront`` and executed by a thread in the same
293 ``agent`` Synchronizes with, and participates in modification and
294 seq_cst total orderings with, other operations (except
295 image operations) for all address spaces (except private,
296 or generic that accesses private) provided the other
297 operation's sync scope is:
299 - ``system`` or ``agent`` and executed by a thread on the
301 - ``workgroup`` and executed by a thread in the same
303 - ``wavefront`` and executed by a thread in the same
306 ``workgroup`` Synchronizes with, and participates in modification and
307 seq_cst total orderings with, other operations (except
308 image operations) for all address spaces (except private,
309 or generic that accesses private) provided the other
310 operation's sync scope is:
312 - ``system``, ``agent`` or ``workgroup`` and executed by a
313 thread in the same workgroup.
314 - ``wavefront`` and executed by a thread in the same
317 ``wavefront`` Synchronizes with, and participates in modification and
318 seq_cst total orderings with, other operations (except
319 image operations) for all address spaces (except private,
320 or generic that accesses private) provided the other
321 operation's sync scope is:
323 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
324 and executed by a thread in the same wavefront.
326 ``singlethread`` Only synchronizes with, and participates in modification
327 and seq_cst total orderings with, other operations (except
328 image operations) running in the same thread for all
329 address spaces (for example, in signal handlers).
330 ================ ==========================================================
335 The AMDGPU backend implements the following intrinsics.
337 *This section is WIP.*
340 List AMDGPU intrinsics
345 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
346 can be linked by ``lld`` to produce a standard ELF shared code object which can
347 be loaded and executed on an AMDGPU target.
352 The AMDGPU backend uses the following ELF header:
354 .. table:: AMDGPU ELF Header
355 :name: amdgpu-elf-header-table
357 ========================== =========================
359 ========================== =========================
360 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
361 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
362 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
363 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
364 ``e_type`` ``ET_REL`` or ``ET_DYN``
365 ``e_machine`` ``EM_AMDGPU``
368 ========================== =========================
372 .. table:: AMDGPU ELF Header Enumeration Values
373 :name: amdgpu-elf-header-enumeration-values-table
375 ============================ =====
377 ============================ =====
379 ``ELFOSABI_AMDGPU_HSA`` 64
380 ``ELFABIVERSION_AMDGPU_HSA`` 1
381 ============================ =====
383 ``e_ident[EI_CLASS]``
384 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
388 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
390 ``e_ident[EI_OSABI]``
391 The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
392 specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
394 ``e_ident[EI_ABIVERSION]``
395 The AMD GPU architecture specific OS ABI version of
396 ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
397 ABI to which the code object conforms.
400 Can be one of the following values:
404 The type produced by the AMD GPU backend compiler as it is relocatable code
408 The type produced by the linker as it is a shared code object.
410 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
413 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
414 architecture family. The specific member is specified in the
415 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
416 :ref:`amdgpu-note-records`).
419 The entry point is 0 as the entry points for individual kernels must be
420 selected in order to invoke them through AQL packets.
423 The value is 0 as no flags are used.
428 An AMDGPU target ELF code object has the standard ELF sections which include:
430 .. table:: AMDGPU ELF Sections
431 :name: amdgpu-elf-sections-table
433 ================== ================ =================================
435 ================== ================ =================================
436 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
437 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
438 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
439 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
440 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
441 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
442 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
443 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
444 ``.note`` ``SHT_NOTE`` *none*
445 ``.rela``\ *name* ``SHT_RELA`` *none*
446 ``.rela.dyn`` ``SHT_RELA`` *none*
447 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
448 ``.shstrtab`` ``SHT_STRTAB`` *none*
449 ``.strtab`` ``SHT_STRTAB`` *none*
450 ``.symtab`` ``SHT_SYMTAB`` *none*
451 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
452 ================== ================ =================================
454 These sections have their standard meanings (see [ELF]_) and are only generated
458 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
459 DWARF produced by the AMDGPU backend.
461 ``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
462 The standard sections used by a dynamic loader.
465 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
468 ``.rela``\ *name*, ``.rela.dyn``
469 For relocatable code objects, *name* is the name of the section that the
470 relocation records apply. For example, ``.rela.text`` is the section name for
471 relocation records associated with the ``.text`` section.
473 For linked shared code objects, ``.rela.dyn`` contains all the relocation
474 records from each of the relocatable code object's ``.rela``\ *name* sections.
476 See :ref:`amdgpu-relocation-records` for the relocation records supported by
480 The executable machine code for the kernels and functions they call. Generated
481 as position independent code. See :ref:`amdgpu-code-conventions` for
482 information on conventions used in the isa generation.
484 .. _amdgpu-note-records:
489 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
490 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
491 minimal zero byte padding must be generated to ensure the ``desc`` field size is
492 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
493 be at least 4 to indicate at least 8 byte alignment.
495 The AMDGPU backend code object uses the following ELF note records in the
496 ``.note`` section. The *Description* column specifies the layout of the note
497 record’s ``desc`` field. All fields are consecutive bytes. Note records with
498 variable size strings have a corresponding ``*_size`` field that specifies the
499 number of bytes, including the terminating null character, in the string. The
500 string(s) come immediately after the preceding fields.
502 Additional note records can be present.
504 .. table:: AMDGPU ELF Note Records
505 :name: amdgpu-elf-note-records-table
507 ===== ========================== ==========================================
508 Name Type Description
509 ===== ========================== ==========================================
510 "AMD" ``NT_AMD_AMDGPU_METADATA`` <metadata null terminated string>
511 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
512 ===== ========================== ==========================================
516 .. table:: AMDGPU ELF Note Record Enumeration Values
517 :name: amdgpu-elf-note-record-enumeration-values-table
519 ============================= =====
521 ============================= =====
523 ``NT_AMD_AMDGPU_METADATA`` 10
524 ``NT_AMD_AMDGPU_ISA`` 11
525 ============================= =====
527 ``NT_AMD_AMDGPU_ISA``
528 Specifies the instruction set architecture used by the machine code contained
531 This note record is required for code objects containing machine code for
532 processors matching the ``amdgcn`` architecture in table
533 :ref:`amdgpu-processors`.
535 The null terminated string has the following syntax:
537 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
542 The architecture from table :ref:`amdgpu-target-triples-table`.
544 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
545 :ref:`amdgpu-target-triples`).
548 The vendor from table :ref:`amdgpu-target-triples-table`.
550 For the AMDGPU backend this is always ``amd``.
553 The OS from table :ref:`amdgpu-target-triples-table`.
556 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
557 the environment has no affect on the execution of the code object.
559 For the AMDGPU backend this is currently always blank.
561 The processor from table :ref:`amdgpu-processors-table`.
565 ``amdgcn-amd-amdhsa--gfx901``
567 ``NT_AMD_AMDGPU_METADATA``
568 Specifies extensible metadata associated with the code object. See
569 :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
572 This note record is required and must contain the minimum information
573 necessary to support the ROCM kernel queries. For example, the segment sizes
574 needed in a dispatch packet. In addition, a high level language runtime may
575 require other information to be included. For example, the AMD OpenCL runtime
576 records kernel argument information.
579 Is the string null terminated? It probably should not if YAML allows it to
580 contain null characters, otherwise it should be.
582 .. _amdgpu-code-object-metadata:
587 The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
588 record (see :ref:`amdgpu-note-records`).
590 The metadata is specified as a YAML formatted string (see [YAML]_ and
593 The metadata is represented as a single YAML document comprised of the mapping
594 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
597 For boolean values, the string values of ``false`` and ``true`` are used for
598 false and true respectively.
600 Additional information can be added to the mappings. To avoid conflicts, any
601 non-AMD key names should be prefixed by "*vendor-name*.".
603 .. table:: AMDHSA Code Object Metadata Mapping
604 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
606 ========== ============== ========= =======================================
607 String Key Value Type Required? Description
608 ========== ============== ========= =======================================
609 "Version" sequence of Required - The first integer is the major
610 2 integers version. Currently 1.
611 - The second integer is the minor
612 version. Currently 0.
613 "Printf" sequence of Each string is encoded information
614 strings about a printf function call. The
615 encoded information is organized as
616 fields separated by colon (':'):
618 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
623 A 32 bit integer as a unique id for
624 each printf function call
627 A 32 bit integer equal to the number
628 of arguments of printf function call
631 ``S[i]`` (where i = 0, 1, ... , N-1)
632 32 bit integers for the size in bytes
633 of the i-th FormatString argument of
634 the printf function call
637 The format string passed to the
638 printf function call.
639 "Kernels" sequence of Required Sequence of the mappings for each
640 mapping kernel in the code object. See
641 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
642 for the definition of the mapping.
643 ========== ============== ========= =======================================
647 .. table:: AMDHSA Code Object Kernel Metadata Mapping
648 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
650 ================= ============== ========= ================================
651 String Key Value Type Required? Description
652 ================= ============== ========= ================================
653 "Name" string Required Source name of the kernel.
654 "SymbolName" string Required Name of the kernel
655 descriptor ELF symbol.
656 "Language" string Source language of the kernel.
664 "LanguageVersion" sequence of - The first integer is the major
666 - The second integer is the
668 "Attrs" mapping Mapping of kernel attributes.
670 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
671 for the mapping definition.
672 "Arguments" sequence of Sequence of mappings of the
673 mapping kernel arguments. See
674 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
675 for the definition of the mapping.
676 "CodeProps" mapping Mapping of properties related to
678 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
679 for the mapping definition.
680 "DebugProps" mapping Mapping of properties related to
681 the kernel debugging. See
682 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
683 for the mapping definition.
684 ================= ============== ========= ================================
688 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
689 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
691 =================== ============== ========= ==============================
692 String Key Value Type Required? Description
693 =================== ============== ========= ==============================
694 "ReqdWorkGroupSize" sequence of The dispatch work-group size
695 3 integers X, Y, Z must correspond to the
698 Corresponds to the OpenCL
699 ``reqd_work_group_size``
701 "WorkGroupSizeHint" sequence of The dispatch work-group size
702 3 integers X, Y, Z is likely to be the
705 Corresponds to the OpenCL
706 ``work_group_size_hint``
708 "VecTypeHint" string The name of a scalar or vector
711 Corresponds to the OpenCL
712 ``vec_type_hint`` attribute.
713 =================== ============== ========= ==============================
717 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
718 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
720 ================= ============== ========= ================================
721 String Key Value Type Required? Description
722 ================= ============== ========= ================================
723 "Name" string Kernel argument name.
724 "TypeName" string Kernel argument type name.
725 "Size" integer Required Kernel argument size in bytes.
726 "Align" integer Required Kernel argument alignment in
727 bytes. Must be a power of two.
728 "ValueKind" string Required Kernel argument kind that
729 specifies how to set up the
730 corresponding argument.
734 The argument is copied
735 directly into the kernarg.
738 A global address space pointer
739 to the buffer data is passed
742 "DynamicSharedPointer"
743 A group address space pointer
744 to dynamically allocated LDS
745 is passed in the kernarg.
748 A global address space
749 pointer to a S# is passed in
753 A global address space
754 pointer to a T# is passed in
758 A global address space pointer
759 to an OpenCL pipe is passed in
763 A global address space pointer
764 to an OpenCL device enqueue
765 queue is passed in the
768 "HiddenGlobalOffsetX"
769 The OpenCL grid dispatch
770 global offset for the X
771 dimension is passed in the
774 "HiddenGlobalOffsetY"
775 The OpenCL grid dispatch
776 global offset for the Y
777 dimension is passed in the
780 "HiddenGlobalOffsetZ"
781 The OpenCL grid dispatch
782 global offset for the Z
783 dimension is passed in the
787 An argument that is not used
788 by the kernel. Space needs to
789 be left for it, but it does
790 not need to be set up.
793 A global address space pointer
794 to the runtime printf buffer
795 is passed in kernarg.
798 A global address space pointer
799 to the OpenCL device enqueue
800 queue that should be used by
801 the kernel by default is
802 passed in the kernarg.
804 "HiddenCompletionAction"
810 "ValueType" string Required Kernel argument value type. Only
811 present if "ValueKind" is
812 "ByValue". For vector data
813 types, the value is for the
814 element type. Values include:
830 How can it be determined if a
831 vector type, and what size
833 "PointeeAlign" integer Alignment in bytes of pointee
834 type for pointer type kernel
835 argument. Must be a power
836 of 2. Only present if
838 "DynamicSharedPointer".
839 "AddrSpaceQual" string Kernel argument address space
840 qualifier. Only present if
841 "ValueKind" is "GlobalBuffer" or
842 "DynamicSharedPointer". Values
853 Is GlobalBuffer only Global
855 DynamicSharedPointer always
856 Local? Can HCC allow Generic?
857 How can Private or Region
859 "AccQual" string Kernel argument access
860 qualifier. Only present if
861 "ValueKind" is "Image" or
872 "ActualAcc" string The actual memory accesses
873 performed by the kernel on the
874 kernel argument. Only present if
875 "ValueKind" is "GlobalBuffer",
876 "Image", or "Pipe". This may be
877 more restrictive than indicated
878 by "AccQual" to reflect what the
879 kernel actual does. If not
880 present then the runtime must
881 assume what is implied by
882 "AccQual" and "IsConst". Values
889 "IsConst" boolean Indicates if the kernel argument
890 is const qualified. Only present
894 "IsRestrict" boolean Indicates if the kernel argument
895 is restrict qualified. Only
896 present if "ValueKind" is
899 "IsVolatile" boolean Indicates if the kernel argument
900 is volatile qualified. Only
901 present if "ValueKind" is
904 "IsPipe" boolean Indicates if the kernel argument
905 is pipe qualified. Only present
906 if "ValueKind" is "Pipe".
909 Can GlobalBuffer be pipe
911 ================= ============== ========= ================================
915 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
916 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
918 ============================ ============== ========= =====================
919 String Key Value Type Required? Description
920 ============================ ============== ========= =====================
921 "KernargSegmentSize" integer Required The size in bytes of
923 that holds the values
926 "GroupSegmentFixedSize" integer Required The amount of group
932 dynamically allocated
937 "PrivateSegmentFixedSize" integer Required The amount of fixed
938 private address space
939 memory required for a
945 to this value for the
947 "KernargSegmentAlign" integer Required The maximum byte
950 kernarg segment. Must
952 "WavefrontSize" integer Required Wavefront size. Must
954 "NumSGPRs" integer Number of scalar
970 "NumVGPRs" integer Number of vector
974 "MaxFlatWorkgroupSize" integer Maximum flat
977 kernel in work-items.
978 "IsDynamicCallStack" boolean Indicates if the
983 "IsXNACKEnabled" boolean Indicates if the
987 ============================ ============== ========= =====================
991 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
992 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
994 =================================== ============== ========= ==============
995 String Key Value Type Required? Description
996 =================================== ============== ========= ==============
997 "DebuggerABIVersion" string
998 "ReservedNumVGPRs" integer
999 "ReservedFirstVGPR" integer
1000 "PrivateSegmentBufferSGPR" integer
1001 "WavefrontPrivateSegmentOffsetSGPR" integer
1002 =================================== ============== ========= ==============
1005 Plan to remove the debug properties metadata.
1012 Symbols include the following:
1014 .. table:: AMDGPU ELF Symbols
1015 :name: amdgpu-elf-symbols-table
1017 ===================== ============== ============= ==================
1018 Name Type Section Description
1019 ===================== ============== ============= ==================
1020 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1023 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1024 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1025 ===================== ============== ============= ==================
1028 Global variables both used and defined by the compilation unit.
1030 If the symbol is defined in the compilation unit then it is allocated in the
1031 appropriate section according to if it has initialized data or is readonly.
1033 If the symbol is external then its section is ``STN_UNDEF`` and the loader
1034 will resolve relocations using the definition provided by another code object
1035 or explicitly defined by the runtime.
1037 All global symbols, whether defined in the compilation unit or external, are
1038 accessed by the machine code indirectly through a GOT table entry. This
1039 allows them to be preemptable. The GOT table is only supported when the target
1040 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1043 Add description of linked shared object symbols. Seems undefined symbols
1044 are marked as STT_NOTYPE.
1047 Every HSA kernel has an associated kernel descriptor. It is the address of the
1048 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1049 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1050 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1053 Every HSA kernel also has a symbol for its machine code entry point.
1055 .. _amdgpu-relocation-records:
1060 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1061 relocatable fields are:
1064 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1065 alignment. These values use the same byte order as other word values in the
1066 AMD GPU architecture.
1069 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1070 alignment. These values use the same byte order as other word values in the
1071 AMD GPU architecture.
1073 Following notations are used for specifying relocation calculations:
1076 Represents the addend used to compute the value of the relocatable field.
1079 Represents the offset into the global offset table at which the relocation
1080 entry’s symbol will reside during execution.
1083 Represents the address of the global offset table.
1086 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1087 of the storage unit being relocated (computed using ``r_offset``).
1090 Represents the value of the symbol whose index resides in the relocation
1093 The following relocation types are supported:
1095 .. table:: AMDGPU ELF Relocation Records
1096 :name: amdgpu-elf-relocation-records-table
1098 ========================== ===== ========== ==============================
1099 Relocation Type Value Field Calculation
1100 ========================== ===== ========== ==============================
1101 ``R_AMDGPU_NONE`` 0 *none* *none*
1102 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
1103 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
1104 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
1105 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
1106 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
1107 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
1108 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
1109 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1110 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
1111 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1112 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
1113 ========================== ===== ========== ==============================
1120 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
1121 information that maps the code object executable code and data to the source
1122 language constructs. It can be used by tools such as debuggers and profilers.
1124 Address Space Mapping
1125 ~~~~~~~~~~~~~~~~~~~~~
1127 The following address space mapping is used:
1129 .. table:: AMDGPU DWARF Address Space Mapping
1130 :name: amdgpu-dwarf-address-space-mapping-table
1132 =================== =================
1133 DWARF Address Space Memory Space
1134 =================== =================
1139 *omitted* Generic (Flat)
1140 *not supported* Region (GDS)
1141 =================== =================
1143 See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
1146 An ``address_class`` attribute is generated on pointer type DIEs to specify the
1147 DWARF address space of the value of the pointer when it is in the *private* or
1148 *local* address space. Otherwise the attribute is omitted.
1150 An ``XDEREF`` operation is generated in location list expressions for variables
1151 that are allocated in the *private* and *local* address space. Otherwise no
1152 ``XDREF`` is omitted.
1157 *This section is WIP.*
1160 Define DWARF register enumeration.
1162 If want to present a wavefront state then should expose vector registers as
1163 64 wide (rather than per work-item view that LLVM uses). Either as separate
1164 registers, or a 64x4 byte single register. In either case use a new LANE op
1165 (akin to XDREF) to select the current lane usage in a location
1166 expression. This would also allow scalar register spilling to vector register
1167 lanes to be expressed (currently no debug information is being generated for
1168 spilling). If choose a wide single register approach then use LANE in
1169 conjunction with PIECE operation to select the dword part of the register for
1170 the current lane. If the separate register approach then use LANE to select
1176 *This section is WIP.*
1179 DWARF extension to include runtime generated source text.
1181 .. _amdgpu-code-conventions:
1189 This section provides code conventions used when the target triple OS is
1190 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1195 The HSA architected queuing language (AQL) defines a user space memory interface
1196 that can be used to control the dispatch of kernels, in an agent independent
1197 way. An agent can have zero or more AQL queues created for it using the ROCm
1198 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1199 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1200 mechanics and packet layouts.
1202 The packet processor of a kernel agent is responsible for detecting and
1203 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1204 packet processor is implemented by the hardware command processor (CP),
1205 asynchronous dispatch controller (ADC) and shader processor input controller
1208 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1209 mode driver to initialize and register the AQL queue with CP.
1211 To dispatch a kernel the following actions are performed. This can occur in the
1212 CPU host program, or from an HSA kernel executing on a GPU.
1214 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1215 executed is obtained.
1216 2. A pointer to the kernel descriptor (see
1217 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1218 obtained. It must be for a kernel that is contained in a code object that that
1219 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1221 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1222 for a memory region with the kernarg property for the kernel agent that will
1223 execute the kernel. It must be at least 16 byte aligned.
1224 4. Kernel argument values are assigned to the kernel argument memory
1225 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1226 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1227 memory in the same way constant memory is accessed. (Note that the HSA
1228 specification allows an implementation to copy the kernel argument contents to
1229 another location that is accessed by the kernel.)
1230 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1231 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1232 packet. The packet must be set up, and the final write must use an atomic
1233 store release to set the packet kind to ensure the packet contents are
1234 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1235 notify the kernel agent that the AQL queue has been updated. These rules, and
1236 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1237 System Architecture Specification* [HSA]_.
1238 6. A kernel dispatch packet includes information about the actual dispatch,
1239 such as grid and work-group size, together with information from the code
1240 object about the kernel, such as segment sizes. The ROCm runtime queries on
1241 the kernel symbol can be used to obtain the code object values which are
1242 recorded in the :ref:`amdgpu-code-object-metadata`.
1243 7. CP executes micro-code and is responsible for detecting and setting up the
1244 GPU to execute the wavefronts of a kernel dispatch.
1245 8. CP ensures that when the a wavefront starts executing the kernel machine
1246 code, the scalar general purpose registers (SGPR) and vector general purpose
1247 registers (VGPR) are set up as required by the machine code. The required
1248 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1249 register state is defined in
1250 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1251 9. The prolog of the kernel machine code (see
1252 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1253 before continuing executing the machine code that corresponds to the kernel.
1254 10. When the kernel dispatch has completed execution, CP signals the completion
1255 signal specified in the kernel dispatch packet if not 0.
1257 .. _amdgpu-amdhsa-memory-spaces:
1262 The memory space properties are:
1264 .. table:: AMDHSA Memory Spaces
1265 :name: amdgpu-amdhsa-memory-spaces-table
1267 ================= =========== ======== ======= ==================
1268 Memory Space Name HSA Segment Hardware Address NULL Value
1270 ================= =========== ======== ======= ==================
1271 Private private scratch 32 0x00000000
1272 Local group LDS 32 0xFFFFFFFF
1273 Global global global 64 0x0000000000000000
1274 Constant constant *same as 64 0x0000000000000000
1276 Generic flat flat 64 0x0000000000000000
1277 Region N/A GDS 32 *not implemented
1279 ================= =========== ======== ======= ==================
1281 The global and constant memory spaces both use global virtual addresses, which
1282 are the same virtual address space used by the CPU. However, some virtual
1283 addresses may only be accessible to the CPU, some only accessible by the GPU,
1286 Using the constant memory space indicates that the data will not change during
1287 the execution of the kernel. This allows scalar read instructions to be
1288 used. The vector and scalar L1 caches are invalidated of volatile data before
1289 each kernel dispatch execution to allow constant memory to change values between
1292 The local memory space uses the hardware Local Data Store (LDS) which is
1293 automatically allocated when the hardware creates work-groups of wavefronts, and
1294 freed when all the wavefronts of a work-group have terminated. The data store
1295 (DS) instructions can be used to access it.
1297 The private memory space uses the hardware scratch memory support. If the kernel
1298 uses scratch, then the hardware allocates memory that is accessed using
1299 wavefront lane dword (4 byte) interleaving. The mapping used from private
1300 address to physical address is:
1302 ``wavefront-scratch-base +
1303 (private-address * wavefront-size * 4) +
1304 (wavefront-lane-id * 4)``
1306 There are different ways that the wavefront scratch base address is determined
1307 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1308 memory can be accessed in an interleaved manner using buffer instruction with
1309 the scratch buffer descriptor and per wave scratch offset, by the scratch
1310 instructions, or by flat instructions. If each lane of a wavefront accesses the
1311 same private address, the interleaving results in adjacent dwords being accessed
1312 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1313 supported except by flat and scratch instructions in GFX9.
1315 The generic address space uses the hardware flat address support available in
1316 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1317 local appertures), that are outside the range of addressible global memory, to
1318 map from a flat address to a private or local address.
1320 FLAT instructions can take a flat address and access global, private (scratch)
1321 and group (LDS) memory depending in if the address is within one of the
1322 apperture ranges. Flat access to scratch requires hardware aperture setup and
1323 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1324 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1325 (see :ref:`amdgpu-amdhsa-m0`).
1327 To convert between a segment address and a flat address the base address of the
1328 appertures address can be used. For GFX7-GFX8 these are available in the
1329 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1330 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1331 GFX9 the appature base addresses are directly available as inline constant
1332 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1333 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1334 which makes it easier to convert from flat to segment or segment to flat.
1336 HSA Image and Samplers
1337 ~~~~~~~~~~~~~~~~~~~~~~
1339 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1340 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1341 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1342 enumeration values for the queries that are not trivially deducible from the S#
1348 Signal handles created by the ROCm runtime are 64 bit addresses of a structure
1349 allocated in memory accessible from both the CPU and GPU. The structure is
1350 defined by the ROCm runtime and subject to change between releases (see
1351 [AMD-ROCm-github]_).
1353 .. _amdgpu-amdhsa-hsa-aql-queue:
1358 The AQL queue structure is defined by the ROCm runtime and subject to change
1359 between releases (see [AMD-ROCm-github]_). For some processors it contains
1360 fields needed to implement certain language features such as the flat address
1361 aperture bases. It also contains fields used by CP such as managing the
1362 allocation of scratch memory.
1364 .. _amdgpu-amdhsa-kernel-descriptor:
1369 A kernel descriptor consists of the information needed by CP to initiate the
1370 execution of a kernel, including the entry point address of the machine code
1371 that implements the kernel.
1373 Kernel Descriptor for GFX6-GFX9
1374 +++++++++++++++++++++++++++++++
1376 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1378 .. table:: Kernel Descriptor for GFX6-GFX9
1379 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1381 ======= ======= =============================== ===========================
1382 Bits Size Field Name Description
1383 ======= ======= =============================== ===========================
1384 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1385 address space memory
1386 required for a work-group
1387 in bytes. This does not
1388 include any dynamically
1389 allocated local address
1390 space memory that may be
1391 added when the kernel is
1393 63:32 4 bytes private_segment_fixed_size The amount of fixed
1394 private address space
1395 memory required for a
1396 work-item in bytes. If
1397 is_dynamic_callstack is 1
1398 then additional space must
1399 be added to this value for
1401 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1402 size supported by the
1403 kernel in work-items.
1404 96 1 bit is_dynamic_call_stack Indicates if the generated
1405 machine code is using a
1406 dynamically sized call
1408 97 1 bit is_xnack_enabled Indicates if the generated
1409 machine code is capable of
1411 127:98 30 bits Reserved. Must be 0.
1412 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1415 descriptor to kernel's
1416 entry point instruction
1417 which must be 256 byte
1419 383:192 24 Reserved. Must be 0.
1421 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1422 program settings used by
1424 ``COMPUTE_PGM_RSRC1``
1427 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1428 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1429 program settings used by
1431 ``COMPUTE_PGM_RSRC2``
1434 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1435 448 1 bit enable_sgpr_private_segment Enable the setup of the
1436 _buffer SGPR user data registers
1438 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1440 The total number of SGPR
1442 requested must not exceed
1443 16 and match value in
1444 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1445 Any requests beyond 16
1447 449 1 bit enable_sgpr_dispatch_ptr *see above*
1448 450 1 bit enable_sgpr_queue_ptr *see above*
1449 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1450 452 1 bit enable_sgpr_dispatch_id *see above*
1451 453 1 bit enable_sgpr_flat_scratch_init *see above*
1452 454 1 bit enable_sgpr_private_segment *see above*
1454 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1455 _count_X should always be 0.
1456 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1457 _count_Y should always be 0.
1458 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1459 _count_Z should always be 0.
1460 463:458 6 bits Reserved. Must be 0.
1461 511:464 4 Reserved. Must be 0.
1463 512 **Total size 64 bytes.**
1464 ======= ===================================================================
1468 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1469 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1471 ======= ======= =============================== ===========================================================================
1472 Bits Size Field Name Description
1473 ======= ======= =============================== ===========================================================================
1474 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1475 used by each work-item,
1476 granularity is device
1480 roundup((max-vgpg + 1)
1483 Used by CP to set up
1484 ``COMPUTE_PGM_RSRC1.VGPRS``.
1485 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1486 used by a wavefront,
1487 granularity is device
1491 roundup((max-sgpg + 1)
1494 roundup((max-sgpg + 1)
1497 Includes the special SGPRs
1498 for VCC, Flat Scratch (for
1499 GFX7 onwards) and XNACK
1500 (for GFX8 onwards). It does
1501 not include the 16 SGPR
1502 added if a trap handler is
1505 Used by CP to set up
1506 ``COMPUTE_PGM_RSRC1.SGPRS``.
1507 11:10 2 bits priority Must be 0.
1509 Start executing wavefront
1510 at the specified priority.
1512 CP is responsible for
1514 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1515 13:12 2 bits float_mode_round_32 Wavefront starts execution
1516 with specified rounding
1519 precision floating point
1522 Floating point rounding
1523 mode values are defined in
1524 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1526 Used by CP to set up
1527 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1528 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1529 with specified rounding
1530 denorm mode for half/double (16
1531 and 64 bit) floating point
1532 precision floating point
1535 Floating point rounding
1536 mode values are defined in
1537 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1539 Used by CP to set up
1540 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1541 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1542 with specified denorm mode
1545 precision floating point
1548 Floating point denorm mode
1549 values are defined in
1550 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1552 Used by CP to set up
1553 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1554 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1555 with specified denorm mode
1557 and 64 bit) floating point
1558 precision floating point
1561 Floating point denorm mode
1562 values are defined in
1563 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1565 Used by CP to set up
1566 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1567 20 1 bit priv Must be 0.
1569 Start executing wavefront
1570 in privilege trap handler
1573 CP is responsible for
1575 ``COMPUTE_PGM_RSRC1.PRIV``.
1576 21 1 bit enable_dx10_clamp Wavefront starts execution
1577 with DX10 clamp mode
1578 enabled. Used by the vector
1579 ALU to force DX-10 style
1580 treatment of NaN's (when
1581 set, clamp NaN to zero,
1585 Used by CP to set up
1586 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1587 22 1 bit debug_mode Must be 0.
1589 Start executing wavefront
1590 in single step mode.
1592 CP is responsible for
1594 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1595 23 1 bit enable_ieee_mode Wavefront starts execution
1597 enabled. Floating point
1598 opcodes that support
1599 exception flag gathering
1600 will quiet and propagate
1601 signaling-NaN inputs per
1602 IEEE 754-2008. Min_dx10 and
1603 max_dx10 become IEEE
1604 754-2008 compliant due to
1605 signaling-NaN propagation
1608 Used by CP to set up
1609 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1610 24 1 bit bulky Must be 0.
1612 Only one work-group allowed
1613 to execute on a compute
1616 CP is responsible for
1618 ``COMPUTE_PGM_RSRC1.BULKY``.
1619 25 1 bit cdbg_user Must be 0.
1621 Flag that can be used to
1622 control debugging code.
1624 CP is responsible for
1626 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1627 31:26 6 bits Reserved. Must be 0.
1628 32 **Total size 4 bytes**
1629 ======= ===================================================================================================================
1633 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1634 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1636 ======= ======= =============================== ===========================================================================
1637 Bits Size Field Name Description
1638 ======= ======= =============================== ===========================================================================
1639 0 1 bit enable_sgpr_private_segment Enable the setup of the
1640 _wave_offset SGPR wave scratch offset
1641 system register (see
1642 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1644 Used by CP to set up
1645 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1646 5:1 5 bits user_sgpr_count The total number of SGPR
1648 requested. This number must
1649 match the number of user
1650 data registers enabled.
1652 Used by CP to set up
1653 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1654 6 1 bit enable_trap_handler Set to 1 if code contains a
1655 TRAP instruction which
1656 requires a trap handler to
1660 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1662 installed a trap handler
1663 regardless of the setting
1665 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1666 system SGPR register for
1667 the work-group id in the X
1669 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1671 Used by CP to set up
1672 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1673 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1674 system SGPR register for
1675 the work-group id in the Y
1677 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1679 Used by CP to set up
1680 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1681 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1682 system SGPR register for
1683 the work-group id in the Z
1685 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1687 Used by CP to set up
1688 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1689 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1690 system SGPR register for
1691 work-group information (see
1692 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1696 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1697 VGPR system registers used
1698 for the work-item ID.
1699 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1702 Used by CP to set up
1703 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1704 13 1 bit enable_exception_address_watch Must be 0.
1706 Wavefront starts execution
1708 exceptions enabled which
1709 are generated when L1 has
1710 witnessed a thread access
1714 CP is responsible for
1715 filling in the address
1717 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1718 according to what the
1720 14 1 bit enable_exception_memory Must be 0.
1722 Wavefront starts execution
1723 with memory violation
1724 exceptions exceptions
1725 enabled which are generated
1726 when a memory violation has
1727 occurred for this wave from
1729 (write-to-read-only-memory,
1730 mis-aligned atomic, LDS
1731 address out of range,
1732 illegal address, etc.).
1736 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1737 according to what the
1739 23:15 9 bits granulated_lds_size Must be 0.
1741 CP uses the rounded value
1742 from the dispatch packet,
1743 not this value, as the
1744 dispatch may contain
1745 dynamically allocated group
1746 segment memory. CP writes
1748 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1750 Amount of group segment
1751 (LDS) to allocate for each
1752 work-group. Granularity is
1756 roundup(lds-size / (64 * 4))
1758 roundup(lds-size / (128 * 4))
1760 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1761 _invalid_operation with specified exceptions
1764 Used by CP to set up
1765 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1766 (set from bits 0..6).
1770 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1771 _source input operands is a
1773 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1774 _division_by_zero Zero
1775 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1777 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1779 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1781 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1782 _zero (rcp_iflag_f32 instruction
1784 31 1 bit Reserved. Must be 0.
1785 32 **Total size 4 bytes.**
1786 ======= ===================================================================================================================
1790 .. table:: Floating Point Rounding Mode Enumeration Values
1791 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1793 ===================================== ===== ===============================
1794 Enumeration Name Value Description
1795 ===================================== ===== ===============================
1796 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1797 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1798 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1799 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1800 ===================================== ===== ===============================
1804 .. table:: Floating Point Denorm Mode Enumeration Values
1805 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1807 ===================================== ===== ===============================
1808 Enumeration Name Value Description
1809 ===================================== ===== ===============================
1810 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1812 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1813 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1814 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1815 ===================================== ===== ===============================
1819 .. table:: System VGPR Work-Item ID Enumeration Values
1820 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1822 ===================================== ===== ===============================
1823 Enumeration Name Value Description
1824 ===================================== ===== ===============================
1825 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1826 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1828 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1830 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1831 ===================================== ===== ===============================
1833 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1835 Initial Kernel Execution State
1836 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1838 This section defines the register state that will be set up by the packet
1839 processor prior to the start of execution of every wavefront. This is limited by
1840 the constraints of the hardware controllers of CP/ADC/SPI.
1842 The order of the SGPR registers is defined, but the compiler can specify which
1843 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1844 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1845 for enabled registers are dense starting at SGPR0: the first enabled register is
1846 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1849 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1850 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1851 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1852 initialized. These are then immediately followed by the System SGPRs that are
1853 set up by ADC/SPI and can have different values for each wave of the grid
1856 SGPR register initial state is defined in
1857 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1859 .. table:: SGPR Register Set Up Order
1860 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1862 ========== ========================== ====== ==============================
1863 SGPR Order Name Number Description
1864 (kernel descriptor enable of
1866 ========== ========================== ====== ==============================
1867 First Private Segment Buffer 4 V# that can be used, together
1868 (enable_sgpr_private with Scratch Wave Offset as an
1869 _segment_buffer) offset, to access the private
1870 memory space using a segment
1873 CP uses the value provided by
1875 then Dispatch Ptr 2 64 bit address of AQL dispatch
1876 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1878 then Queue Ptr 2 64 bit address of amd_queue_t
1879 (enable_sgpr_queue_ptr) object for AQL queue on which
1880 the dispatch packet was
1882 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1883 (enable_sgpr_kernarg segment. This is directly
1884 _segment_ptr) copied from the
1885 kernarg_address in the kernel
1888 Having CP load it once avoids
1889 loading it at the beginning of
1891 then Dispatch Id 2 64 bit Dispatch ID of the
1892 (enable_sgpr_dispatch_id) dispatch packet being
1894 then Flat Scratch Init 2 This is 2 SGPRs:
1895 (enable_sgpr_flat_scratch
1899 The first SGPR is a 32 bit
1901 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1902 to per SPI base of memory
1903 for scratch for the queue
1904 executing the kernel
1905 dispatch. CP obtains this
1908 This is the same offset used
1909 in computing the Scratch
1911 address. The value of
1912 Scratch Wave Offset must be
1913 added by the kernel machine
1914 code and moved to SGPRn-4
1915 for use as the FLAT SCRATCH
1919 The second SGPR is 32 bit
1920 byte size of a single
1921 work-item’s scratch memory
1922 usage. This is directly
1923 loaded from the kernel
1924 dispatch packet Private
1925 Segment Byte Size and
1926 rounded up to a multiple of
1929 The kernel code must move to
1930 SGPRn-3 for use as the FLAT
1931 SCRATCH SIZE in flat memory
1932 instructions. Having CP load
1933 it once avoids loading it at
1934 the beginning of every
1937 This is the 64 bit base
1938 address of the per SPI
1939 scratch backing memory
1940 managed by SPI for the queue
1941 executing the kernel
1942 dispatch. CP obtains this
1943 from the runtime (and
1944 divides it if there are
1945 multiple Shader Arrays each
1946 with its own SPI). The value
1947 of Scratch Wave Offset must
1948 be added by the kernel
1949 machine code and moved to
1950 SGPRn-4 and SGPRn-3 for use
1951 as the FLAT SCRATCH BASE in
1952 flat memory instructions.
1953 then Private Segment Size 1 The 32 bit byte size of a
1954 (enable_sgpr_private single work-item’s scratch
1955 _segment_size) memory allocation. This is the
1956 value from the kernel dispatch
1957 packet Private Segment Byte
1958 Size rounded up by CP to a
1961 Having CP load it once avoids
1962 loading it at the beginning of
1965 This is not used for
1966 GFX7-GFX8 since it is the same
1967 value as the second SGPR of
1968 Flat Scratch Init. However, it
1969 may be needed for GFX9 which
1970 changes the meaning of the
1971 Flat Scratch Init value.
1972 then Grid Work-Group Count X 1 32 bit count of the number of
1973 (enable_sgpr_grid work-groups in the X dimension
1974 _workgroup_count_X) for the grid being
1975 executed. Computed from the
1976 fields in the kernel dispatch
1977 packet as ((grid_size.x +
1978 workgroup_size.x - 1) /
1980 then Grid Work-Group Count Y 1 32 bit count of the number of
1981 (enable_sgpr_grid work-groups in the Y dimension
1982 _workgroup_count_Y && for the grid being
1983 less than 16 previous executed. Computed from the
1984 SGPRs) fields in the kernel dispatch
1985 packet as ((grid_size.y +
1986 workgroup_size.y - 1) /
1989 Only initialized if <16
1990 previous SGPRs initialized.
1991 then Grid Work-Group Count Z 1 32 bit count of the number of
1992 (enable_sgpr_grid work-groups in the Z dimension
1993 _workgroup_count_Z && for the grid being
1994 less than 16 previous executed. Computed from the
1995 SGPRs) fields in the kernel dispatch
1996 packet as ((grid_size.z +
1997 workgroup_size.z - 1) /
2000 Only initialized if <16
2001 previous SGPRs initialized.
2002 then Work-Group Id X 1 32 bit work-group id in X
2003 (enable_sgpr_workgroup_id dimension of grid for
2005 then Work-Group Id Y 1 32 bit work-group id in Y
2006 (enable_sgpr_workgroup_id dimension of grid for
2008 then Work-Group Id Z 1 32 bit work-group id in Z
2009 (enable_sgpr_workgroup_id dimension of grid for
2011 then Work-Group Info 1 {first_wave, 14’b0000,
2012 (enable_sgpr_workgroup ordered_append_term[10:0],
2013 _info) threadgroup_size_in_waves[5:0]}
2014 then Scratch Wave Offset 1 32 bit byte offset from base
2015 (enable_sgpr_private of scratch base of queue
2016 _segment_wave_offset) executing the kernel
2017 dispatch. Must be used as an
2019 segment address when using
2020 Scratch Segment Buffer. It
2021 must be used to set up FLAT
2022 SCRATCH for flat addressing
2024 :ref:`amdgpu-amdhsa-flat-scratch`).
2025 ========== ========================== ====== ==============================
2027 The order of the VGPR registers is defined, but the compiler can specify which
2028 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2029 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2030 for enabled registers are dense starting at VGPR0: the first enabled register is
2031 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2034 VGPR register initial state is defined in
2035 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2037 .. table:: VGPR Register Set Up Order
2038 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2040 ========== ========================== ====== ==============================
2041 VGPR Order Name Number Description
2042 (kernel descriptor enable of
2044 ========== ========================== ====== ==============================
2045 First Work-Item Id X 1 32 bit work item id in X
2046 (Always initialized) dimension of work-group for
2048 then Work-Item Id Y 1 32 bit work item id in Y
2049 (enable_vgpr_workitem_id dimension of work-group for
2050 > 0) wavefront lane.
2051 then Work-Item Id Z 1 32 bit work item id in Z
2052 (enable_vgpr_workitem_id dimension of work-group for
2053 > 1) wavefront lane.
2054 ========== ========================== ====== ==============================
2056 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2058 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2060 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2061 combination including none.
2062 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2063 cannot included with the flat scratch init value which is per queue.
2064 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2067 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2068 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2070 The global segment can be accessed either using buffer instructions (GFX6 which
2071 has V# 64 bit address support), flat instructions (GFX7-9), or global
2072 instructions (GFX9).
2074 If buffer operations are used then the compiler can generate a V# with the
2075 following properties:
2079 * ATC: 1 if IOMMU present (such as APU)
2081 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2082 APU and NC for dGPU).
2084 .. _amdgpu-amdhsa-kernel-prolog:
2089 .. _amdgpu-amdhsa-m0:
2095 The M0 register must be initialized with a value at least the total LDS size
2096 if the kernel may access LDS via DS or flat operations. Total LDS size is
2097 available in dispatch packet. For M0, it is also possible to use maximum
2098 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2101 The M0 register is not used for range checking LDS accesses and so does not
2102 need to be initialized in the prolog.
2104 .. _amdgpu-amdhsa-flat-scratch:
2109 If the kernel may use flat operations to access scratch memory, the prolog code
2110 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2111 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2112 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2115 Flat scratch is not supported.
2118 1. The low word of Flat Scratch Init is 32 bit byte offset from
2119 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2120 being managed by SPI for the queue executing the kernel dispatch. This is
2121 the same value used in the Scratch Segment Buffer V# base address. The
2122 prolog must add the value of Scratch Wave Offset to get the wave's byte
2123 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2124 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2125 by 8 before moving into FLAT_SCRATCH_LO.
2126 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2127 work-items scratch memory usage. This is directly loaded from the kernel
2128 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2129 DWORD. Having CP load it once avoids loading it at the beginning of every
2130 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2133 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2134 memory being managed by SPI for the queue executing the kernel dispatch. The
2135 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2136 pair for use as the flat scratch base in flat memory instructions.
2138 .. _amdgpu-amdhsa-memory-model:
2143 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2144 (see :ref:`memmodel`). *The implementation is WIP.*
2147 Update when implementation complete.
2149 Support more relaxed OpenCL memory model to be controlled by environment
2150 component of target triple.
2152 The AMDGPU backend supports the memory synchronization scopes specified in
2153 :ref:`amdgpu-memory-scopes`.
2155 The code sequences used to implement the memory model are defined in table
2156 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2158 The sequences specify the order of instructions that a single thread must
2159 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2160 to other memory instructions executed by the same thread. This allows them to be
2161 moved earlier or later which can allow them to be combined with other instances
2162 of the same instruction, or hoisted/sunk out of loops to improve
2163 performance. Only the instructions related to the memory model are given;
2164 additional ``s_waitcnt`` instructions are required to ensure registers are
2165 defined before being used. These may be able to be combined with the memory
2166 model ``s_waitcnt`` instructions as described above.
2168 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2169 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2170 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2171 memory model which has separate happens-before relations for the global and
2172 local address spaces, and only a fence specifying both global and local address
2173 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2174 allow an address space to be specified the OpenCL fence has to convervatively
2175 assume both local and global address space was specified. However, optimizations
2176 can often be done to eliminate the additional ``s_waitcnt``instructions when
2177 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2178 instructions. The code sequences in the table indicate what can be omitted for
2179 the OpenCL memory. The target triple environment is used to determine if the
2180 source language is OpenCL (see :ref:`amdgpu-opencl`).
2182 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2185 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2186 termed vector memory operations.
2190 * Each agent has multiple compute units (CU).
2191 * Each CU has multiple SIMDs that execute wavefronts.
2192 * The wavefronts for a single work-group are executed in the same CU but may be
2193 executed by different SIMDs.
2194 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2196 * All LDS operations of a CU are performed as wavefront wide operations in a
2197 global order and involve no caching. Completion is reported to a wavefront in
2199 * The LDS memory has multiple request queues shared by the SIMDs of a
2200 CU. Therefore, the LDS operations performed by different waves of a work-group
2201 can be reordered relative to each other, which can result in reordering the
2202 visibility of vector memory operations with respect to LDS operations of other
2203 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2204 ensure synchronization between LDS operations and vector memory operations
2205 between waves of a work-group, but not between operations performed by the
2207 * The vector memory operations are performed as wavefront wide operations and
2208 completion is reported to a wavefront in execution order. The exception is
2209 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2210 vector memory order if they access LDS memory, and out of LDS operation order
2211 if they access global memory.
2212 * The vector memory operations access a vector L1 cache shared by all wavefronts
2213 on a CU. Therefore, no special action is required for coherence between
2214 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2215 coherence between waves executing in different work-groups as they may be
2216 executing on different CUs.
2217 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2218 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2219 scalar operations are used in a restricted way so do not impact the memory
2220 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2221 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2223 * The L2 cache has independent channels to service disjoint ranges of virtual
2225 * Each CU has a separate request queue per channel. Therefore, the vector and
2226 scalar memory operations performed by waves executing in different work-groups
2227 (which may be executing on different CUs) of an agent can be reordered
2228 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2229 synchronization between vector memory operations of different CUs. It ensures a
2230 previous vector memory operation has completed before executing a subsequent
2231 vector memory or LDS operation and so can be used to meet the requirements of
2232 acquire and release.
2233 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2234 of virtual addresses can be set up to bypass it to ensure system coherence.
2236 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2237 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2238 memory, atomic memory orderings are not meaningful and all accesses are treated
2241 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2242 scalar memory instructions). Since the constant address space contents do not
2243 change during the execution of a kernel dispatch it is not legal to perform
2244 stores, and atomic memory orderings are not meaningful and all access are
2245 treated as non-atomic.
2247 A memory synchronization scope wider than work-group is not meaningful for the
2248 group (LDS) address space and is treated as work-group.
2250 The memory model does not support the region address space which is treated as
2253 Acquire memory ordering is not meaningful on store atomic instructions and is
2254 treated as non-atomic.
2256 Release memory ordering is not meaningful on load atomic instructions and is
2257 treated a non-atomic.
2259 Acquire-release memory ordering is not meaningful on load or store atomic
2260 instructions and is treated as acquire and release respectively.
2262 AMDGPU backend only uses scalar memory operations to access memory that is
2263 proven to not change during the execution of the kernel dispatch. This includes
2264 constant address space and global address space for program scope const
2265 variables. Therefore the kernel machine code does not have to maintain the
2266 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2267 and vector L1 caches are invalidated between kernel dispatches by CP since
2268 constant address space data may change between kernel dispatch executions. See
2269 :ref:`amdgpu-amdhsa-memory-spaces`.
2271 The one execption is if scalar writes are used to spill SGPR registers. In this
2272 case the AMDGPU backend ensures the memory location used to spill is never
2273 accessed by vector memory operations at the same time. If scalar writes are used
2274 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2275 return since the locations may be used for vector memory instructions by a
2276 future wave that uses the same scratch area, or a function call that creates a
2277 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2278 as all scalar writes are write-before-read in the same thread.
2280 Scratch backing memory (which is used for the private address space) is accessed
2281 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2282 is only accessed by a single thread, and is always write-before-read,
2283 there is never a need to invalidate these entries from the L1 cache. Hence all
2284 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2287 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2288 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2289 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2290 coherent) and so the L2 cache will coherent with the CPU and other agents.
2292 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2293 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2295 ============ ============ ============== ========== =======================
2296 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2297 Ordering Sync Scope Address
2299 ============ ============ ============== ========== =======================
2301 ---------------------------------------------------------------------------
2302 load *none* *none* - global non-volatile
2303 - generic 1. buffer/global/flat_load
2305 1. buffer/global/flat_load
2307 load *none* *none* - local 1. ds_load
2308 store *none* *none* - global 1. buffer/global/flat_store
2310 store *none* *none* - local 1. ds_store
2311 **Unordered Atomic**
2312 ---------------------------------------------------------------------------
2313 load atomic unordered *any* *any* *Same as non-atomic*.
2314 store atomic unordered *any* *any* *Same as non-atomic*.
2315 atomicrmw unordered *any* *any* *Same as monotonic
2317 **Monotonic Atomic**
2318 ---------------------------------------------------------------------------
2319 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2320 - wavefront - generic
2322 load atomic monotonic - singlethread - local 1. ds_load
2325 load atomic monotonic - agent - global 1. buffer/global/flat_load
2326 - system - generic glc=1
2327 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2328 - wavefront - generic
2332 store atomic monotonic - singlethread - local 1. ds_store
2335 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2336 - wavefront - generic
2340 atomicrmw monotonic - singlethread - local 1. ds_atomic
2344 ---------------------------------------------------------------------------
2345 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2348 load atomic acquire - workgroup - global 1. buffer/global_load
2349 load atomic acquire - workgroup - local 1. ds/flat_load
2350 - generic 2. s_waitcnt lgkmcnt(0)
2354 - Must happen before
2367 load atomic acquire - agent - global 1. buffer/global_load
2369 2. s_waitcnt vmcnt(0)
2371 - Must happen before
2379 3. buffer_wbinvl1_vol
2381 - Must happen before
2391 load atomic acquire - agent - generic 1. flat_load glc=1
2392 - system 2. s_waitcnt vmcnt(0) &
2397 - Must happen before
2400 - Ensures the flat_load
2405 3. buffer_wbinvl1_vol
2407 - Must happen before
2417 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2420 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2421 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2422 - generic 2. waitcnt lgkmcnt(0)
2426 - Must happen before
2439 atomicrmw acquire - agent - global 1. buffer/global_atomic
2440 - system 2. s_waitcnt vmcnt(0)
2442 - Must happen before
2451 3. buffer_wbinvl1_vol
2453 - Must happen before
2463 atomicrmw acquire - agent - generic 1. flat_atomic
2464 - system 2. s_waitcnt vmcnt(0) &
2469 - Must happen before
2478 3. buffer_wbinvl1_vol
2480 - Must happen before
2490 fence acquire - singlethread *none* *none*
2492 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2523 fence-paired-atomic).
2524 - Must happen before
2535 fence-paired-atomic.
2537 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2552 - Could be split into
2561 - s_waitcnt vmcnt(0)
2572 fence-paired-atomic).
2573 - s_waitcnt lgkmcnt(0)
2584 fence-paired-atomic).
2585 - Must happen before
2599 fence-paired-atomic.
2601 2. buffer_wbinvl1_vol
2603 - Must happen before
2604 any following global/generic
2614 ---------------------------------------------------------------------------
2615 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2618 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2628 - Must happen before
2639 2. buffer/global/flat_store
2640 store atomic release - workgroup - local 1. ds_store
2641 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2642 - system - generic lgkmcnt(0)
2646 - Could be split into
2655 - s_waitcnt vmcnt(0)
2662 - s_waitcnt lgkmcnt(0)
2669 - Must happen before
2680 2. buffer/global/ds/flat_store
2681 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2684 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2694 - Must happen before
2705 2. buffer/global/flat_atomic
2706 atomicrmw release - workgroup - local 1. ds_atomic
2707 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2708 - system - generic lgkmcnt(0)
2712 - Could be split into
2721 - s_waitcnt vmcnt(0)
2728 - s_waitcnt lgkmcnt(0)
2735 - Must happen before
2746 2. buffer/global/ds/flat_atomic*
2747 fence release - singlethread *none* *none*
2749 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2769 - Must happen before
2778 fence-paired-atomic).
2785 fence-paired-atomic.
2787 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2802 - Could be split into
2811 - s_waitcnt vmcnt(0)
2818 - s_waitcnt lgkmcnt(0)
2825 - Must happen before
2834 fence-paired-atomic).
2841 fence-paired-atomic.
2843 **Acquire-Release Atomic**
2844 ---------------------------------------------------------------------------
2845 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2848 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2858 - Must happen before
2869 2. buffer/global_atomic
2870 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2871 2. s_waitcnt lgkmcnt(0)
2875 - Must happen before
2888 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2898 - Must happen before
2910 3. s_waitcnt lgkmcnt(0)
2914 - Must happen before
2926 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
2931 - Could be split into
2940 - s_waitcnt vmcnt(0)
2947 - s_waitcnt lgkmcnt(0)
2954 - Must happen before
2965 2. buffer/global_atomic
2966 3. s_waitcnt vmcnt(0)
2968 - Must happen before
2977 4. buffer_wbinvl1_vol
2979 - Must happen before
2989 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
2994 - Could be split into
3003 - s_waitcnt vmcnt(0)
3010 - s_waitcnt lgkmcnt(0)
3017 - Must happen before
3029 3. s_waitcnt vmcnt(0) &
3034 - Must happen before
3043 4. buffer_wbinvl1_vol
3045 - Must happen before
3055 fence acq_rel - singlethread *none* *none*
3057 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3077 - Must happen before
3100 fence-paired-atomic)
3121 fence-paired-atomic).
3126 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3141 - Could be split into
3150 - s_waitcnt vmcnt(0)
3157 - s_waitcnt lgkmcnt(0)
3164 - Must happen before
3169 global/local/generic
3178 fence-paired-atomic)
3190 global/local/generic
3199 fence-paired-atomic).
3204 2. buffer_wbinvl1_vol
3206 - Must happen before
3220 **Sequential Consistent Atomic**
3221 ---------------------------------------------------------------------------
3222 load atomic seq_cst - singlethread - global *Same as corresponding
3223 - wavefront - local load atomic acquire*.
3224 - workgroup - generic
3225 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3227 - generic - Must happen after
3274 instructions same as
3278 store atomic seq_cst - singlethread - global *Same as corresponding
3279 - wavefront - local store atomic release*.
3280 - workgroup - generic
3281 store atomic seq_cst - agent - global *Same as corresponding
3282 - system - generic store atomic release*.
3283 atomicrmw seq_cst - singlethread - global *Same as corresponding
3284 - wavefront - local atomicrmw acq_rel*.
3285 - workgroup - generic
3286 atomicrmw seq_cst - agent - global *Same as corresponding
3287 - system - generic atomicrmw acq_rel*.
3288 fence seq_cst - singlethread *none* *Same as corresponding
3289 - wavefront fence acq_rel*.
3293 ============ ============ ============== ========== =======================
3295 The memory order also adds the single thread optimization constrains defined in
3297 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3299 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3300 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3302 ============ ==============================================================
3303 LLVM Memory Optimization Constraints
3305 ============ ==============================================================
3308 acquire - If a load atomic/atomicrmw then no following load/load
3309 atomic/store/ store atomic/atomicrmw/fence instruction can
3310 be moved before the acquire.
3311 - If a fence then same as load atomic, plus no preceding
3312 associated fence-paired-atomic can be moved after the fence.
3313 release - If a store atomic/atomicrmw then no preceding load/load
3314 atomic/store/ store atomic/atomicrmw/fence instruction can
3315 be moved after the release.
3316 - If a fence then same as store atomic, plus no following
3317 associated fence-paired-atomic can be moved before the
3319 acq_rel Same constraints as both acquire and release.
3320 seq_cst - If a load atomic then same constraints as acquire, plus no
3321 preceding sequentially consistent load atomic/store
3322 atomic/atomicrmw/fence instruction can be moved after the
3324 - If a store atomic then the same constraints as release, plus
3325 no following sequentially consistent load atomic/store
3326 atomic/atomicrmw/fence instruction can be moved before the
3328 - If an atomicrmw/fence then same constraints as acq_rel.
3329 ============ ==============================================================
3334 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3335 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3336 the ``s_trap`` instruction with the following usage:
3338 .. table:: AMDGPU Trap Handler for AMDHSA OS
3339 :name: amdgpu-trap-handler-for-amdhsa-os-table
3341 =================== =============== =============== =======================
3342 Usage Code Sequence Trap Handler Description
3344 =================== =============== =============== =======================
3345 reserved ``s_trap 0x00`` Reserved by hardware.
3346 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3347 ``queue_ptr`` ``debugtrap``
3348 ``VGPR0``: intrinsic (not
3349 ``arg`` implemented).
3350 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3351 ``queue_ptr`` terminated and its
3352 associated queue put
3353 into the error state.
3354 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3355 ``queue_ptr`` installed handled
3356 same as ``llvm.trap``.
3357 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3359 debugger ``s_trap 0x08`` Reserved for debugger.
3360 debugger ``s_trap 0xfe`` Reserved for debugger.
3361 debugger ``s_trap 0xff`` Reserved for debugger.
3362 =================== =============== =============== =======================
3370 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3371 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3372 instructions are handled as follows:
3374 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3375 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3377 =============== =============== ===========================================
3378 Usage Code Sequence Description
3379 =============== =============== ===========================================
3380 llvm.trap s_endpgm Causes wavefront to be terminated.
3381 llvm.debugtrap *none* Compiler warning given that there is no
3382 trap handler installed.
3383 =============== =============== ===========================================
3393 When generating code for the OpenCL language the target triple environment
3394 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3396 When the language is OpenCL the following differences occur:
3398 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3399 2. The AMDGPU backend adds additional arguments to the kernel.
3400 3. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
3403 Specify what affect this has. Hidden arguments added. Additional metadata
3411 When generating code for the OpenCL language the target triple environment
3412 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3414 When the language is OpenCL the following differences occur:
3416 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3419 Specify what affect this has.
3424 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3425 It supports AMDGCN GFX6-GFX8.
3427 This section describes general syntax for instructions and operands. For more
3428 information about instructions, their semantics and supported combinations of
3429 operands, refer to one of instruction set architecture manuals
3430 [AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
3432 An instruction has the following syntax (register operands are normally
3433 comma-separated while extra operands are space-separated):
3435 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3440 The following syntax for register operands is supported:
3442 * SGPR registers: s0, ... or s[0], ...
3443 * VGPR registers: v0, ... or v[0], ...
3444 * TTMP registers: ttmp0, ... or ttmp[0], ...
3445 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3446 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3447 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3448 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3449 * Register index expressions: v[2*2], s[1-1:2-1]
3450 * 'off' indicates that an operand is not enabled
3452 The following extra operands are supported:
3454 * offset, offset0, offset1
3456 * glc, slc, tfe bits
3457 * waitcnt: integer or combination of counter values
3460 - abs (\| \|), neg (\-)
3464 - row_shl, row_shr, row_ror, row_rol
3465 - row_mirror, row_half_mirror, row_bcast
3466 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3467 - row_mask, bank_mask, bound_ctrl
3471 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3472 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3475 Instruction Examples
3476 ~~~~~~~~~~~~~~~~~~~~
3481 .. code-block:: nasm
3483 ds_add_u32 v2, v4 offset:16
3484 ds_write_src2_b64 v2 offset0:4 offset1:8
3485 ds_cmpst_f32 v2, v4, v6
3486 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3489 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3494 .. code-block:: nasm
3496 flat_load_dword v1, v[3:4]
3497 flat_store_dwordx3 v[3:4], v[5:7]
3498 flat_atomic_swap v1, v[3:4], v5 glc
3499 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3500 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3502 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3507 .. code-block:: nasm
3509 buffer_load_dword v1, off, s[4:7], s1
3510 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3511 buffer_store_format_xy v[1:2], off, s[4:7], s1
3513 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3515 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3520 .. code-block:: nasm
3522 s_load_dword s1, s[2:3], 0xfc
3523 s_load_dwordx8 s[8:15], s[2:3], s4
3524 s_load_dwordx16 s[88:103], s[2:3], s4
3528 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3533 .. code-block:: nasm
3536 s_mov_b64 s[0:1], 0x80000000
3538 s_wqm_b64 s[2:3], s[4:5]
3539 s_bcnt0_i32_b64 s1, s[2:3]
3540 s_swappc_b64 s[2:3], s[4:5]
3541 s_cbranch_join s[4:5]
3543 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3548 .. code-block:: nasm
3550 s_add_u32 s1, s2, s3
3551 s_and_b64 s[2:3], s[4:5], s[6:7]
3552 s_cselect_b32 s1, s2, s3
3553 s_andn2_b32 s2, s4, s6
3554 s_lshr_b64 s[2:3], s[4:5], s6
3555 s_ashr_i32 s2, s4, s6
3556 s_bfm_b64 s[2:3], s4, s6
3557 s_bfe_i64 s[2:3], s[4:5], s6
3558 s_cbranch_g_fork s[4:5], s[6:7]
3560 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3565 .. code-block:: nasm
3568 s_bitcmp1_b32 s1, s2
3569 s_bitcmp0_b64 s[2:3], s4
3572 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3577 .. code-block:: nasm
3582 s_waitcnt 0 ; Wait for all counters to be 0
3583 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3584 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3588 s_sendmsg sendmsg(MSG_INTERRUPT)
3591 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3593 Unless otherwise mentioned, little verification is performed on the operands
3594 of SOPP Instructions, so it is up to the programmer to be familiar with the
3595 range or acceptable values.
3600 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3601 the assembler will automatically use optimal encoding based on its operands.
3602 To force specific encoding, one can add a suffix to the opcode of the instruction:
3604 * _e32 for 32-bit VOP1/VOP2/VOPC
3605 * _e64 for 64-bit VOP3
3607 * _sdwa for VOP_SDWA
3609 VOP1/VOP2/VOP3/VOPC examples:
3611 .. code-block:: nasm
3614 v_mov_b32_e32 v1, v2
3616 v_cvt_f64_i32_e32 v[1:2], v2
3617 v_floor_f32_e32 v1, v2
3618 v_bfrev_b32_e32 v1, v2
3619 v_add_f32_e32 v1, v2, v3
3620 v_mul_i32_i24_e64 v1, v2, 3
3621 v_mul_i32_i24_e32 v1, -3, v3
3622 v_mul_i32_i24_e32 v1, -100, v3
3623 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3624 v_max_f16_e32 v1, v2, v3
3628 .. code-block:: nasm
3630 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3631 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3632 v_mov_b32 v0, v0 wave_shl:1
3633 v_mov_b32 v0, v0 row_mirror
3634 v_mov_b32 v0, v0 row_bcast:31
3635 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3636 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3637 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3641 .. code-block:: nasm
3643 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3644 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3645 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3646 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3647 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3649 For full list of supported instructions, refer to "Vector ALU instructions".
3651 HSA Code Object Directives
3652 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3654 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3655 one can specify them with assembler directives.
3657 .hsa_code_object_version major, minor
3658 +++++++++++++++++++++++++++++++++++++
3660 *major* and *minor* are integers that specify the version of the HSA code
3661 object that will be generated by the assembler.
3663 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3664 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3667 *major*, *minor*, and *stepping* are all integers that describe the instruction
3668 set architecture (ISA) version of the assembly program.
3670 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
3671 "AMD" and *arch* should always be equal to "AMDGPU".
3673 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3674 from the value of the -mcpu option that is passed to the assembler.
3676 .amdgpu_hsa_kernel (name)
3677 +++++++++++++++++++++++++
3679 This directives specifies that the symbol with given name is a kernel entry point
3680 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3685 This directive marks the beginning of a list of key / value pairs that are used
3686 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3687 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3688 any amd_kernel_code_t values that are unspecified a default value will be
3689 used. The default value for all keys is 0, with the following exceptions:
3691 - *kernel_code_version_major* defaults to 1.
3692 - *machine_kind* defaults to 1.
3693 - *machine_version_major*, *machine_version_minor*, and
3694 *machine_version_stepping* are derived from the value of the -mcpu option
3695 that is passed to the assembler.
3696 - *kernel_code_entry_byte_offset* defaults to 256.
3697 - *wavefront_size* defaults to 6.
3698 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3699 *private_segment_alignment* default to 4. Note that alignments are specified
3700 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3702 The *.amd_kernel_code_t* directive must be placed immediately after the
3703 function label and before any instructions.
3705 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3706 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3708 Here is an example of a minimal amd_kernel_code_t specification:
3710 .. code-block:: none
3712 .hsa_code_object_version 1,0
3713 .hsa_code_object_isa
3718 .amdgpu_hsa_kernel hello_world
3723 enable_sgpr_kernarg_segment_ptr = 1
3725 compute_pgm_rsrc1_vgprs = 0
3726 compute_pgm_rsrc1_sgprs = 0
3727 compute_pgm_rsrc2_user_sgpr = 2
3728 kernarg_segment_byte_size = 8
3729 wavefront_sgpr_count = 2
3730 workitem_vgpr_count = 3
3731 .end_amd_kernel_code_t
3733 s_load_dwordx2 s[0:1], s[0:1] 0x0
3734 v_mov_b32 v0, 3.14159
3735 s_waitcnt lgkmcnt(0)
3738 flat_store_dword v[1:2], v0
3741 .size hello_world, .Lfunc_end0-hello_world
3743 Additional Documentation
3744 ========================
3746 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3747 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3748 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3749 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3750 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3751 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3752 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3753 .. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3754 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3755 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3756 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3757 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3758 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3759 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3760 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3761 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3762 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3763 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__