OSDN Git Service

AMDGPU: Bring HSA metadata on par with the specification
[android-x86/external-llvm.git] / docs / AMDGPUUsage.rst
1 =============================
2 User Guide for AMDGPU Backend
3 =============================
4
5 .. contents::
6    :local:
7
8 Introduction
9 ============
10
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
14
15 LLVM
16 ====
17
18 .. _amdgpu-target-triples:
19
20 Target Triples
21 --------------
22
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
25
26   .. table:: AMDGPU Target Triples
27      :name: amdgpu-target-triples-table
28
29      ============ ======== ========= ===========
30      Architecture Vendor   OS        Environment
31      ============ ======== ========= ===========
32      r600         amd      <empty>   <empty>
33      amdgcn       amd      <empty>   <empty>
34      amdgcn       amd      amdhsa    <empty>
35      amdgcn       amd      amdhsa    opencl
36      amdgcn       amd      amdhsa    amdgizcl
37      amdgcn       amd      amdhsa    amdgiz
38      amdgcn       amd      amdhsa    hcc
39      ============ ======== ========= ===========
40
41 ``r600-amd--``
42   Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
43   the MESA runtime.
44
45 ``amdgcn-amd--``
46   Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
47   the MESA runtime.
48
49 ``amdgcn-amd-amdhsa-``
50   Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51   compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
52
53 ``amdgcn-amd-amdhsa-opencl``
54   Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
56   :ref:`amdgpu-opencl`.
57
58 ``amdgcn-amd-amdhsa-amdgizcl``
59   Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60   is used (see :ref:`amdgpu-address-spaces`).
61
62 ``amdgcn-amd-amdhsa-amdgiz``
63   Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64   used (see :ref:`amdgpu-address-spaces`).
65
66 ``amdgcn-amd-amdhsa-hcc``
67   Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68   executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
69   :ref:`amdgpu-hcc`.
70
71 .. _amdgpu-processors:
72
73 Processors
74 ----------
75
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
78
79   .. table:: AMDGPU Processors
80      :name: amdgpu-processors-table
81
82      ========== =========== ============ ===== ======= ==================
83      Processor  Alternative Target       dGPU/ Runtime Example
84                 Processor   Triple       APU   Support Products
85                             Architecture
86      ========== =========== ============ ===== ======= ==================
87      **R600** [AMD-R6xx]_
88      --------------------------------------------------------------------
89      r600                   r600         dGPU
90      r630                   r600         dGPU
91      rs880                  r600         dGPU
92      rv670                  r600         dGPU
93      **R700** [AMD-R7xx]_
94      --------------------------------------------------------------------
95      rv710                  r600         dGPU
96      rv730                  r600         dGPU
97      rv770                  r600         dGPU
98      **Evergreen** [AMD-Evergreen]_
99      --------------------------------------------------------------------
100      cedar                  r600         dGPU
101      redwood                r600         dGPU
102      sumo                   r600         dGPU
103      juniper                r600         dGPU
104      cypress                r600         dGPU
105      **Northern Islands** [AMD-Cayman-Trinity]_
106      --------------------------------------------------------------------
107      barts                  r600         dGPU
108      turks                  r600         dGPU
109      caicos                 r600         dGPU
110      cayman                 r600         dGPU
111      **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112      --------------------------------------------------------------------
113      gfx600     - tahiti    amdgcn       dGPU
114      gfx601     - pitcairn  amdgcn       dGPU
115                 - verde
116                 - oland
117                 - hainan
118      **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119      --------------------------------------------------------------------
120      gfx700     - bonaire   amdgcn       dGPU          - Radeon HD 7790
121                                                        - Radeon HD 8770
122                                                        - R7 260
123                                                        - R7 260X
124      \          - kaveri    amdgcn       APU           - A6-7000
125                                                        - A6 Pro-7050B
126                                                        - A8-7100
127                                                        - A8 Pro-7150B
128                                                        - A10-7300
129                                                        - A10 Pro-7350B
130                                                        - FX-7500
131                                                        - A8-7200P
132                                                        - A10-7400P
133                                                        - FX-7600P
134      gfx701     - hawaii    amdgcn       dGPU  ROCm    - FirePro W8100
135                                                        - FirePro W9100
136                                                        - FirePro S9150
137                                                        - FirePro S9170
138      gfx702                              dGPU  ROCm    - Radeon R9 290
139                                                        - Radeon R9 290x
140                                                        - Radeon R390
141                                                        - Radeon R390x
142      gfx703     - kabini    amdgcn       APU           - E1-2100
143                 - mullins                              - E1-2200
144                                                        - E1-2500
145                                                        - E2-3000
146                                                        - E2-3800
147                                                        - A4-5000
148                                                        - A4-5100
149                                                        - A6-5200
150                                                        - A4 Pro-3340B
151      **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152      --------------------------------------------------------------------
153      gfx800     - iceland   amdgcn       dGPU          - FirePro S7150
154                                                        - FirePro S7100
155                                                        - FirePro W7100
156                                                        - Radeon R285
157                                                        - Radeon R9 380
158                                                        - Radeon R9 385
159                                                        - Mobile FirePro
160                                                          M7170
161      gfx801     - carrizo   amdgcn       APU           - A6-8500P
162                                                        - Pro A6-8500B
163                                                        - A8-8600P
164                                                        - Pro A8-8600B
165                                                        - FX-8800P
166                                                        - Pro A12-8800B
167      \                      amdgcn       APU   ROCm    - A10-8700P
168                                                        - Pro A10-8700B
169                                                        - A10-8780P
170      \                      amdgcn       APU           - A10-9600P
171                                                        - A10-9630P
172                                                        - A12-9700P
173                                                        - A12-9730P
174                                                        - FX-9800P
175                                                        - FX-9830P
176      \                      amdgcn       APU           - E2-9010
177                                                        - A6-9210
178                                                        - A9-9410
179      gfx802     - tonga     amdgcn       dGPU  ROCm    Same as gfx800
180      gfx803     - fiji      amdgcn       dGPU  ROCm    - Radeon R9 Nano
181                                                        - Radeon R9 Fury
182                                                        - Radeon R9 FuryX
183                                                        - Radeon Pro Duo
184                                                        - FirePro S9300x2
185                                                        - Radeon Instinct MI8
186      \          - polaris10 amdgcn       dGPU  ROCm    - Radeon RX 470
187                                                        - Radeon RX 480
188                                                        - Radeon Instinct MI6
189      \          - polaris11 amdgcn       dGPU  ROCm    - Radeon RX 460
190      gfx804                 amdgcn       dGPU          Same as gfx803
191      gfx810     - stoney    amdgcn       APU
192      **GCN GFX9** [AMD-Vega]_
193      --------------------------------------------------------------------
194      gfx900                 amdgcn       dGPU          - Radeon Vega
195                                                          Frontier Edition
196                                                        - Radeon RX Vega 56
197                                                        - Radeon RX Vega 64
198                                                        - Radeon RX Vega 64
199                                                          Liquid
200                                                        - Radeon Instinct MI25
201      gfx901                 amdgcn       dGPU  ROCm    Same as gfx900
202                                                        except XNACK is
203                                                        enabled
204      gfx902                 amdgcn       APU           *TBA*
205
206                                                        .. TODO
207                                                           Add product
208                                                           names.
209      gfx903                 amdgcn       APU           Same as gfx902
210                                                        except XNACK is
211                                                        enabled
212      ========== =========== ============ ===== ======= ==================
213
214 .. _amdgpu-address-spaces:
215
216 Address Spaces
217 --------------
218
219 The AMDGPU backend uses the following address space mappings.
220
221 The memory space names used in the table, aside from the region memory space, is
222 from the OpenCL standard.
223
224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
225
226   .. table:: Address Space Mapping
227      :name: amdgpu-address-space-mapping-table
228
229      ================== ================= ================= ================= =================
230      LLVM Address Space Memory Space
231      ------------------ -----------------------------------------------------------------------
232      \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
233      ================== ================= ================= ================= =================
234      0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
235      1                  Global            Global            Global            Global
236      2                  Constant          Constant          Constant          Region (GDS)
237      3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238      4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
239      5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
240      ================== ================= ================= ================= =================
241
242 Current Default
243   This is the current default address space mapping used for all languages
244   except hcc. This will shortly be deprecated.
245
246 amdgiz/amdgizcl
247   This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248   is specified as the target triple environment value.
249
250 hcc
251   This is the current address space mapping used when ``hcc`` is specified as
252   the target triple environment value.This will shortly be deprecated.
253
254 Future Default
255   This will shortly be the only address space mapping for all languages using
256   AMDGPU backend.
257
258 .. _amdgpu-memory-scopes:
259
260 Memory Scopes
261 -------------
262
263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
264 backend memory model when the target triple OS is ``amdhsa`` (see
265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
266
267 The memory model supported is based on the HSA memory model [HSA]_ which is
268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269 relation is transitive over the synchonizes-with relation independent of scope,
270 and synchonizes-with allows the memory scope instances to be inclusive (see
271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
272
273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274 inclusion and requires the memory scopes to exactly match. However, this
275 is conservatively correct for OpenCL.
276
277   .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278      :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
279
280      ================ ==========================================================
281      LLVM Sync Scope  Description
282      ================ ==========================================================
283      *none*           The default: ``system``.
284
285                       Synchronizes with, and participates in modification and
286                       seq_cst total orderings with, other operations (except
287                       image operations) for all address spaces (except private,
288                       or generic that accesses private) provided the other
289                       operation's sync scope is:
290
291                       - ``system``.
292                       - ``agent`` and executed by a thread on the same agent.
293                       - ``workgroup`` and executed by a thread in the same
294                         workgroup.
295                       - ``wavefront`` and executed by a thread in the same
296                         wavefront.
297
298      ``agent``        Synchronizes with, and participates in modification and
299                       seq_cst total orderings with, other operations (except
300                       image operations) for all address spaces (except private,
301                       or generic that accesses private) provided the other
302                       operation's sync scope is:
303
304                       - ``system`` or ``agent`` and executed by a thread on the
305                         same agent.
306                       - ``workgroup`` and executed by a thread in the same
307                         workgroup.
308                       - ``wavefront`` and executed by a thread in the same
309                         wavefront.
310
311      ``workgroup``    Synchronizes with, and participates in modification and
312                       seq_cst total orderings with, other operations (except
313                       image operations) for all address spaces (except private,
314                       or generic that accesses private) provided the other
315                       operation's sync scope is:
316
317                       - ``system``, ``agent`` or ``workgroup`` and executed by a
318                         thread in the same workgroup.
319                       - ``wavefront`` and executed by a thread in the same
320                         wavefront.
321
322      ``wavefront``    Synchronizes with, and participates in modification and
323                       seq_cst total orderings with, other operations (except
324                       image operations) for all address spaces (except private,
325                       or generic that accesses private) provided the other
326                       operation's sync scope is:
327
328                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329                         and executed by a thread in the same wavefront.
330
331      ``singlethread`` Only synchronizes with, and participates in modification
332                       and seq_cst total orderings with, other operations (except
333                       image operations) running in the same thread for all
334                       address spaces (for example, in signal handlers).
335      ================ ==========================================================
336
337 AMDGPU Intrinsics
338 -----------------
339
340 The AMDGPU backend implements the following intrinsics.
341
342 *This section is WIP.*
343
344 .. TODO
345    List AMDGPU intrinsics
346
347 Code Object
348 ===========
349
350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351 can be linked by ``lld`` to produce a standard ELF shared code object which can
352 be loaded and executed on an AMDGPU target.
353
354 Header
355 ------
356
357 The AMDGPU backend uses the following ELF header:
358
359   .. table:: AMDGPU ELF Header
360      :name: amdgpu-elf-header-table
361
362      ========================== ===============================
363      Field                      Value
364      ========================== ===============================
365      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
366      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
367      ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA``,
368                                 ``ELFOSABI_AMDGPU_PAL`` or
369                                 ``ELFOSABI_AMDGPU_MESA3D``
370      ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
371                                 ``ELFABIVERSION_AMDGPU_PAL`` or
372                                 ``ELFABIVERSION_AMDGPU_MESA3D``
373      ``e_type``                 ``ET_REL`` or ``ET_DYN``
374      ``e_machine``              ``EM_AMDGPU``
375      ``e_entry``                0
376      ``e_flags``                0
377      ========================== ===============================
378
379 ..
380
381   .. table:: AMDGPU ELF Header Enumeration Values
382      :name: amdgpu-elf-header-enumeration-values-table
383
384      =============================== =====
385      Name                            Value
386      =============================== =====
387      ``EM_AMDGPU``                   224
388      ``ELFOSABI_AMDGPU_HSA``         64
389      ``ELFOSABI_AMDGPU_PAL``         65
390      ``ELFOSABI_AMDGPU_MESA3D``      66
391      ``ELFABIVERSION_AMDGPU_HSA``    1
392      ``ELFABIVERSION_AMDGPU_PAL``    0
393      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
394      =============================== =====
395
396 ``e_ident[EI_CLASS]``
397   The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
398   bit applications.
399
400 ``e_ident[EI_DATA]``
401   All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
402
403 ``e_ident[EI_OSABI]``
404   One of the following AMD GPU architecture specific OS ABIs:
405
406   * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
407     the AMD HSA runtime ABI [HSA]_.
408
409   * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
410     the AMD PAL runtime ABI.
411
412   * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
413     to the AMD MESA runtime ABI.
414
415 ``e_ident[EI_ABIVERSION]``
416   The ABI version of the AMD GPU architecture specific OS ABI to which the code
417   object conforms:
418
419   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
420     runtime ABI.
421
422   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
423     runtime ABI.
424
425   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
426     runtime ABI.
427
428 ``e_type``
429   Can be one of the following values:
430
431
432   ``ET_REL``
433     The type produced by the AMD GPU backend compiler as it is relocatable code
434     object.
435
436   ``ET_DYN``
437     The type produced by the linker as it is a shared code object.
438
439   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
440
441 ``e_machine``
442   The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
443   architecture family. The specific member is specified in the
444   ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
445   :ref:`amdgpu-note-records`).
446
447 ``e_entry``
448   The entry point is 0 as the entry points for individual kernels must be
449   selected in order to invoke them through AQL packets.
450
451 ``e_flags``
452   The value is 0 as no flags are used.
453
454 Sections
455 --------
456
457 An AMDGPU target ELF code object has the standard ELF sections which include:
458
459   .. table:: AMDGPU ELF Sections
460      :name: amdgpu-elf-sections-table
461
462      ================== ================ =================================
463      Name               Type             Attributes
464      ================== ================ =================================
465      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
466      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
467      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
468      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
469      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
470      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
471      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
472      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
473      ``.note``          ``SHT_NOTE``     *none*
474      ``.rela``\ *name*  ``SHT_RELA``     *none*
475      ``.rela.dyn``      ``SHT_RELA``     *none*
476      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
477      ``.shstrtab``      ``SHT_STRTAB``   *none*
478      ``.strtab``        ``SHT_STRTAB``   *none*
479      ``.symtab``        ``SHT_SYMTAB``   *none*
480      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
481      ================== ================ =================================
482
483 These sections have their standard meanings (see [ELF]_) and are only generated
484 if needed.
485
486 ``.debug``\ *\**
487   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
488   DWARF produced by the AMDGPU backend.
489
490 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
491   The standard sections used by a dynamic loader.
492
493 ``.note``
494   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
495   backend.
496
497 ``.rela``\ *name*, ``.rela.dyn``
498   For relocatable code objects, *name* is the name of the section that the
499   relocation records apply. For example, ``.rela.text`` is the section name for
500   relocation records associated with the ``.text`` section.
501
502   For linked shared code objects, ``.rela.dyn`` contains all the relocation
503   records from each of the relocatable code object's ``.rela``\ *name* sections.
504
505   See :ref:`amdgpu-relocation-records` for the relocation records supported by
506   the AMDGPU backend.
507
508 ``.text``
509   The executable machine code for the kernels and functions they call. Generated
510   as position independent code. See :ref:`amdgpu-code-conventions` for
511   information on conventions used in the isa generation.
512
513 .. _amdgpu-note-records:
514
515 Note Records
516 ------------
517
518 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
519 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
520 minimal zero byte padding must be generated to ensure the ``desc`` field size is
521 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
522 be at least 4 to indicate at least 8 byte alignment.
523
524 The AMDGPU backend code object uses the following ELF note records in the
525 ``.note`` section. The *Description* column specifies the layout of the note
526 record’s ``desc`` field. All fields are consecutive bytes. Note records with
527 variable size strings have a corresponding ``*_size`` field that specifies the
528 number of bytes, including the terminating null character, in the string. The
529 string(s) come immediately after the preceding fields.
530
531 Additional note records can be present.
532
533   .. table:: AMDGPU ELF Note Records
534      :name: amdgpu-elf-note-records-table
535
536      ===== ============================== ======================================
537      Name  Type                           Description
538      ===== ============================== ======================================
539      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
540      "AMD" ``NT_AMD_AMDGPU_ISA``          <isa name null terminated string>
541      ===== ============================== ======================================
542
543 ..
544
545   .. table:: AMDGPU ELF Note Record Enumeration Values
546      :name: amdgpu-elf-note-record-enumeration-values-table
547
548      ============================== =====
549      Name                           Value
550      ============================== =====
551      *reserved*                       0-9
552      ``NT_AMD_AMDGPU_HSA_METADATA``    10
553      ``NT_AMD_AMDGPU_ISA``             11
554      ============================== =====
555
556 ``NT_AMD_AMDGPU_ISA``
557   Specifies the instruction set architecture used by the machine code contained
558   in the code object.
559
560   This note record is required for code objects containing machine code for
561   processors matching the ``amdgcn`` architecture in table
562   :ref:`amdgpu-processors`.
563
564   The null terminated string has the following syntax:
565
566     *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
567
568   where:
569
570     *architecture*
571       The architecture from table :ref:`amdgpu-target-triples-table`.
572
573       This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
574       :ref:`amdgpu-target-triples`).
575
576     *vendor*
577       The vendor from table :ref:`amdgpu-target-triples-table`.
578
579       For the AMDGPU backend this is always ``amd``.
580
581     *os*
582       The OS from table :ref:`amdgpu-target-triples-table`.
583
584     *environment*
585       An environment from table :ref:`amdgpu-target-triples-table`, or blank if
586       the environment has no affect on the execution of the code object.
587
588       For the AMDGPU backend this is currently always blank.
589     *processor*
590       The processor from table :ref:`amdgpu-processors-table`.
591
592   For example:
593
594     ``amdgcn-amd-amdhsa--gfx901``
595
596 ``NT_AMD_AMDGPU_HSA_METADATA``
597   Specifies extensible metadata associated with the code objects executed on HSA
598   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
599   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
600   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
601   object metadata string.
602
603 .. _amdgpu-symbols:
604
605 Symbols
606 -------
607
608 Symbols include the following:
609
610   .. table:: AMDGPU ELF Symbols
611      :name: amdgpu-elf-symbols-table
612
613      ===================== ============== ============= ==================
614      Name                  Type           Section       Description
615      ===================== ============== ============= ==================
616      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
617                                           - ``.rodata``
618                                           - ``.bss``
619      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
620      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
621      ===================== ============== ============= ==================
622
623 Global variable
624   Global variables both used and defined by the compilation unit.
625
626   If the symbol is defined in the compilation unit then it is allocated in the
627   appropriate section according to if it has initialized data or is readonly.
628
629   If the symbol is external then its section is ``STN_UNDEF`` and the loader
630   will resolve relocations using the definition provided by another code object
631   or explicitly defined by the runtime.
632
633   All global symbols, whether defined in the compilation unit or external, are
634   accessed by the machine code indirectly through a GOT table entry. This
635   allows them to be preemptable. The GOT table is only supported when the target
636   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
637
638   .. TODO
639      Add description of linked shared object symbols. Seems undefined symbols
640      are marked as STT_NOTYPE.
641
642 Kernel descriptor
643   Every HSA kernel has an associated kernel descriptor. It is the address of the
644   kernel descriptor that is used in the AQL dispatch packet used to invoke the
645   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
646   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
647
648 Kernel entry point
649   Every HSA kernel also has a symbol for its machine code entry point.
650
651 .. _amdgpu-relocation-records:
652
653 Relocation Records
654 ------------------
655
656 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
657 relocatable fields are:
658
659 ``word32``
660   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
661   alignment. These values use the same byte order as other word values in the
662   AMD GPU architecture.
663
664 ``word64``
665   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
666   alignment. These values use the same byte order as other word values in the
667   AMD GPU architecture.
668
669 Following notations are used for specifying relocation calculations:
670
671 **A**
672   Represents the addend used to compute the value of the relocatable field.
673
674 **G**
675   Represents the offset into the global offset table at which the relocation
676   entry’s symbol will reside during execution.
677
678 **GOT**
679   Represents the address of the global offset table.
680
681 **P**
682   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
683   of the storage unit being relocated (computed using ``r_offset``).
684
685 **S**
686   Represents the value of the symbol whose index resides in the relocation
687   entry.
688
689 The following relocation types are supported:
690
691   .. table:: AMDGPU ELF Relocation Records
692      :name: amdgpu-elf-relocation-records-table
693
694      ==========================  =====  ==========  ==============================
695      Relocation Type             Value  Field       Calculation
696      ==========================  =====  ==========  ==============================
697      ``R_AMDGPU_NONE``           0      *none*      *none*
698      ``R_AMDGPU_ABS32_LO``       1      ``word32``  (S + A) & 0xFFFFFFFF
699      ``R_AMDGPU_ABS32_HI``       2      ``word32``  (S + A) >> 32
700      ``R_AMDGPU_ABS64``          3      ``word64``  S + A
701      ``R_AMDGPU_REL32``          4      ``word32``  S + A - P
702      ``R_AMDGPU_REL64``          5      ``word64``  S + A - P
703      ``R_AMDGPU_ABS32``          6      ``word32``  S + A
704      ``R_AMDGPU_GOTPCREL``       7      ``word32``  G + GOT + A - P
705      ``R_AMDGPU_GOTPCREL32_LO``  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
706      ``R_AMDGPU_GOTPCREL32_HI``  9      ``word32``  (G + GOT + A - P) >> 32
707      ``R_AMDGPU_REL32_LO``       10     ``word32``  (S + A - P) & 0xFFFFFFFF
708      ``R_AMDGPU_REL32_HI``       11     ``word32``  (S + A - P) >> 32
709      ==========================  =====  ==========  ==============================
710
711 .. _amdgpu-dwarf:
712
713 DWARF
714 -----
715
716 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
717 information that maps the code object executable code and data to the source
718 language constructs. It can be used by tools such as debuggers and profilers.
719
720 Address Space Mapping
721 ~~~~~~~~~~~~~~~~~~~~~
722
723 The following address space mapping is used:
724
725   .. table:: AMDGPU DWARF Address Space Mapping
726      :name: amdgpu-dwarf-address-space-mapping-table
727
728      =================== =================
729      DWARF Address Space Memory Space
730      =================== =================
731      1                   Private (Scratch)
732      2                   Local (group/LDS)
733      *omitted*           Global
734      *omitted*           Constant
735      *omitted*           Generic (Flat)
736      *not supported*     Region (GDS)
737      =================== =================
738
739 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
740 used in the table.
741
742 An ``address_class`` attribute is generated on pointer type DIEs to specify the
743 DWARF address space of the value of the pointer when it is in the *private* or
744 *local* address space. Otherwise the attribute is omitted.
745
746 An ``XDEREF`` operation is generated in location list expressions for variables
747 that are allocated in the *private* and *local* address space. Otherwise no
748 ``XDREF`` is omitted.
749
750 Register Mapping
751 ~~~~~~~~~~~~~~~~
752
753 *This section is WIP.*
754
755 .. TODO
756    Define DWARF register enumeration.
757
758    If want to present a wavefront state then should expose vector registers as
759    64 wide (rather than per work-item view that LLVM uses). Either as separate
760    registers, or a 64x4 byte single register. In either case use a new LANE op
761    (akin to XDREF) to select the current lane usage in a location
762    expression. This would also allow scalar register spilling to vector register
763    lanes to be expressed (currently no debug information is being generated for
764    spilling). If choose a wide single register approach then use LANE in
765    conjunction with PIECE operation to select the dword part of the register for
766    the current lane. If the separate register approach then use LANE to select
767    the register.
768
769 Source Text
770 ~~~~~~~~~~~
771
772 *This section is WIP.*
773
774 .. TODO
775    DWARF extension to include runtime generated source text.
776
777 .. _amdgpu-code-conventions:
778
779 Code Conventions
780 ================
781
782 This section provides code conventions used for each supported target triple OS
783 (see :ref:`amdgpu-target-triples`).
784
785 AMDHSA
786 ------
787
788 This section provides code conventions used when the target triple OS is
789 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
790
791 .. _amdgpu-amdhsa-hsa-code-object-metadata:
792
793 Code Object Metadata
794 ~~~~~~~~~~~~~~~~~~~~
795
796 The code object metadata specifies extensible metadata associated with the code
797 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
798 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
799 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
800 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
801 information necessary to support the ROCM kernel queries. For example, the
802 segment sizes needed in a dispatch packet. In addition, a high level language
803 runtime may require other information to be included. For example, the AMD
804 OpenCL runtime records kernel argument information.
805
806 The metadata is specified as a YAML formatted string (see [YAML]_ and
807 :doc:`YamlIO`).
808
809 .. TODO
810    Is the string null terminated? It probably should not if YAML allows it to
811    contain null characters, otherwise it should be.
812
813 The metadata is represented as a single YAML document comprised of the mapping
814 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
815 referenced tables.
816
817 For boolean values, the string values of ``false`` and ``true`` are used for
818 false and true respectively.
819
820 Additional information can be added to the mappings. To avoid conflicts, any
821 non-AMD key names should be prefixed by "*vendor-name*.".
822
823   .. table:: AMDHSA Code Object Metadata Mapping
824      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
825
826      ========== ============== ========= =======================================
827      String Key Value Type     Required? Description
828      ========== ============== ========= =======================================
829      "Version"  sequence of    Required  - The first integer is the major
830                 2 integers                 version. Currently 1.
831                                          - The second integer is the minor
832                                            version. Currently 0.
833      "Printf"   sequence of              Each string is encoded information
834                 strings                  about a printf function call. The
835                                          encoded information is organized as
836                                          fields separated by colon (':'):
837
838                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
839
840                                          where:
841
842                                          ``ID``
843                                            A 32 bit integer as a unique id for
844                                            each printf function call
845
846                                          ``N``
847                                            A 32 bit integer equal to the number
848                                            of arguments of printf function call
849                                            minus 1
850
851                                          ``S[i]`` (where i = 0, 1, ... , N-1)
852                                            32 bit integers for the size in bytes
853                                            of the i-th FormatString argument of
854                                            the printf function call
855
856                                          FormatString
857                                            The format string passed to the
858                                            printf function call.
859      "Kernels"  sequence of    Required  Sequence of the mappings for each
860                 mapping                  kernel in the code object. See
861                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
862                                          for the definition of the mapping.
863      ========== ============== ========= =======================================
864
865 ..
866
867   .. table:: AMDHSA Code Object Kernel Metadata Mapping
868      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
869
870      ================= ============== ========= ================================
871      String Key        Value Type     Required? Description
872      ================= ============== ========= ================================
873      "Name"            string         Required  Source name of the kernel.
874      "SymbolName"      string         Required  Name of the kernel
875                                                 descriptor ELF symbol.
876      "Language"        string                   Source language of the kernel.
877                                                 Values include:
878
879                                                 - "OpenCL C"
880                                                 - "OpenCL C++"
881                                                 - "HCC"
882                                                 - "OpenMP"
883
884      "LanguageVersion" sequence of              - The first integer is the major
885                        2 integers                 version.
886                                                 - The second integer is the
887                                                   minor version.
888      "Attrs"           mapping                  Mapping of kernel attributes.
889                                                 See
890                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
891                                                 for the mapping definition.
892      "Args"            sequence of              Sequence of mappings of the
893                        mapping                  kernel arguments. See
894                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
895                                                 for the definition of the mapping.
896      "CodeProps"       mapping                  Mapping of properties related to
897                                                 the kernel code. See
898                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
899                                                 for the mapping definition.
900      "DebugProps"      mapping                  Mapping of properties related to
901                                                 the kernel debugging. See
902                                                 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
903                                                 for the mapping definition.
904      ================= ============== ========= ================================
905
906 ..
907
908   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
909      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
910
911      =================== ============== ========= ==============================
912      String Key          Value Type     Required? Description
913      =================== ============== ========= ==============================
914      "ReqdWorkGroupSize" sequence of              The dispatch work-group size
915                          3 integers               X, Y, Z must correspond to the
916                                                   specified values.
917
918                                                   Corresponds to the OpenCL
919                                                   ``reqd_work_group_size``
920                                                   attribute.
921      "WorkGroupSizeHint" sequence of              The dispatch work-group size
922                          3 integers               X, Y, Z is likely to be the
923                                                   specified values.
924
925                                                   Corresponds to the OpenCL
926                                                   ``work_group_size_hint``
927                                                   attribute.
928      "VecTypeHint"       string                   The name of a scalar or vector
929                                                   type.
930
931                                                   Corresponds to the OpenCL
932                                                   ``vec_type_hint`` attribute.
933
934      "RuntimeHandle"     string                   The external symbol name
935                                                   associated with a kernel.
936                                                   OpenCL runtime allocates a
937                                                   global buffer for the symbol
938                                                   and saves the kernel's address
939                                                   to it, which is used for
940                                                   device side enqueueing. Only
941                                                   available for device side
942                                                   enqueued kernels.
943      =================== ============== ========= ==============================
944
945 ..
946
947   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
948      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
949
950      ================= ============== ========= ================================
951      String Key        Value Type     Required? Description
952      ================= ============== ========= ================================
953      "Name"            string                   Kernel argument name.
954      "TypeName"        string                   Kernel argument type name.
955      "Size"            integer        Required  Kernel argument size in bytes.
956      "Align"           integer        Required  Kernel argument alignment in
957                                                 bytes. Must be a power of two.
958      "ValueKind"       string         Required  Kernel argument kind that
959                                                 specifies how to set up the
960                                                 corresponding argument.
961                                                 Values include:
962
963                                                 "ByValue"
964                                                   The argument is copied
965                                                   directly into the kernarg.
966
967                                                 "GlobalBuffer"
968                                                   A global address space pointer
969                                                   to the buffer data is passed
970                                                   in the kernarg.
971
972                                                 "DynamicSharedPointer"
973                                                   A group address space pointer
974                                                   to dynamically allocated LDS
975                                                   is passed in the kernarg.
976
977                                                 "Sampler"
978                                                   A global address space
979                                                   pointer to a S# is passed in
980                                                   the kernarg.
981
982                                                 "Image"
983                                                   A global address space
984                                                   pointer to a T# is passed in
985                                                   the kernarg.
986
987                                                 "Pipe"
988                                                   A global address space pointer
989                                                   to an OpenCL pipe is passed in
990                                                   the kernarg.
991
992                                                 "Queue"
993                                                   A global address space pointer
994                                                   to an OpenCL device enqueue
995                                                   queue is passed in the
996                                                   kernarg.
997
998                                                 "HiddenGlobalOffsetX"
999                                                   The OpenCL grid dispatch
1000                                                   global offset for the X
1001                                                   dimension is passed in the
1002                                                   kernarg.
1003
1004                                                 "HiddenGlobalOffsetY"
1005                                                   The OpenCL grid dispatch
1006                                                   global offset for the Y
1007                                                   dimension is passed in the
1008                                                   kernarg.
1009
1010                                                 "HiddenGlobalOffsetZ"
1011                                                   The OpenCL grid dispatch
1012                                                   global offset for the Z
1013                                                   dimension is passed in the
1014                                                   kernarg.
1015
1016                                                 "HiddenNone"
1017                                                   An argument that is not used
1018                                                   by the kernel. Space needs to
1019                                                   be left for it, but it does
1020                                                   not need to be set up.
1021
1022                                                 "HiddenPrintfBuffer"
1023                                                   A global address space pointer
1024                                                   to the runtime printf buffer
1025                                                   is passed in kernarg.
1026
1027                                                 "HiddenDefaultQueue"
1028                                                   A global address space pointer
1029                                                   to the OpenCL device enqueue
1030                                                   queue that should be used by
1031                                                   the kernel by default is
1032                                                   passed in the kernarg.
1033
1034                                                 "HiddenCompletionAction"
1035                                                   *TBD*
1036
1037                                                   .. TODO
1038                                                      Add description.
1039
1040      "ValueType"       string         Required  Kernel argument value type. Only
1041                                                 present if "ValueKind" is
1042                                                 "ByValue". For vector data
1043                                                 types, the value is for the
1044                                                 element type. Values include:
1045
1046                                                 - "Struct"
1047                                                 - "I8"
1048                                                 - "U8"
1049                                                 - "I16"
1050                                                 - "U16"
1051                                                 - "F16"
1052                                                 - "I32"
1053                                                 - "U32"
1054                                                 - "F32"
1055                                                 - "I64"
1056                                                 - "U64"
1057                                                 - "F64"
1058
1059                                                 .. TODO
1060                                                    How can it be determined if a
1061                                                    vector type, and what size
1062                                                    vector?
1063      "PointeeAlign"    integer                  Alignment in bytes of pointee
1064                                                 type for pointer type kernel
1065                                                 argument. Must be a power
1066                                                 of 2. Only present if
1067                                                 "ValueKind" is
1068                                                 "DynamicSharedPointer".
1069      "AddrSpaceQual"   string                   Kernel argument address space
1070                                                 qualifier. Only present if
1071                                                 "ValueKind" is "GlobalBuffer" or
1072                                                 "DynamicSharedPointer". Values
1073                                                 are:
1074
1075                                                 - "Private"
1076                                                 - "Global"
1077                                                 - "Constant"
1078                                                 - "Local"
1079                                                 - "Generic"
1080                                                 - "Region"
1081
1082                                                 .. TODO
1083                                                    Is GlobalBuffer only Global
1084                                                    or Constant? Is
1085                                                    DynamicSharedPointer always
1086                                                    Local? Can HCC allow Generic?
1087                                                    How can Private or Region
1088                                                    ever happen?
1089      "AccQual"         string                   Kernel argument access
1090                                                 qualifier. Only present if
1091                                                 "ValueKind" is "Image" or
1092                                                 "Pipe". Values
1093                                                 are:
1094
1095                                                 - "ReadOnly"
1096                                                 - "WriteOnly"
1097                                                 - "ReadWrite"
1098
1099                                                 .. TODO
1100                                                    Does this apply to
1101                                                    GlobalBuffer?
1102      "ActualAccQual"   string                   The actual memory accesses
1103                                                 performed by the kernel on the
1104                                                 kernel argument. Only present if
1105                                                 "ValueKind" is "GlobalBuffer",
1106                                                 "Image", or "Pipe". This may be
1107                                                 more restrictive than indicated
1108                                                 by "AccQual" to reflect what the
1109                                                 kernel actual does. If not
1110                                                 present then the runtime must
1111                                                 assume what is implied by
1112                                                 "AccQual" and "IsConst". Values
1113                                                 are:
1114
1115                                                 - "ReadOnly"
1116                                                 - "WriteOnly"
1117                                                 - "ReadWrite"
1118
1119      "IsConst"         boolean                  Indicates if the kernel argument
1120                                                 is const qualified. Only present
1121                                                 if "ValueKind" is
1122                                                 "GlobalBuffer".
1123
1124      "IsRestrict"      boolean                  Indicates if the kernel argument
1125                                                 is restrict qualified. Only
1126                                                 present if "ValueKind" is
1127                                                 "GlobalBuffer".
1128
1129      "IsVolatile"      boolean                  Indicates if the kernel argument
1130                                                 is volatile qualified. Only
1131                                                 present if "ValueKind" is
1132                                                 "GlobalBuffer".
1133
1134      "IsPipe"          boolean                  Indicates if the kernel argument
1135                                                 is pipe qualified. Only present
1136                                                 if "ValueKind" is "Pipe".
1137
1138                                                 .. TODO
1139                                                    Can GlobalBuffer be pipe
1140                                                    qualified?
1141      ================= ============== ========= ================================
1142
1143 ..
1144
1145   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1146      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1147
1148      ============================ ============== ========= =====================
1149      String Key                   Value Type     Required? Description
1150      ============================ ============== ========= =====================
1151      "KernargSegmentSize"         integer        Required  The size in bytes of
1152                                                            the kernarg segment
1153                                                            that holds the values
1154                                                            of the arguments to
1155                                                            the kernel.
1156      "GroupSegmentFixedSize"      integer        Required  The amount of group
1157                                                            segment memory
1158                                                            required by a
1159                                                            work-group in
1160                                                            bytes. This does not
1161                                                            include any
1162                                                            dynamically allocated
1163                                                            group segment memory
1164                                                            that may be added
1165                                                            when the kernel is
1166                                                            dispatched.
1167      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1168                                                            private address space
1169                                                            memory required for a
1170                                                            work-item in
1171                                                            bytes. If
1172                                                            IsDynamicCallstack
1173                                                            is 1 then additional
1174                                                            space must be added
1175                                                            to this value for the
1176                                                            call stack.
1177      "KernargSegmentAlign"        integer        Required  The maximum byte
1178                                                            alignment of
1179                                                            arguments in the
1180                                                            kernarg segment. Must
1181                                                            be a power of 2.
1182      "WavefrontSize"              integer        Required  Wavefront size. Must
1183                                                            be a power of 2.
1184      "NumSGPRs"                   integer                  Number of scalar
1185                                                            registers used by a
1186                                                            wavefront for
1187                                                            GFX6-GFX9. This
1188                                                            includes the special
1189                                                            SGPRs for VCC, Flat
1190                                                            Scratch (GFX7-GFX9)
1191                                                            and XNACK (for
1192                                                            GFX8-GFX9). It does
1193                                                            not include the 16
1194                                                            SGPR added if a trap
1195                                                            handler is
1196                                                            enabled. It is not
1197                                                            rounded up to the
1198                                                            allocation
1199                                                            granularity.
1200      "NumVGPRs"                   integer                  Number of vector
1201                                                            registers used by
1202                                                            each work-item for
1203                                                            GFX6-GFX9
1204      "MaxFlatWorkgroupSize"       integer                  Maximum flat
1205                                                            work-group size
1206                                                            supported by the
1207                                                            kernel in work-items.
1208      "IsDynamicCallStack"         boolean                  Indicates if the
1209                                                            generated machine
1210                                                            code is using a
1211                                                            dynamically sized
1212                                                            call stack.
1213      "IsXNACKEnabled"             boolean                  Indicates if the
1214                                                            generated machine
1215                                                            code is capable of
1216                                                            supporting XNACK.
1217      ============================ ============== ========= =====================
1218
1219 ..
1220
1221   .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1222      :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1223
1224      =================================== ============== ========= ==============
1225      String Key                          Value Type     Required? Description
1226      =================================== ============== ========= ==============
1227      "DebuggerABIVersion"                sequence of
1228                                          2 integers
1229      "ReservedNumVGPRs"                  integer
1230      "ReservedFirstVGPR"                 integer
1231      "PrivateSegmentBufferSGPR"          integer
1232      "WavefrontPrivateSegmentOffsetSGPR" integer
1233      =================================== ============== ========= ==============
1234
1235 .. TODO
1236    Plan to remove the debug properties metadata.   
1237
1238 Kernel Dispatch
1239 ~~~~~~~~~~~~~~~
1240
1241 The HSA architected queuing language (AQL) defines a user space memory interface
1242 that can be used to control the dispatch of kernels, in an agent independent
1243 way. An agent can have zero or more AQL queues created for it using the ROCm
1244 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1245 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1246 mechanics and packet layouts.
1247
1248 The packet processor of a kernel agent is responsible for detecting and
1249 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1250 packet processor is implemented by the hardware command processor (CP),
1251 asynchronous dispatch controller (ADC) and shader processor input controller
1252 (SPI).
1253
1254 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1255 mode driver to initialize and register the AQL queue with CP.
1256
1257 To dispatch a kernel the following actions are performed. This can occur in the
1258 CPU host program, or from an HSA kernel executing on a GPU.
1259
1260 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1261    executed is obtained.
1262 2. A pointer to the kernel descriptor (see
1263    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1264    obtained. It must be for a kernel that is contained in a code object that that
1265    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1266    associated.
1267 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1268    for a memory region with the kernarg property for the kernel agent that will
1269    execute the kernel. It must be at least 16 byte aligned.
1270 4. Kernel argument values are assigned to the kernel argument memory
1271    allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1272    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1273    memory in the same way constant memory is accessed. (Note that the HSA
1274    specification allows an implementation to copy the kernel argument contents to
1275    another location that is accessed by the kernel.)
1276 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1277    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1278    packet. The packet must be set up, and the final write must use an atomic
1279    store release to set the packet kind to ensure the packet contents are
1280    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1281    notify the kernel agent that the AQL queue has been updated. These rules, and
1282    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1283    System Architecture Specification* [HSA]_.
1284 6. A kernel dispatch packet includes information about the actual dispatch,
1285    such as grid and work-group size, together with information from the code
1286    object about the kernel, such as segment sizes. The ROCm runtime queries on
1287    the kernel symbol can be used to obtain the code object values which are
1288    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1289 7. CP executes micro-code and is responsible for detecting and setting up the
1290    GPU to execute the wavefronts of a kernel dispatch.
1291 8. CP ensures that when the a wavefront starts executing the kernel machine
1292    code, the scalar general purpose registers (SGPR) and vector general purpose
1293    registers (VGPR) are set up as required by the machine code. The required
1294    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1295    register state is defined in
1296    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1297 9. The prolog of the kernel machine code (see
1298    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1299    before continuing executing the machine code that corresponds to the kernel.
1300 10. When the kernel dispatch has completed execution, CP signals the completion
1301     signal specified in the kernel dispatch packet if not 0.
1302
1303 .. _amdgpu-amdhsa-memory-spaces:
1304
1305 Memory Spaces
1306 ~~~~~~~~~~~~~
1307
1308 The memory space properties are:
1309
1310   .. table:: AMDHSA Memory Spaces
1311      :name: amdgpu-amdhsa-memory-spaces-table
1312
1313      ================= =========== ======== ======= ==================
1314      Memory Space Name HSA Segment Hardware Address NULL Value
1315                        Name        Name     Size
1316      ================= =========== ======== ======= ==================
1317      Private           private     scratch  32      0x00000000
1318      Local             group       LDS      32      0xFFFFFFFF
1319      Global            global      global   64      0x0000000000000000
1320      Constant          constant    *same as 64      0x0000000000000000
1321                                    global*
1322      Generic           flat        flat     64      0x0000000000000000
1323      Region            N/A         GDS      32      *not implemented
1324                                                     for AMDHSA*
1325      ================= =========== ======== ======= ==================
1326
1327 The global and constant memory spaces both use global virtual addresses, which
1328 are the same virtual address space used by the CPU. However, some virtual
1329 addresses may only be accessible to the CPU, some only accessible by the GPU,
1330 and some by both.
1331
1332 Using the constant memory space indicates that the data will not change during
1333 the execution of the kernel. This allows scalar read instructions to be
1334 used. The vector and scalar L1 caches are invalidated of volatile data before
1335 each kernel dispatch execution to allow constant memory to change values between
1336 kernel dispatches.
1337
1338 The local memory space uses the hardware Local Data Store (LDS) which is
1339 automatically allocated when the hardware creates work-groups of wavefronts, and
1340 freed when all the wavefronts of a work-group have terminated. The data store
1341 (DS) instructions can be used to access it.
1342
1343 The private memory space uses the hardware scratch memory support. If the kernel
1344 uses scratch, then the hardware allocates memory that is accessed using
1345 wavefront lane dword (4 byte) interleaving. The mapping used from private
1346 address to physical address is:
1347
1348   ``wavefront-scratch-base +
1349   (private-address * wavefront-size * 4) +
1350   (wavefront-lane-id * 4)``
1351
1352 There are different ways that the wavefront scratch base address is determined
1353 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1354 memory can be accessed in an interleaved manner using buffer instruction with
1355 the scratch buffer descriptor and per wave scratch offset, by the scratch
1356 instructions, or by flat instructions. If each lane of a wavefront accesses the
1357 same private address, the interleaving results in adjacent dwords being accessed
1358 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1359 supported except by flat and scratch instructions in GFX9.
1360
1361 The generic address space uses the hardware flat address support available in
1362 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1363 local appertures), that are outside the range of addressible global memory, to
1364 map from a flat address to a private or local address.
1365
1366 FLAT instructions can take a flat address and access global, private (scratch)
1367 and group (LDS) memory depending in if the address is within one of the
1368 apperture ranges. Flat access to scratch requires hardware aperture setup and
1369 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1370 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1371 (see :ref:`amdgpu-amdhsa-m0`).
1372
1373 To convert between a segment address and a flat address the base address of the
1374 appertures address can be used. For GFX7-GFX8 these are available in the
1375 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1376 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1377 GFX9 the appature base addresses are directly available as inline constant
1378 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1379 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1380 which makes it easier to convert from flat to segment or segment to flat.
1381
1382 Image and Samplers
1383 ~~~~~~~~~~~~~~~~~~
1384
1385 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1386 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1387 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1388 enumeration values for the queries that are not trivially deducible from the S#
1389 representation.
1390
1391 HSA Signals
1392 ~~~~~~~~~~~
1393
1394 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1395 structure allocated in memory accessible from both the CPU and GPU. The
1396 structure is defined by the ROCm runtime and subject to change between releases
1397 (see [AMD-ROCm-github]_).
1398
1399 .. _amdgpu-amdhsa-hsa-aql-queue:
1400
1401 HSA AQL Queue
1402 ~~~~~~~~~~~~~
1403
1404 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1405 between releases (see [AMD-ROCm-github]_). For some processors it contains
1406 fields needed to implement certain language features such as the flat address
1407 aperture bases. It also contains fields used by CP such as managing the
1408 allocation of scratch memory.
1409
1410 .. _amdgpu-amdhsa-kernel-descriptor:
1411
1412 Kernel Descriptor
1413 ~~~~~~~~~~~~~~~~~
1414
1415 A kernel descriptor consists of the information needed by CP to initiate the
1416 execution of a kernel, including the entry point address of the machine code
1417 that implements the kernel.
1418
1419 Kernel Descriptor for GFX6-GFX9
1420 +++++++++++++++++++++++++++++++
1421
1422 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1423
1424   .. table:: Kernel Descriptor for GFX6-GFX9
1425      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1426
1427      ======= ======= =============================== ===========================
1428      Bits    Size    Field Name                      Description
1429      ======= ======= =============================== ===========================
1430      31:0    4 bytes group_segment_fixed_size        The amount of fixed local
1431                                                      address space memory
1432                                                      required for a work-group
1433                                                      in bytes. This does not
1434                                                      include any dynamically
1435                                                      allocated local address
1436                                                      space memory that may be
1437                                                      added when the kernel is
1438                                                      dispatched.
1439      63:32   4 bytes private_segment_fixed_size      The amount of fixed
1440                                                      private address space
1441                                                      memory required for a
1442                                                      work-item in bytes. If
1443                                                      is_dynamic_callstack is 1
1444                                                      then additional space must
1445                                                      be added to this value for
1446                                                      the call stack.
1447      95:64   4 bytes max_flat_workgroup_size         Maximum flat work-group
1448                                                      size supported by the
1449                                                      kernel in work-items.
1450      96      1 bit   is_dynamic_call_stack           Indicates if the generated
1451                                                      machine code is using a
1452                                                      dynamically sized call
1453                                                      stack.
1454      97      1 bit   is_xnack_enabled                Indicates if the generated
1455                                                      machine code is capable of
1456                                                      suppoting XNACK.
1457      127:98  30 bits                                 Reserved. Must be 0.
1458      191:128 8 bytes kernel_code_entry_byte_offset   Byte offset (possibly
1459                                                      negative) from base
1460                                                      address of kernel
1461                                                      descriptor to kernel's
1462                                                      entry point instruction
1463                                                      which must be 256 byte
1464                                                      aligned.
1465      383:192 24                                      Reserved. Must be 0.
1466              bytes
1467      415:384 4 bytes compute_pgm_rsrc1               Compute Shader (CS)
1468                                                      program settings used by
1469                                                      CP to set up
1470                                                      ``COMPUTE_PGM_RSRC1``
1471                                                      configuration
1472                                                      register. See
1473                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1474      447:416 4 bytes compute_pgm_rsrc2               Compute Shader (CS)
1475                                                      program settings used by
1476                                                      CP to set up
1477                                                      ``COMPUTE_PGM_RSRC2``
1478                                                      configuration
1479                                                      register. See
1480                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1481      448     1 bit   enable_sgpr_private_segment     Enable the setup of the
1482                      _buffer                         SGPR user data registers
1483                                                      (see
1484                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1485
1486                                                      The total number of SGPR
1487                                                      user data registers
1488                                                      requested must not exceed
1489                                                      16 and match value in
1490                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1491                                                      Any requests beyond 16
1492                                                      will be ignored.
1493      449     1 bit   enable_sgpr_dispatch_ptr        *see above*
1494      450     1 bit   enable_sgpr_queue_ptr           *see above*
1495      451     1 bit   enable_sgpr_kernarg_segment_ptr *see above*
1496      452     1 bit   enable_sgpr_dispatch_id         *see above*
1497      453     1 bit   enable_sgpr_flat_scratch_init   *see above*
1498      454     1 bit   enable_sgpr_private_segment     *see above*
1499                      _size
1500      455     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1501                      _count_X                        should always be 0.
1502      456     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1503                      _count_Y                        should always be 0.
1504      457     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1505                      _count_Z                        should always be 0.
1506      463:458 6 bits                                  Reserved. Must be 0.
1507      511:464 4                                       Reserved. Must be 0.
1508              bytes
1509      512     **Total size 64 bytes.**
1510      ======= ===================================================================
1511
1512 ..
1513
1514   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1515      :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1516
1517      ======= ======= =============================== ===========================================================================
1518      Bits    Size    Field Name                      Description
1519      ======= ======= =============================== ===========================================================================
1520      5:0     6 bits  granulated_workitem_vgpr_count  Number of vector registers
1521                                                      used by each work-item,
1522                                                      granularity is device
1523                                                      specific:
1524
1525                                                      GFX6-9
1526                                                        roundup((max-vgpg + 1)
1527                                                        / 4) - 1
1528
1529                                                      Used by CP to set up
1530                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1531      9:6     4 bits  granulated_wavefront_sgpr_count Number of scalar registers
1532                                                      used by a wavefront,
1533                                                      granularity is device
1534                                                      specific:
1535
1536                                                      GFX6-8
1537                                                        roundup((max-sgpg + 1)
1538                                                        / 8) - 1
1539                                                      GFX9
1540                                                        roundup((max-sgpg + 1)
1541                                                        / 16) - 1
1542
1543                                                      Includes the special SGPRs
1544                                                      for VCC, Flat Scratch (for
1545                                                      GFX7 onwards) and XNACK
1546                                                      (for GFX8 onwards). It does
1547                                                      not include the 16 SGPR
1548                                                      added if a trap handler is
1549                                                      enabled.
1550
1551                                                      Used by CP to set up
1552                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1553      11:10   2 bits  priority                        Must be 0.
1554
1555                                                      Start executing wavefront
1556                                                      at the specified priority.
1557
1558                                                      CP is responsible for
1559                                                      filling in
1560                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1561      13:12   2 bits  float_mode_round_32             Wavefront starts execution
1562                                                      with specified rounding
1563                                                      mode for single (32
1564                                                      bit) floating point
1565                                                      precision floating point
1566                                                      operations.
1567
1568                                                      Floating point rounding
1569                                                      mode values are defined in
1570                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1571
1572                                                      Used by CP to set up
1573                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1574      15:14   2 bits  float_mode_round_16_64          Wavefront starts execution
1575                                                      with specified rounding
1576                                                      denorm mode for half/double (16
1577                                                      and 64 bit) floating point
1578                                                      precision floating point
1579                                                      operations.
1580
1581                                                      Floating point rounding
1582                                                      mode values are defined in
1583                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1584
1585                                                      Used by CP to set up
1586                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1587      17:16   2 bits  float_mode_denorm_32            Wavefront starts execution
1588                                                      with specified denorm mode
1589                                                      for single (32
1590                                                      bit)  floating point
1591                                                      precision floating point
1592                                                      operations.
1593
1594                                                      Floating point denorm mode
1595                                                      values are defined in
1596                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1597
1598                                                      Used by CP to set up
1599                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1600      19:18   2 bits  float_mode_denorm_16_64         Wavefront starts execution
1601                                                      with specified denorm mode
1602                                                      for half/double (16
1603                                                      and 64 bit) floating point
1604                                                      precision floating point
1605                                                      operations.
1606
1607                                                      Floating point denorm mode
1608                                                      values are defined in
1609                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1610
1611                                                      Used by CP to set up
1612                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1613      20      1 bit   priv                            Must be 0.
1614
1615                                                      Start executing wavefront
1616                                                      in privilege trap handler
1617                                                      mode.
1618
1619                                                      CP is responsible for
1620                                                      filling in
1621                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1622      21      1 bit   enable_dx10_clamp               Wavefront starts execution
1623                                                      with DX10 clamp mode
1624                                                      enabled. Used by the vector
1625                                                      ALU to force DX-10 style
1626                                                      treatment of NaN's (when
1627                                                      set, clamp NaN to zero,
1628                                                      otherwise pass NaN
1629                                                      through).
1630
1631                                                      Used by CP to set up
1632                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1633      22      1 bit   debug_mode                      Must be 0.
1634
1635                                                      Start executing wavefront
1636                                                      in single step mode.
1637
1638                                                      CP is responsible for
1639                                                      filling in
1640                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1641      23      1 bit   enable_ieee_mode                Wavefront starts execution
1642                                                      with IEEE mode
1643                                                      enabled. Floating point
1644                                                      opcodes that support
1645                                                      exception flag gathering
1646                                                      will quiet and propagate
1647                                                      signaling-NaN inputs per
1648                                                      IEEE 754-2008. Min_dx10 and
1649                                                      max_dx10 become IEEE
1650                                                      754-2008 compliant due to
1651                                                      signaling-NaN propagation
1652                                                      and quieting.
1653
1654                                                      Used by CP to set up
1655                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1656      24      1 bit   bulky                           Must be 0.
1657
1658                                                      Only one work-group allowed
1659                                                      to execute on a compute
1660                                                      unit.
1661
1662                                                      CP is responsible for
1663                                                      filling in
1664                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1665      25      1 bit   cdbg_user                       Must be 0.
1666
1667                                                      Flag that can be used to
1668                                                      control debugging code.
1669
1670                                                      CP is responsible for
1671                                                      filling in
1672                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1673      31:26   6 bits                                  Reserved. Must be 0.
1674      32      **Total size 4 bytes**
1675      ======= ===================================================================================================================
1676
1677 ..
1678
1679   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1680      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1681
1682      ======= ======= =============================== ===========================================================================
1683      Bits    Size    Field Name                      Description
1684      ======= ======= =============================== ===========================================================================
1685      0       1 bit   enable_sgpr_private_segment     Enable the setup of the
1686                      _wave_offset                    SGPR wave scratch offset
1687                                                      system register (see
1688                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1689
1690                                                      Used by CP to set up
1691                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1692      5:1     5 bits  user_sgpr_count                 The total number of SGPR
1693                                                      user data registers
1694                                                      requested. This number must
1695                                                      match the number of user
1696                                                      data registers enabled.
1697
1698                                                      Used by CP to set up
1699                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1700      6       1 bit   enable_trap_handler             Set to 1 if code contains a
1701                                                      TRAP instruction which
1702                                                      requires a trap handler to
1703                                                      be enabled.
1704
1705                                                      CP sets
1706                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1707                                                      if the runtime has
1708                                                      installed a trap handler
1709                                                      regardless of the setting
1710                                                      of this field.
1711      7       1 bit   enable_sgpr_workgroup_id_x      Enable the setup of the
1712                                                      system SGPR register for
1713                                                      the work-group id in the X
1714                                                      dimension (see
1715                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1716
1717                                                      Used by CP to set up
1718                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1719      8       1 bit   enable_sgpr_workgroup_id_y      Enable the setup of the
1720                                                      system SGPR register for
1721                                                      the work-group id in the Y
1722                                                      dimension (see
1723                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1724
1725                                                      Used by CP to set up
1726                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1727      9       1 bit   enable_sgpr_workgroup_id_z      Enable the setup of the
1728                                                      system SGPR register for
1729                                                      the work-group id in the Z
1730                                                      dimension (see
1731                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1732
1733                                                      Used by CP to set up
1734                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1735      10      1 bit   enable_sgpr_workgroup_info      Enable the setup of the
1736                                                      system SGPR register for
1737                                                      work-group information (see
1738                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1739
1740                                                      Used by CP to set up
1741                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1742      12:11   2 bits  enable_vgpr_workitem_id         Enable the setup of the
1743                                                      VGPR system registers used
1744                                                      for the work-item ID.
1745                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1746                                                      defines the values.
1747
1748                                                      Used by CP to set up
1749                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1750      13      1 bit   enable_exception_address_watch  Must be 0.
1751
1752                                                      Wavefront starts execution
1753                                                      with address watch
1754                                                      exceptions enabled which
1755                                                      are generated when L1 has
1756                                                      witnessed a thread access
1757                                                      an *address of
1758                                                      interest*.
1759
1760                                                      CP is responsible for
1761                                                      filling in the address
1762                                                      watch bit in
1763                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1764                                                      according to what the
1765                                                      runtime requests.
1766      14      1 bit   enable_exception_memory         Must be 0.
1767
1768                                                      Wavefront starts execution
1769                                                      with memory violation
1770                                                      exceptions exceptions
1771                                                      enabled which are generated
1772                                                      when a memory violation has
1773                                                      occurred for this wave from
1774                                                      L1 or LDS
1775                                                      (write-to-read-only-memory,
1776                                                      mis-aligned atomic, LDS
1777                                                      address out of range,
1778                                                      illegal address, etc.).
1779
1780                                                      CP sets the memory
1781                                                      violation bit in
1782                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1783                                                      according to what the
1784                                                      runtime requests.
1785      23:15   9 bits  granulated_lds_size             Must be 0.
1786
1787                                                      CP uses the rounded value
1788                                                      from the dispatch packet,
1789                                                      not this value, as the
1790                                                      dispatch may contain
1791                                                      dynamically allocated group
1792                                                      segment memory. CP writes
1793                                                      directly to
1794                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1795
1796                                                      Amount of group segment
1797                                                      (LDS) to allocate for each
1798                                                      work-group. Granularity is
1799                                                      device specific:
1800
1801                                                      GFX6:
1802                                                        roundup(lds-size / (64 * 4))
1803                                                      GFX7-GFX9:
1804                                                        roundup(lds-size / (128 * 4))
1805
1806      24      1 bit   enable_exception_ieee_754_fp    Wavefront starts execution
1807                      _invalid_operation              with specified exceptions
1808                                                      enabled.
1809
1810                                                      Used by CP to set up
1811                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1812                                                      (set from bits 0..6).
1813
1814                                                      IEEE 754 FP Invalid
1815                                                      Operation
1816      25      1 bit   enable_exception_fp_denormal    FP Denormal one or more
1817                      _source                         input operands is a
1818                                                      denormal number
1819      26      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Division by
1820                      _division_by_zero               Zero
1821      27      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP FP Overflow
1822                      _overflow
1823      28      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Underflow
1824                      _underflow
1825      29      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Inexact
1826                      _inexact
1827      30      1 bit   enable_exception_int_divide_by  Integer Division by Zero
1828                      _zero                           (rcp_iflag_f32 instruction
1829                                                      only)
1830      31      1 bit                                   Reserved. Must be 0.
1831      32      **Total size 4 bytes.**
1832      ======= ===================================================================================================================
1833
1834 ..
1835
1836   .. table:: Floating Point Rounding Mode Enumeration Values
1837      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1838
1839      ===================================== ===== ===============================
1840      Enumeration Name                      Value Description
1841      ===================================== ===== ===============================
1842      AMD_FLOAT_ROUND_MODE_NEAR_EVEN        0     Round Ties To Even
1843      AMD_FLOAT_ROUND_MODE_PLUS_INFINITY    1     Round Toward +infinity
1844      AMD_FLOAT_ROUND_MODE_MINUS_INFINITY   2     Round Toward -infinity
1845      AMD_FLOAT_ROUND_MODE_ZERO             3     Round Toward 0
1846      ===================================== ===== ===============================
1847
1848 ..
1849
1850   .. table:: Floating Point Denorm Mode Enumeration Values
1851      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1852
1853      ===================================== ===== ===============================
1854      Enumeration Name                      Value Description
1855      ===================================== ===== ===============================
1856      AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST   0     Flush Source and Destination
1857                                                  Denorms
1858      AMD_FLOAT_DENORM_MODE_FLUSH_DST       1     Flush Output Denorms
1859      AMD_FLOAT_DENORM_MODE_FLUSH_SRC       2     Flush Source Denorms
1860      AMD_FLOAT_DENORM_MODE_FLUSH_NONE      3     No Flush
1861      ===================================== ===== ===============================
1862
1863 ..
1864
1865   .. table:: System VGPR Work-Item ID Enumeration Values
1866      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1867
1868      ===================================== ===== ===============================
1869      Enumeration Name                      Value Description
1870      ===================================== ===== ===============================
1871      AMD_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension ID.
1872      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1873                                                  dimensions ID.
1874      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1875                                                  dimensions ID.
1876      AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1877      ===================================== ===== ===============================
1878
1879 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1880
1881 Initial Kernel Execution State
1882 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1883
1884 This section defines the register state that will be set up by the packet
1885 processor prior to the start of execution of every wavefront. This is limited by
1886 the constraints of the hardware controllers of CP/ADC/SPI.
1887
1888 The order of the SGPR registers is defined, but the compiler can specify which
1889 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1890 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1891 for enabled registers are dense starting at SGPR0: the first enabled register is
1892 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1893 an SGPR number.
1894
1895 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1896 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1897 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1898 initialized. These are then immediately followed by the System SGPRs that are
1899 set up by ADC/SPI and can have different values for each wave of the grid
1900 dispatch.
1901
1902 SGPR register initial state is defined in
1903 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1904
1905   .. table:: SGPR Register Set Up Order
1906      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1907
1908      ========== ========================== ====== ==============================
1909      SGPR Order Name                       Number Description
1910                 (kernel descriptor enable  of
1911                 field)                     SGPRs
1912      ========== ========================== ====== ==============================
1913      First      Private Segment Buffer     4      V# that can be used, together
1914                 (enable_sgpr_private              with Scratch Wave Offset as an
1915                 _segment_buffer)                  offset, to access the private
1916                                                   memory space using a segment
1917                                                   address.
1918
1919                                                   CP uses the value provided by
1920                                                   the runtime.
1921      then       Dispatch Ptr               2      64 bit address of AQL dispatch
1922                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
1923                                                   actually executing.
1924      then       Queue Ptr                  2      64 bit address of amd_queue_t
1925                 (enable_sgpr_queue_ptr)           object for AQL queue on which
1926                                                   the dispatch packet was
1927                                                   queued.
1928      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
1929                 (enable_sgpr_kernarg              segment. This is directly
1930                 _segment_ptr)                     copied from the
1931                                                   kernarg_address in the kernel
1932                                                   dispatch packet.
1933
1934                                                   Having CP load it once avoids
1935                                                   loading it at the beginning of
1936                                                   every wavefront.
1937      then       Dispatch Id                2      64 bit Dispatch ID of the
1938                 (enable_sgpr_dispatch_id)         dispatch packet being
1939                                                   executed.
1940      then       Flat Scratch Init          2      This is 2 SGPRs:
1941                 (enable_sgpr_flat_scratch
1942                 _init)                            GFX6
1943                                                     Not supported.
1944                                                   GFX7-GFX8
1945                                                     The first SGPR is a 32 bit
1946                                                     byte offset from
1947                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1948                                                     to per SPI base of memory
1949                                                     for scratch for the queue
1950                                                     executing the kernel
1951                                                     dispatch. CP obtains this
1952                                                     from the runtime. (The
1953                                                     Scratch Segment Buffer base
1954                                                     address is
1955                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1956                                                     plus this offset.) The value
1957                                                     of Scratch Wave Offset must
1958                                                     be added to this offset by
1959                                                     the kernel machine code,
1960                                                     right shifted by 8, and
1961                                                     moved to the FLAT_SCRATCH_HI
1962                                                     SGPR register.
1963                                                     FLAT_SCRATCH_HI corresponds
1964                                                     to SGPRn-4 on GFX7, and
1965                                                     SGPRn-6 on GFX8 (where SGPRn
1966                                                     is the highest numbered SGPR
1967                                                     allocated to the wave).
1968                                                     FLAT_SCRATCH_HI is
1969                                                     multiplied by 256 (as it is
1970                                                     in units of 256 bytes) and
1971                                                     added to
1972                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1973                                                     to calculate the per wave
1974                                                     FLAT SCRATCH BASE in flat
1975                                                     memory instructions that
1976                                                     access the scratch
1977                                                     apperture.
1978
1979                                                     The second SGPR is 32 bit
1980                                                     byte size of a single
1981                                                     work-item’s scratch memory
1982                                                     usage. CP obtains this from
1983                                                     the runtime, and it is
1984                                                     always a multiple of DWORD.
1985                                                     CP checks that the value in
1986                                                     the kernel dispatch packet
1987                                                     Private Segment Byte Size is
1988                                                     not larger, and requests the
1989                                                     runtime to increase the
1990                                                     queue's scratch size if
1991                                                     necessary. The kernel code
1992                                                     must move it to
1993                                                     FLAT_SCRATCH_LO which is
1994                                                     SGPRn-3 on GFX7 and SGPRn-5
1995                                                     on GFX8. FLAT_SCRATCH_LO is
1996                                                     used as the FLAT SCRATCH
1997                                                     SIZE in flat memory
1998                                                     instructions. Having CP load
1999                                                     it once avoids loading it at
2000                                                     the beginning of every
2001                                                     wavefront. GFX9 This is the
2002                                                     64 bit base address of the
2003                                                     per SPI scratch backing
2004                                                     memory managed by SPI for
2005                                                     the queue executing the
2006                                                     kernel dispatch. CP obtains
2007                                                     this from the runtime (and
2008                                                     divides it if there are
2009                                                     multiple Shader Arrays each
2010                                                     with its own SPI). The value
2011                                                     of Scratch Wave Offset must
2012                                                     be added by the kernel
2013                                                     machine code and the result
2014                                                     moved to the FLAT_SCRATCH
2015                                                     SGPR which is SGPRn-6 and
2016                                                     SGPRn-5. It is used as the
2017                                                     FLAT SCRATCH BASE in flat
2018                                                     memory instructions. then
2019                                                     Private Segment Size 1 The
2020                                                     32 bit byte size of a
2021                                                     (enable_sgpr_private single
2022                                                     work-item's
2023                                                     scratch_segment_size) memory
2024                                                     allocation. This is the
2025                                                     value from the kernel
2026                                                     dispatch packet Private
2027                                                     Segment Byte Size rounded up
2028                                                     by CP to a multiple of
2029                                                     DWORD.
2030
2031                                                   Having CP load it once avoids
2032                                                   loading it at the beginning of
2033                                                   every wavefront.
2034
2035                                                   This is not used for
2036                                                   GFX7-GFX8 since it is the same
2037                                                   value as the second SGPR of
2038                                                   Flat Scratch Init. However, it
2039                                                   may be needed for GFX9 which
2040                                                   changes the meaning of the
2041                                                   Flat Scratch Init value.
2042      then       Grid Work-Group Count X    1      32 bit count of the number of
2043                 (enable_sgpr_grid                 work-groups in the X dimension
2044                 _workgroup_count_X)               for the grid being
2045                                                   executed. Computed from the
2046                                                   fields in the kernel dispatch
2047                                                   packet as ((grid_size.x +
2048                                                   workgroup_size.x - 1) /
2049                                                   workgroup_size.x).
2050      then       Grid Work-Group Count Y    1      32 bit count of the number of
2051                 (enable_sgpr_grid                 work-groups in the Y dimension
2052                 _workgroup_count_Y &&             for the grid being
2053                 less than 16 previous             executed. Computed from the
2054                 SGPRs)                            fields in the kernel dispatch
2055                                                   packet as ((grid_size.y +
2056                                                   workgroup_size.y - 1) /
2057                                                   workgroupSize.y).
2058
2059                                                   Only initialized if <16
2060                                                   previous SGPRs initialized.
2061      then       Grid Work-Group Count Z    1      32 bit count of the number of
2062                 (enable_sgpr_grid                 work-groups in the Z dimension
2063                 _workgroup_count_Z &&             for the grid being
2064                 less than 16 previous             executed. Computed from the
2065                 SGPRs)                            fields in the kernel dispatch
2066                                                   packet as ((grid_size.z +
2067                                                   workgroup_size.z - 1) /
2068                                                   workgroupSize.z).
2069
2070                                                   Only initialized if <16
2071                                                   previous SGPRs initialized.
2072      then       Work-Group Id X            1      32 bit work-group id in X
2073                 (enable_sgpr_workgroup_id         dimension of grid for
2074                 _X)                               wavefront.
2075      then       Work-Group Id Y            1      32 bit work-group id in Y
2076                 (enable_sgpr_workgroup_id         dimension of grid for
2077                 _Y)                               wavefront.
2078      then       Work-Group Id Z            1      32 bit work-group id in Z
2079                 (enable_sgpr_workgroup_id         dimension of grid for
2080                 _Z)                               wavefront.
2081      then       Work-Group Info            1      {first_wave, 14’b0000,
2082                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2083                 _info)                            threadgroup_size_in_waves[5:0]}
2084      then       Scratch Wave Offset        1      32 bit byte offset from base
2085                 (enable_sgpr_private              of scratch base of queue
2086                 _segment_wave_offset)             executing the kernel
2087                                                   dispatch. Must be used as an
2088                                                   offset with Private
2089                                                   segment address when using
2090                                                   Scratch Segment Buffer. It
2091                                                   must be used to set up FLAT
2092                                                   SCRATCH for flat addressing
2093                                                   (see
2094                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2095      ========== ========================== ====== ==============================
2096
2097 The order of the VGPR registers is defined, but the compiler can specify which
2098 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2099 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2100 for enabled registers are dense starting at VGPR0: the first enabled register is
2101 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2102 VGPR number.
2103
2104 VGPR register initial state is defined in
2105 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2106
2107   .. table:: VGPR Register Set Up Order
2108      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2109
2110      ========== ========================== ====== ==============================
2111      VGPR Order Name                       Number Description
2112                 (kernel descriptor enable  of
2113                 field)                     VGPRs
2114      ========== ========================== ====== ==============================
2115      First      Work-Item Id X             1      32 bit work item id in X
2116                 (Always initialized)              dimension of work-group for
2117                                                   wavefront lane.
2118      then       Work-Item Id Y             1      32 bit work item id in Y
2119                 (enable_vgpr_workitem_id          dimension of work-group for
2120                 > 0)                              wavefront lane.
2121      then       Work-Item Id Z             1      32 bit work item id in Z
2122                 (enable_vgpr_workitem_id          dimension of work-group for
2123                 > 1)                              wavefront lane.
2124      ========== ========================== ====== ==============================
2125
2126 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2127
2128 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2129    registers.
2130 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2131    combination including none.
2132 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2133    cannot included with the flat scratch init value which is per queue.
2134 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2135    or (X, Y, Z).
2136
2137 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2138 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2139
2140 The global segment can be accessed either using buffer instructions (GFX6 which
2141 has V# 64 bit address support), flat instructions (GFX7-9), or global
2142 instructions (GFX9).
2143
2144 If buffer operations are used then the compiler can generate a V# with the
2145 following properties:
2146
2147 * base address of 0
2148 * no swizzle
2149 * ATC: 1 if IOMMU present (such as APU)
2150 * ptr64: 1
2151 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2152   APU and NC for dGPU).
2153
2154 .. _amdgpu-amdhsa-kernel-prolog:
2155
2156 Kernel Prolog
2157 ~~~~~~~~~~~~~
2158
2159 .. _amdgpu-amdhsa-m0:
2160
2161 M0
2162 ++
2163
2164 GFX6-GFX8
2165   The M0 register must be initialized with a value at least the total LDS size
2166   if the kernel may access LDS via DS or flat operations. Total LDS size is
2167   available in dispatch packet. For M0, it is also possible to use maximum
2168   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2169   GFX7-GFX8).
2170 GFX9
2171   The M0 register is not used for range checking LDS accesses and so does not
2172   need to be initialized in the prolog.
2173
2174 .. _amdgpu-amdhsa-flat-scratch:
2175
2176 Flat Scratch
2177 ++++++++++++
2178
2179 If the kernel may use flat operations to access scratch memory, the prolog code
2180 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2181 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2182 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2183
2184 GFX6
2185   Flat scratch is not supported.
2186
2187 GFX7-8
2188   1. The low word of Flat Scratch Init is 32 bit byte offset from
2189      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2190      being managed by SPI for the queue executing the kernel dispatch. This is
2191      the same value used in the Scratch Segment Buffer V# base address. The
2192      prolog must add the value of Scratch Wave Offset to get the wave's byte
2193      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2194      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2195      by 8 before moving into FLAT_SCRATCH_LO.
2196   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2197      work-items scratch memory usage. This is directly loaded from the kernel
2198      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2199      DWORD. Having CP load it once avoids loading it at the beginning of every
2200      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2201      SIZE.
2202 GFX9
2203   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2204   memory being managed by SPI for the queue executing the kernel dispatch. The
2205   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2206   pair for use as the flat scratch base in flat memory instructions.
2207
2208 .. _amdgpu-amdhsa-memory-model:
2209
2210 Memory Model
2211 ~~~~~~~~~~~~
2212
2213 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2214 (see :ref:`memmodel`). *The implementation is WIP.*
2215
2216 .. TODO
2217    Update when implementation complete.
2218
2219    Support more relaxed OpenCL memory model to be controlled by environment
2220    component of target triple.
2221
2222 The AMDGPU backend supports the memory synchronization scopes specified in
2223 :ref:`amdgpu-memory-scopes`.
2224
2225 The code sequences used to implement the memory model are defined in table
2226 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2227
2228 The sequences specify the order of instructions that a single thread must
2229 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2230 to other memory instructions executed by the same thread. This allows them to be
2231 moved earlier or later which can allow them to be combined with other instances
2232 of the same instruction, or hoisted/sunk out of loops to improve
2233 performance. Only the instructions related to the memory model are given;
2234 additional ``s_waitcnt`` instructions are required to ensure registers are
2235 defined before being used. These may be able to be combined with the memory
2236 model ``s_waitcnt`` instructions as described above.
2237
2238 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2239 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2240 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2241 memory model which has separate happens-before relations for the global and
2242 local address spaces, and only a fence specifying both global and local address
2243 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2244 allow an address space to be specified the OpenCL fence has to convervatively
2245 assume both local and global address space was specified. However, optimizations
2246 can often be done to eliminate the additional ``s_waitcnt``instructions when
2247 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2248 instructions. The code sequences in the table indicate what can be omitted for
2249 the OpenCL memory. The target triple environment is used to determine if the
2250 source language is OpenCL (see :ref:`amdgpu-opencl`).
2251
2252 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2253 operations.
2254
2255 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2256 termed vector memory operations.
2257
2258 For GFX6-GFX9:
2259
2260 * Each agent has multiple compute units (CU).
2261 * Each CU has multiple SIMDs that execute wavefronts.
2262 * The wavefronts for a single work-group are executed in the same CU but may be
2263   executed by different SIMDs.
2264 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2265   executing on it.
2266 * All LDS operations of a CU are performed as wavefront wide operations in a
2267   global order and involve no caching. Completion is reported to a wavefront in
2268   execution order.
2269 * The LDS memory has multiple request queues shared by the SIMDs of a
2270   CU. Therefore, the LDS operations performed by different waves of a work-group
2271   can be reordered relative to each other, which can result in reordering the
2272   visibility of vector memory operations with respect to LDS operations of other
2273   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2274   ensure synchronization between LDS operations and vector memory operations
2275   between waves of a work-group, but not between operations performed by the
2276   same wavefront.
2277 * The vector memory operations are performed as wavefront wide operations and
2278   completion is reported to a wavefront in execution order. The exception is
2279   that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2280   vector memory order if they access LDS memory, and out of LDS operation order
2281   if they access global memory.
2282 * The vector memory operations access a vector L1 cache shared by all wavefronts
2283   on a CU. Therefore, no special action is required for coherence between
2284   wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2285   coherence between waves executing in different work-groups as they may be
2286   executing on different CUs.
2287 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2288   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2289   scalar operations are used in a restricted way so do not impact the memory
2290   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2291 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2292   the same agent.
2293 * The L2 cache has independent channels to service disjoint ranges of virtual
2294   addresses.
2295 * Each CU has a separate request queue per channel. Therefore, the vector and
2296   scalar memory operations performed by waves executing in different work-groups
2297   (which may be executing on different CUs) of an agent can be reordered
2298   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2299   synchronization between vector memory operations of different CUs. It ensures a
2300   previous vector memory operation has completed before executing a subsequent
2301   vector memory or LDS operation and so can be used to meet the requirements of
2302   acquire and release.
2303 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2304   of virtual addresses can be set up to bypass it to ensure system coherence.
2305
2306 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2307 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2308 memory, atomic memory orderings are not meaningful and all accesses are treated
2309 as non-atomic.
2310
2311 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2312 scalar memory instructions). Since the constant address space contents do not
2313 change during the execution of a kernel dispatch it is not legal to perform
2314 stores, and atomic memory orderings are not meaningful and all access are
2315 treated as non-atomic.
2316
2317 A memory synchronization scope wider than work-group is not meaningful for the
2318 group (LDS) address space and is treated as work-group.
2319
2320 The memory model does not support the region address space which is treated as
2321 non-atomic.
2322
2323 Acquire memory ordering is not meaningful on store atomic instructions and is
2324 treated as non-atomic.
2325
2326 Release memory ordering is not meaningful on load atomic instructions and is
2327 treated a non-atomic.
2328
2329 Acquire-release memory ordering is not meaningful on load or store atomic
2330 instructions and is treated as acquire and release respectively.
2331
2332 AMDGPU backend only uses scalar memory operations to access memory that is
2333 proven to not change during the execution of the kernel dispatch. This includes
2334 constant address space and global address space for program scope const
2335 variables. Therefore the kernel machine code does not have to maintain the
2336 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2337 and vector L1 caches are invalidated between kernel dispatches by CP since
2338 constant address space data may change between kernel dispatch executions. See
2339 :ref:`amdgpu-amdhsa-memory-spaces`.
2340
2341 The one execption is if scalar writes are used to spill SGPR registers. In this
2342 case the AMDGPU backend ensures the memory location used to spill is never
2343 accessed by vector memory operations at the same time. If scalar writes are used
2344 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2345 return since the locations may be used for vector memory instructions by a
2346 future wave that uses the same scratch area, or a function call that creates a
2347 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2348 as all scalar writes are write-before-read in the same thread.
2349
2350 Scratch backing memory (which is used for the private address space) is accessed
2351 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2352 is only accessed by a single thread, and is always write-before-read,
2353 there is never a need to invalidate these entries from the L1 cache. Hence all
2354 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2355 lines.
2356
2357 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2358 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2359 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2360 coherent) and so the L2 cache will coherent with the CPU and other agents.
2361
2362   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2363      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2364
2365      ============ ============ ============== ========== =======================
2366      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2367                   Ordering     Sync Scope     Address
2368                                               Space
2369      ============ ============ ============== ========== =======================
2370      **Non-Atomic**
2371      ---------------------------------------------------------------------------
2372      load         *none*       *none*         - global   non-volatile
2373                                               - generic    1. buffer/global/flat_load
2374                                                          volatile
2375                                                            1. buffer/global/flat_load
2376                                                               glc=1
2377      load         *none*       *none*         - local    1. ds_load
2378      store        *none*       *none*         - global   1. buffer/global/flat_store
2379                                               - generic
2380      store        *none*       *none*         - local    1. ds_store
2381      **Unordered Atomic**
2382      ---------------------------------------------------------------------------
2383      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2384      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2385      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2386                                                          atomic*.
2387      **Monotonic Atomic**
2388      ---------------------------------------------------------------------------
2389      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2390                                - wavefront    - generic
2391                                - workgroup
2392      load atomic  monotonic    - singlethread - local    1. ds_load
2393                                - wavefront
2394                                - workgroup
2395      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2396                                - system       - generic     glc=1
2397      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2398                                - wavefront    - generic
2399                                - workgroup
2400                                - agent
2401                                - system
2402      store atomic monotonic    - singlethread - local    1. ds_store
2403                                - wavefront
2404                                - workgroup
2405      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2406                                - wavefront    - generic
2407                                - workgroup
2408                                - agent
2409                                - system
2410      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2411                                - wavefront
2412                                - workgroup
2413      **Acquire Atomic**
2414      ---------------------------------------------------------------------------
2415      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2416                                - wavefront    - local
2417                                               - generic
2418      load atomic  acquire      - workgroup    - global   1. buffer/global_load
2419      load atomic  acquire      - workgroup    - local    1. ds/flat_load
2420                                               - generic  2. s_waitcnt lgkmcnt(0)
2421
2422                                                            - If OpenCL, omit
2423                                                              waitcnt.
2424                                                            - Must happen before
2425                                                              any following
2426                                                              global/generic
2427                                                              load/load
2428                                                              atomic/store/store
2429                                                              atomic/atomicrmw.
2430                                                            - Ensures any
2431                                                              following global
2432                                                              data read is no
2433                                                              older than the load
2434                                                              atomic value being
2435                                                              acquired.
2436
2437      load atomic  acquire      - agent        - global   1. buffer/global_load
2438                                - system                     glc=1
2439                                                          2. s_waitcnt vmcnt(0)
2440
2441                                                            - Must happen before
2442                                                              following
2443                                                              buffer_wbinvl1_vol.
2444                                                            - Ensures the load
2445                                                              has completed
2446                                                              before invalidating
2447                                                              the cache.
2448
2449                                                          3. buffer_wbinvl1_vol
2450
2451                                                            - Must happen before
2452                                                              any following
2453                                                              global/generic
2454                                                              load/load
2455                                                              atomic/atomicrmw.
2456                                                            - Ensures that
2457                                                              following
2458                                                              loads will not see
2459                                                              stale global data.
2460
2461      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2462                                - system                  2. s_waitcnt vmcnt(0) &
2463                                                             lgkmcnt(0)
2464
2465                                                            - If OpenCL omit
2466                                                              lgkmcnt(0).
2467                                                            - Must happen before
2468                                                              following
2469                                                              buffer_wbinvl1_vol.
2470                                                            - Ensures the flat_load
2471                                                              has completed
2472                                                              before invalidating
2473                                                              the cache.
2474
2475                                                          3. buffer_wbinvl1_vol
2476
2477                                                            - Must happen before
2478                                                              any following
2479                                                              global/generic
2480                                                              load/load
2481                                                              atomic/atomicrmw.
2482                                                            - Ensures that
2483                                                              following loads
2484                                                              will not see stale
2485                                                              global data.
2486
2487      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2488                                - wavefront    - local
2489                                               - generic
2490      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
2491      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
2492                                               - generic  2. waitcnt lgkmcnt(0)
2493
2494                                                            - If OpenCL, omit
2495                                                              waitcnt.
2496                                                            - Must happen before
2497                                                              any following
2498                                                              global/generic
2499                                                              load/load
2500                                                              atomic/store/store
2501                                                              atomic/atomicrmw.
2502                                                            - Ensures any
2503                                                              following global
2504                                                              data read is no
2505                                                              older than the
2506                                                              atomicrmw value
2507                                                              being acquired.
2508
2509      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
2510                                - system                  2. s_waitcnt vmcnt(0)
2511
2512                                                            - Must happen before
2513                                                              following
2514                                                              buffer_wbinvl1_vol.
2515                                                            - Ensures the
2516                                                              atomicrmw has
2517                                                              completed before
2518                                                              invalidating the
2519                                                              cache.
2520
2521                                                          3. buffer_wbinvl1_vol
2522
2523                                                            - Must happen before
2524                                                              any following
2525                                                              global/generic
2526                                                              load/load
2527                                                              atomic/atomicrmw.
2528                                                            - Ensures that
2529                                                              following loads
2530                                                              will not see stale
2531                                                              global data.
2532
2533      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2534                                - system                  2. s_waitcnt vmcnt(0) &
2535                                                             lgkmcnt(0)
2536
2537                                                            - If OpenCL, omit
2538                                                              lgkmcnt(0).
2539                                                            - Must happen before
2540                                                              following
2541                                                              buffer_wbinvl1_vol.
2542                                                            - Ensures the
2543                                                              atomicrmw has
2544                                                              completed before
2545                                                              invalidating the
2546                                                              cache.
2547
2548                                                          3. buffer_wbinvl1_vol
2549
2550                                                            - Must happen before
2551                                                              any following
2552                                                              global/generic
2553                                                              load/load
2554                                                              atomic/atomicrmw.
2555                                                            - Ensures that
2556                                                              following loads
2557                                                              will not see stale
2558                                                              global data.
2559
2560      fence        acquire      - singlethread *none*     *none*
2561                                - wavefront
2562      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2563
2564                                                            - If OpenCL and
2565                                                              address space is
2566                                                              not generic, omit
2567                                                              waitcnt. However,
2568                                                              since LLVM
2569                                                              currently has no
2570                                                              address space on
2571                                                              the fence need to
2572                                                              conservatively
2573                                                              always generate. If
2574                                                              fence had an
2575                                                              address space then
2576                                                              set to address
2577                                                              space of OpenCL
2578                                                              fence flag, or to
2579                                                              generic if both
2580                                                              local and global
2581                                                              flags are
2582                                                              specified.
2583                                                            - Must happen after
2584                                                              any preceding
2585                                                              local/generic load
2586                                                              atomic/atomicrmw
2587                                                              with an equal or
2588                                                              wider sync scope
2589                                                              and memory ordering
2590                                                              stronger than
2591                                                              unordered (this is
2592                                                              termed the
2593                                                              fence-paired-atomic).
2594                                                            - Must happen before
2595                                                              any following
2596                                                              global/generic
2597                                                              load/load
2598                                                              atomic/store/store
2599                                                              atomic/atomicrmw.
2600                                                            - Ensures any
2601                                                              following global
2602                                                              data read is no
2603                                                              older than the
2604                                                              value read by the
2605                                                              fence-paired-atomic.
2606
2607      fence        acquire      - agent        *none*     1. s_waitcnt vmcnt(0) &
2608                                - system                     lgkmcnt(0)
2609
2610                                                            - If OpenCL and
2611                                                              address space is
2612                                                              not generic, omit
2613                                                              lgkmcnt(0).
2614                                                              However, since LLVM
2615                                                              currently has no
2616                                                              address space on
2617                                                              the fence need to
2618                                                              conservatively
2619                                                              always generate
2620                                                              (see comment for
2621                                                              previous fence).
2622                                                            - Could be split into
2623                                                              separate s_waitcnt
2624                                                              vmcnt(0) and
2625                                                              s_waitcnt
2626                                                              lgkmcnt(0) to allow
2627                                                              them to be
2628                                                              independently moved
2629                                                              according to the
2630                                                              following rules.
2631                                                            - s_waitcnt vmcnt(0)
2632                                                              must happen after
2633                                                              any preceding
2634                                                              global/generic load
2635                                                              atomic/atomicrmw
2636                                                              with an equal or
2637                                                              wider sync scope
2638                                                              and memory ordering
2639                                                              stronger than
2640                                                              unordered (this is
2641                                                              termed the
2642                                                              fence-paired-atomic).
2643                                                            - s_waitcnt lgkmcnt(0)
2644                                                              must happen after
2645                                                              any preceding
2646                                                              group/generic load
2647                                                              atomic/atomicrmw
2648                                                              with an equal or
2649                                                              wider sync scope
2650                                                              and memory ordering
2651                                                              stronger than
2652                                                              unordered (this is
2653                                                              termed the
2654                                                              fence-paired-atomic).
2655                                                            - Must happen before
2656                                                              the following
2657                                                              buffer_wbinvl1_vol.
2658                                                            - Ensures that the
2659                                                              fence-paired atomic
2660                                                              has completed
2661                                                              before invalidating
2662                                                              the
2663                                                              cache. Therefore
2664                                                              any following
2665                                                              locations read must
2666                                                              be no older than
2667                                                              the value read by
2668                                                              the
2669                                                              fence-paired-atomic.
2670
2671                                                          2. buffer_wbinvl1_vol
2672
2673                                                            - Must happen before
2674                                                              any following global/generic
2675                                                              load/load
2676                                                              atomic/store/store
2677                                                              atomic/atomicrmw.
2678                                                            - Ensures that
2679                                                              following loads
2680                                                              will not see stale
2681                                                              global data.
2682
2683      **Release Atomic**
2684      ---------------------------------------------------------------------------
2685      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2686                                - wavefront    - local
2687                                               - generic
2688      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2689                                               - generic
2690                                                            - If OpenCL, omit
2691                                                              waitcnt.
2692                                                            - Must happen after
2693                                                              any preceding
2694                                                              local/generic
2695                                                              load/store/load
2696                                                              atomic/store
2697                                                              atomic/atomicrmw.
2698                                                            - Must happen before
2699                                                              the following
2700                                                              store.
2701                                                            - Ensures that all
2702                                                              memory operations
2703                                                              to local have
2704                                                              completed before
2705                                                              performing the
2706                                                              store that is being
2707                                                              released.
2708
2709                                                          2. buffer/global/flat_store
2710      store atomic release      - workgroup    - local    1. ds_store
2711      store atomic release      - agent        - global   1. s_waitcnt vmcnt(0) &
2712                                - system       - generic     lgkmcnt(0)
2713
2714                                                            - If OpenCL, omit
2715                                                              lgkmcnt(0).
2716                                                            - Could be split into
2717                                                              separate s_waitcnt
2718                                                              vmcnt(0) and
2719                                                              s_waitcnt
2720                                                              lgkmcnt(0) to allow
2721                                                              them to be
2722                                                              independently moved
2723                                                              according to the
2724                                                              following rules.
2725                                                            - s_waitcnt vmcnt(0)
2726                                                              must happen after
2727                                                              any preceding
2728                                                              global/generic
2729                                                              load/store/load
2730                                                              atomic/store
2731                                                              atomic/atomicrmw.
2732                                                            - s_waitcnt lgkmcnt(0)
2733                                                              must happen after
2734                                                              any preceding
2735                                                              local/generic
2736                                                              load/store/load
2737                                                              atomic/store
2738                                                              atomic/atomicrmw.
2739                                                            - Must happen before
2740                                                              the following
2741                                                              store.
2742                                                            - Ensures that all
2743                                                              memory operations
2744                                                              to global have
2745                                                              completed before
2746                                                              performing the
2747                                                              store that is being
2748                                                              released.
2749
2750                                                          2. buffer/global/ds/flat_store
2751      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2752                                - wavefront    - local
2753                                               - generic
2754      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2755                                               - generic
2756                                                            - If OpenCL, omit
2757                                                              waitcnt.
2758                                                            - Must happen after
2759                                                              any preceding
2760                                                              local/generic
2761                                                              load/store/load
2762                                                              atomic/store
2763                                                              atomic/atomicrmw.
2764                                                            - Must happen before
2765                                                              the following
2766                                                              atomicrmw.
2767                                                            - Ensures that all
2768                                                              memory operations
2769                                                              to local have
2770                                                              completed before
2771                                                              performing the
2772                                                              atomicrmw that is
2773                                                              being released.
2774
2775                                                          2. buffer/global/flat_atomic
2776      atomicrmw    release      - workgroup    - local    1. ds_atomic
2777      atomicrmw    release      - agent        - global   1. s_waitcnt vmcnt(0) &
2778                                - system       - generic     lgkmcnt(0)
2779
2780                                                            - If OpenCL, omit
2781                                                              lgkmcnt(0).
2782                                                            - Could be split into
2783                                                              separate s_waitcnt
2784                                                              vmcnt(0) and
2785                                                              s_waitcnt
2786                                                              lgkmcnt(0) to allow
2787                                                              them to be
2788                                                              independently moved
2789                                                              according to the
2790                                                              following rules.
2791                                                            - s_waitcnt vmcnt(0)
2792                                                              must happen after
2793                                                              any preceding
2794                                                              global/generic
2795                                                              load/store/load
2796                                                              atomic/store
2797                                                              atomic/atomicrmw.
2798                                                            - s_waitcnt lgkmcnt(0)
2799                                                              must happen after
2800                                                              any preceding
2801                                                              local/generic
2802                                                              load/store/load
2803                                                              atomic/store
2804                                                              atomic/atomicrmw.
2805                                                            - Must happen before
2806                                                              the following
2807                                                              atomicrmw.
2808                                                            - Ensures that all
2809                                                              memory operations
2810                                                              to global and local
2811                                                              have completed
2812                                                              before performing
2813                                                              the atomicrmw that
2814                                                              is being released.
2815
2816                                                          2. buffer/global/ds/flat_atomic*
2817      fence        release      - singlethread *none*     *none*
2818                                - wavefront
2819      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2820
2821                                                            - If OpenCL and
2822                                                              address space is
2823                                                              not generic, omit
2824                                                              waitcnt. However,
2825                                                              since LLVM
2826                                                              currently has no
2827                                                              address space on
2828                                                              the fence need to
2829                                                              conservatively
2830                                                              always generate
2831                                                              (see comment for
2832                                                              previous fence).
2833                                                            - Must happen after
2834                                                              any preceding
2835                                                              local/generic
2836                                                              load/load
2837                                                              atomic/store/store
2838                                                              atomic/atomicrmw.
2839                                                            - Must happen before
2840                                                              any following store
2841                                                              atomic/atomicrmw
2842                                                              with an equal or
2843                                                              wider sync scope
2844                                                              and memory ordering
2845                                                              stronger than
2846                                                              unordered (this is
2847                                                              termed the
2848                                                              fence-paired-atomic).
2849                                                            - Ensures that all
2850                                                              memory operations
2851                                                              to local have
2852                                                              completed before
2853                                                              performing the
2854                                                              following
2855                                                              fence-paired-atomic.
2856
2857      fence        release      - agent        *none*     1. s_waitcnt vmcnt(0) &
2858                                - system                     lgkmcnt(0)
2859
2860                                                            - If OpenCL and
2861                                                              address space is
2862                                                              not generic, omit
2863                                                              lgkmcnt(0).
2864                                                              However, since LLVM
2865                                                              currently has no
2866                                                              address space on
2867                                                              the fence need to
2868                                                              conservatively
2869                                                              always generate
2870                                                              (see comment for
2871                                                              previous fence).
2872                                                            - Could be split into
2873                                                              separate s_waitcnt
2874                                                              vmcnt(0) and
2875                                                              s_waitcnt
2876                                                              lgkmcnt(0) to allow
2877                                                              them to be
2878                                                              independently moved
2879                                                              according to the
2880                                                              following rules.
2881                                                            - s_waitcnt vmcnt(0)
2882                                                              must happen after
2883                                                              any preceding
2884                                                              global/generic
2885                                                              load/store/load
2886                                                              atomic/store
2887                                                              atomic/atomicrmw.
2888                                                            - s_waitcnt lgkmcnt(0)
2889                                                              must happen after
2890                                                              any preceding
2891                                                              local/generic
2892                                                              load/store/load
2893                                                              atomic/store
2894                                                              atomic/atomicrmw.
2895                                                            - Must happen before
2896                                                              any following store
2897                                                              atomic/atomicrmw
2898                                                              with an equal or
2899                                                              wider sync scope
2900                                                              and memory ordering
2901                                                              stronger than
2902                                                              unordered (this is
2903                                                              termed the
2904                                                              fence-paired-atomic).
2905                                                            - Ensures that all
2906                                                              memory operations
2907                                                              to global have
2908                                                              completed before
2909                                                              performing the
2910                                                              following
2911                                                              fence-paired-atomic.
2912
2913      **Acquire-Release Atomic**
2914      ---------------------------------------------------------------------------
2915      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
2916                                - wavefront    - local
2917                                               - generic
2918      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2919
2920                                                            - If OpenCL, omit
2921                                                              waitcnt.
2922                                                            - Must happen after
2923                                                              any preceding
2924                                                              local/generic
2925                                                              load/store/load
2926                                                              atomic/store
2927                                                              atomic/atomicrmw.
2928                                                            - Must happen before
2929                                                              the following
2930                                                              atomicrmw.
2931                                                            - Ensures that all
2932                                                              memory operations
2933                                                              to local have
2934                                                              completed before
2935                                                              performing the
2936                                                              atomicrmw that is
2937                                                              being released.
2938
2939                                                          2. buffer/global_atomic
2940      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
2941                                                          2. s_waitcnt lgkmcnt(0)
2942
2943                                                            - If OpenCL, omit
2944                                                              waitcnt.
2945                                                            - Must happen before
2946                                                              any following
2947                                                              global/generic
2948                                                              load/load
2949                                                              atomic/store/store
2950                                                              atomic/atomicrmw.
2951                                                            - Ensures any
2952                                                              following global
2953                                                              data read is no
2954                                                              older than the load
2955                                                              atomic value being
2956                                                              acquired.
2957
2958      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2959
2960                                                            - If OpenCL, omit
2961                                                              waitcnt.
2962                                                            - Must happen after
2963                                                              any preceding
2964                                                              local/generic
2965                                                              load/store/load
2966                                                              atomic/store
2967                                                              atomic/atomicrmw.
2968                                                            - Must happen before
2969                                                              the following
2970                                                              atomicrmw.
2971                                                            - Ensures that all
2972                                                              memory operations
2973                                                              to local have
2974                                                              completed before
2975                                                              performing the
2976                                                              atomicrmw that is
2977                                                              being released.
2978
2979                                                          2. flat_atomic
2980                                                          3. s_waitcnt lgkmcnt(0)
2981
2982                                                            - If OpenCL, omit
2983                                                              waitcnt.
2984                                                            - Must happen before
2985                                                              any following
2986                                                              global/generic
2987                                                              load/load
2988                                                              atomic/store/store
2989                                                              atomic/atomicrmw.
2990                                                            - Ensures any
2991                                                              following global
2992                                                              data read is no
2993                                                              older than the load
2994                                                              atomic value being
2995                                                              acquired.
2996      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt vmcnt(0) &
2997                                - system                     lgkmcnt(0)
2998
2999                                                            - If OpenCL, omit
3000                                                              lgkmcnt(0).
3001                                                            - Could be split into
3002                                                              separate s_waitcnt
3003                                                              vmcnt(0) and
3004                                                              s_waitcnt
3005                                                              lgkmcnt(0) to allow
3006                                                              them to be
3007                                                              independently moved
3008                                                              according to the
3009                                                              following rules.
3010                                                            - s_waitcnt vmcnt(0)
3011                                                              must happen after
3012                                                              any preceding
3013                                                              global/generic
3014                                                              load/store/load
3015                                                              atomic/store
3016                                                              atomic/atomicrmw.
3017                                                            - s_waitcnt lgkmcnt(0)
3018                                                              must happen after
3019                                                              any preceding
3020                                                              local/generic
3021                                                              load/store/load
3022                                                              atomic/store
3023                                                              atomic/atomicrmw.
3024                                                            - Must happen before
3025                                                              the following
3026                                                              atomicrmw.
3027                                                            - Ensures that all
3028                                                              memory operations
3029                                                              to global have
3030                                                              completed before
3031                                                              performing the
3032                                                              atomicrmw that is
3033                                                              being released.
3034
3035                                                          2. buffer/global_atomic
3036                                                          3. s_waitcnt vmcnt(0)
3037
3038                                                            - Must happen before
3039                                                              following
3040                                                              buffer_wbinvl1_vol.
3041                                                            - Ensures the
3042                                                              atomicrmw has
3043                                                              completed before
3044                                                              invalidating the
3045                                                              cache.
3046
3047                                                          4. buffer_wbinvl1_vol
3048
3049                                                            - Must happen before
3050                                                              any following
3051                                                              global/generic
3052                                                              load/load
3053                                                              atomic/atomicrmw.
3054                                                            - Ensures that
3055                                                              following loads
3056                                                              will not see stale
3057                                                              global data.
3058
3059      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt vmcnt(0) &
3060                                - system                     lgkmcnt(0)
3061
3062                                                            - If OpenCL, omit
3063                                                              lgkmcnt(0).
3064                                                            - Could be split into
3065                                                              separate s_waitcnt
3066                                                              vmcnt(0) and
3067                                                              s_waitcnt
3068                                                              lgkmcnt(0) to allow
3069                                                              them to be
3070                                                              independently moved
3071                                                              according to the
3072                                                              following rules.
3073                                                            - s_waitcnt vmcnt(0)
3074                                                              must happen after
3075                                                              any preceding
3076                                                              global/generic
3077                                                              load/store/load
3078                                                              atomic/store
3079                                                              atomic/atomicrmw.
3080                                                            - s_waitcnt lgkmcnt(0)
3081                                                              must happen after
3082                                                              any preceding
3083                                                              local/generic
3084                                                              load/store/load
3085                                                              atomic/store
3086                                                              atomic/atomicrmw.
3087                                                            - Must happen before
3088                                                              the following
3089                                                              atomicrmw.
3090                                                            - Ensures that all
3091                                                              memory operations
3092                                                              to global have
3093                                                              completed before
3094                                                              performing the
3095                                                              atomicrmw that is
3096                                                              being released.
3097
3098                                                          2. flat_atomic
3099                                                          3. s_waitcnt vmcnt(0) &
3100                                                             lgkmcnt(0)
3101
3102                                                            - If OpenCL, omit
3103                                                              lgkmcnt(0).
3104                                                            - Must happen before
3105                                                              following
3106                                                              buffer_wbinvl1_vol.
3107                                                            - Ensures the
3108                                                              atomicrmw has
3109                                                              completed before
3110                                                              invalidating the
3111                                                              cache.
3112
3113                                                          4. buffer_wbinvl1_vol
3114
3115                                                            - Must happen before
3116                                                              any following
3117                                                              global/generic
3118                                                              load/load
3119                                                              atomic/atomicrmw.
3120                                                            - Ensures that
3121                                                              following loads
3122                                                              will not see stale
3123                                                              global data.
3124
3125      fence        acq_rel      - singlethread *none*     *none*
3126                                - wavefront
3127      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3128
3129                                                            - If OpenCL and
3130                                                              address space is
3131                                                              not generic, omit
3132                                                              waitcnt. However,
3133                                                              since LLVM
3134                                                              currently has no
3135                                                              address space on
3136                                                              the fence need to
3137                                                              conservatively
3138                                                              always generate
3139                                                              (see comment for
3140                                                              previous fence).
3141                                                            - Must happen after
3142                                                              any preceding
3143                                                              local/generic
3144                                                              load/load
3145                                                              atomic/store/store
3146                                                              atomic/atomicrmw.
3147                                                            - Must happen before
3148                                                              any following
3149                                                              global/generic
3150                                                              load/load
3151                                                              atomic/store/store
3152                                                              atomic/atomicrmw.
3153                                                            - Ensures that all
3154                                                              memory operations
3155                                                              to local have
3156                                                              completed before
3157                                                              performing any
3158                                                              following global
3159                                                              memory operations.
3160                                                            - Ensures that the
3161                                                              preceding
3162                                                              local/generic load
3163                                                              atomic/atomicrmw
3164                                                              with an equal or
3165                                                              wider sync scope
3166                                                              and memory ordering
3167                                                              stronger than
3168                                                              unordered (this is
3169                                                              termed the
3170                                                              fence-paired-atomic)
3171                                                              has completed
3172                                                              before following
3173                                                              global memory
3174                                                              operations. This
3175                                                              satisfies the
3176                                                              requirements of
3177                                                              acquire.
3178                                                            - Ensures that all
3179                                                              previous memory
3180                                                              operations have
3181                                                              completed before a
3182                                                              following
3183                                                              local/generic store
3184                                                              atomic/atomicrmw
3185                                                              with an equal or
3186                                                              wider sync scope
3187                                                              and memory ordering
3188                                                              stronger than
3189                                                              unordered (this is
3190                                                              termed the
3191                                                              fence-paired-atomic).
3192                                                              This satisfies the
3193                                                              requirements of
3194                                                              release.
3195
3196      fence        acq_rel      - agent        *none*     1. s_waitcnt vmcnt(0) &
3197                                - system                     lgkmcnt(0)
3198
3199                                                            - If OpenCL and
3200                                                              address space is
3201                                                              not generic, omit
3202                                                              lgkmcnt(0).
3203                                                              However, since LLVM
3204                                                              currently has no
3205                                                              address space on
3206                                                              the fence need to
3207                                                              conservatively
3208                                                              always generate
3209                                                              (see comment for
3210                                                              previous fence).
3211                                                            - Could be split into
3212                                                              separate s_waitcnt
3213                                                              vmcnt(0) and
3214                                                              s_waitcnt
3215                                                              lgkmcnt(0) to allow
3216                                                              them to be
3217                                                              independently moved
3218                                                              according to the
3219                                                              following rules.
3220                                                            - s_waitcnt vmcnt(0)
3221                                                              must happen after
3222                                                              any preceding
3223                                                              global/generic
3224                                                              load/store/load
3225                                                              atomic/store
3226                                                              atomic/atomicrmw.
3227                                                            - s_waitcnt lgkmcnt(0)
3228                                                              must happen after
3229                                                              any preceding
3230                                                              local/generic
3231                                                              load/store/load
3232                                                              atomic/store
3233                                                              atomic/atomicrmw.
3234                                                            - Must happen before
3235                                                              the following
3236                                                              buffer_wbinvl1_vol.
3237                                                            - Ensures that the
3238                                                              preceding
3239                                                              global/local/generic
3240                                                              load
3241                                                              atomic/atomicrmw
3242                                                              with an equal or
3243                                                              wider sync scope
3244                                                              and memory ordering
3245                                                              stronger than
3246                                                              unordered (this is
3247                                                              termed the
3248                                                              fence-paired-atomic)
3249                                                              has completed
3250                                                              before invalidating
3251                                                              the cache. This
3252                                                              satisfies the
3253                                                              requirements of
3254                                                              acquire.
3255                                                            - Ensures that all
3256                                                              previous memory
3257                                                              operations have
3258                                                              completed before a
3259                                                              following
3260                                                              global/local/generic
3261                                                              store
3262                                                              atomic/atomicrmw
3263                                                              with an equal or
3264                                                              wider sync scope
3265                                                              and memory ordering
3266                                                              stronger than
3267                                                              unordered (this is
3268                                                              termed the
3269                                                              fence-paired-atomic).
3270                                                              This satisfies the
3271                                                              requirements of
3272                                                              release.
3273
3274                                                          2. buffer_wbinvl1_vol
3275
3276                                                            - Must happen before
3277                                                              any following
3278                                                              global/generic
3279                                                              load/load
3280                                                              atomic/store/store
3281                                                              atomic/atomicrmw.
3282                                                            - Ensures that
3283                                                              following loads
3284                                                              will not see stale
3285                                                              global data. This
3286                                                              satisfies the
3287                                                              requirements of
3288                                                              acquire.
3289
3290      **Sequential Consistent Atomic**
3291      ---------------------------------------------------------------------------
3292      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3293                                - wavefront    - local    load atomic acquire*.
3294                                - workgroup    - generic
3295      load atomic  seq_cst      - agent        - global   1. s_waitcnt vmcnt(0)
3296                                - system       - local
3297                                               - generic    - Must happen after
3298                                                              preceding
3299                                                              global/generic load
3300                                                              atomic/store
3301                                                              atomic/atomicrmw
3302                                                              with memory
3303                                                              ordering of seq_cst
3304                                                              and with equal or
3305                                                              wider sync scope.
3306                                                              (Note that seq_cst
3307                                                              fences have their
3308                                                              own s_waitcnt
3309                                                              vmcnt(0) and so do
3310                                                              not need to be
3311                                                              considered.)
3312                                                            - Ensures any
3313                                                              preceding
3314                                                              sequential
3315                                                              consistent global
3316                                                              memory instructions
3317                                                              have completed
3318                                                              before executing
3319                                                              this sequentially
3320                                                              consistent
3321                                                              instruction. This
3322                                                              prevents reordering
3323                                                              a seq_cst store
3324                                                              followed by a
3325                                                              seq_cst load (Note
3326                                                              that seq_cst is
3327                                                              stronger than
3328                                                              acquire/release as
3329                                                              the reordering of
3330                                                              load acquire
3331                                                              followed by a store
3332                                                              release is
3333                                                              prevented by the
3334                                                              waitcnt vmcnt(0) of
3335                                                              the release, but
3336                                                              there is nothing
3337                                                              preventing a store
3338                                                              release followed by
3339                                                              load acquire from
3340                                                              competing out of
3341                                                              order.)
3342
3343                                                          2. *Following
3344                                                             instructions same as
3345                                                             corresponding load
3346                                                             atomic acquire*.
3347
3348      store atomic seq_cst      - singlethread - global   *Same as corresponding
3349                                - wavefront    - local    store atomic release*.
3350                                - workgroup    - generic
3351      store atomic seq_cst      - agent        - global   *Same as corresponding
3352                                - system       - generic  store atomic release*.
3353      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3354                                - wavefront    - local    atomicrmw acq_rel*.
3355                                - workgroup    - generic
3356      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3357                                - system       - generic  atomicrmw acq_rel*.
3358      fence        seq_cst      - singlethread *none*     *Same as corresponding
3359                                - wavefront               fence acq_rel*.
3360                                - workgroup
3361                                - agent
3362                                - system
3363      ============ ============ ============== ========== =======================
3364
3365 The memory order also adds the single thread optimization constrains defined in
3366 table
3367 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3368
3369   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3370      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3371
3372      ============ ==============================================================
3373      LLVM Memory  Optimization Constraints
3374      Ordering
3375      ============ ==============================================================
3376      unordered    *none*
3377      monotonic    *none*
3378      acquire      - If a load atomic/atomicrmw then no following load/load
3379                     atomic/store/ store atomic/atomicrmw/fence instruction can
3380                     be moved before the acquire.
3381                   - If a fence then same as load atomic, plus no preceding
3382                     associated fence-paired-atomic can be moved after the fence.
3383      release      - If a store atomic/atomicrmw then no preceding load/load
3384                     atomic/store/ store atomic/atomicrmw/fence instruction can
3385                     be moved after the release.
3386                   - If a fence then same as store atomic, plus no following
3387                     associated fence-paired-atomic can be moved before the
3388                     fence.
3389      acq_rel      Same constraints as both acquire and release.
3390      seq_cst      - If a load atomic then same constraints as acquire, plus no
3391                     preceding sequentially consistent load atomic/store
3392                     atomic/atomicrmw/fence instruction can be moved after the
3393                     seq_cst.
3394                   - If a store atomic then the same constraints as release, plus
3395                     no following sequentially consistent load atomic/store
3396                     atomic/atomicrmw/fence instruction can be moved before the
3397                     seq_cst.
3398                   - If an atomicrmw/fence then same constraints as acq_rel.
3399      ============ ==============================================================
3400
3401 Trap Handler ABI
3402 ~~~~~~~~~~~~~~~~
3403
3404 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3405 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3406 the ``s_trap`` instruction with the following usage:
3407
3408   .. table:: AMDGPU Trap Handler for AMDHSA OS
3409      :name: amdgpu-trap-handler-for-amdhsa-os-table
3410
3411      =================== =============== =============== =======================
3412      Usage               Code Sequence   Trap Handler    Description
3413                                          Inputs
3414      =================== =============== =============== =======================
3415      reserved            ``s_trap 0x00``                 Reserved by hardware.
3416      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3417                                            ``queue_ptr`` ``debugtrap``
3418                                          ``VGPR0``:      intrinsic (not
3419                                            ``arg``       implemented).
3420      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3421                                            ``queue_ptr`` terminated and its
3422                                                          associated queue put
3423                                                          into the error state.
3424      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3425                                            ``queue_ptr`` installed handled
3426                                                          same as ``llvm.trap``.
3427      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3428                                                          breakpoints.
3429      debugger            ``s_trap 0x08``                 Reserved for debugger.
3430      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3431      debugger            ``s_trap 0xff``                 Reserved for debugger.
3432      =================== =============== =============== =======================
3433
3434 Unspecified OS
3435 --------------
3436
3437 This section provides code conventions used when the target triple OS is
3438 empty (see :ref:`amdgpu-target-triples`).
3439
3440 Trap Handler ABI
3441 ~~~~~~~~~~~~~~~~
3442
3443 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3444 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3445 instructions are handled as follows:
3446
3447   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3448      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3449
3450      =============== =============== ===========================================
3451      Usage           Code Sequence   Description
3452      =============== =============== ===========================================
3453      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3454      llvm.debugtrap  *none*          Compiler warning given that there is no
3455                                      trap handler installed.
3456      =============== =============== ===========================================
3457
3458 Source Languages
3459 ================
3460
3461 .. _amdgpu-opencl:
3462
3463 OpenCL
3464 ------
3465
3466 When generating code for the OpenCL language the target triple environment
3467 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3468
3469 When the language is OpenCL the following differences occur:
3470
3471 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3472 2. The AMDGPU backend adds additional arguments to the kernel.
3473 3. Additional metadata is generated
3474    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3475
3476 .. TODO
3477    Specify what affect this has. Hidden arguments added. Additional metadata
3478    generated.
3479
3480 .. _amdgpu-hcc:
3481
3482 HCC
3483 ---
3484
3485 When generating code for the OpenCL language the target triple environment
3486 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3487
3488 When the language is OpenCL the following differences occur:
3489
3490 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3491
3492 .. TODO
3493    Specify what affect this has.
3494
3495 Assembler
3496 ---------
3497
3498 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3499 It supports AMDGCN GFX6-GFX8.
3500
3501 This section describes general syntax for instructions and operands. For more
3502 information about instructions, their semantics and supported combinations of
3503 operands, refer to one of instruction set architecture manuals
3504 [AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3505 [AMD-Vega]_.
3506
3507 An instruction has the following syntax (register operands are normally
3508 comma-separated while extra operands are space-separated):
3509
3510 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3511
3512 Operands
3513 ~~~~~~~~
3514
3515 The following syntax for register operands is supported:
3516
3517 * SGPR registers: s0, ... or s[0], ...
3518 * VGPR registers: v0, ... or v[0], ...
3519 * TTMP registers: ttmp0, ... or ttmp[0], ...
3520 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3521 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3522 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3523 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3524 * Register index expressions: v[2*2], s[1-1:2-1]
3525 * 'off' indicates that an operand is not enabled
3526
3527 The following extra operands are supported:
3528
3529 * offset, offset0, offset1
3530 * idxen, offen bits
3531 * glc, slc, tfe bits
3532 * waitcnt: integer or combination of counter values
3533 * VOP3 modifiers:
3534
3535   - abs (\| \|), neg (\-)
3536
3537 * DPP modifiers:
3538
3539   - row_shl, row_shr, row_ror, row_rol
3540   - row_mirror, row_half_mirror, row_bcast
3541   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3542   - row_mask, bank_mask, bound_ctrl
3543
3544 * SDWA modifiers:
3545
3546   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3547   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3548   - abs, neg, sext
3549
3550 Instruction Examples
3551 ~~~~~~~~~~~~~~~~~~~~
3552
3553 DS
3554 ~~
3555
3556 .. code-block:: nasm
3557
3558   ds_add_u32 v2, v4 offset:16
3559   ds_write_src2_b64 v2 offset0:4 offset1:8
3560   ds_cmpst_f32 v2, v4, v6
3561   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3562
3563
3564 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3565
3566 FLAT
3567 ++++
3568
3569 .. code-block:: nasm
3570
3571   flat_load_dword v1, v[3:4]
3572   flat_store_dwordx3 v[3:4], v[5:7]
3573   flat_atomic_swap v1, v[3:4], v5 glc
3574   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3575   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3576
3577 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3578
3579 MUBUF
3580 +++++
3581
3582 .. code-block:: nasm
3583
3584   buffer_load_dword v1, off, s[4:7], s1
3585   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3586   buffer_store_format_xy v[1:2], off, s[4:7], s1
3587   buffer_wbinvl1
3588   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3589
3590 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3591
3592 SMRD/SMEM
3593 +++++++++
3594
3595 .. code-block:: nasm
3596
3597   s_load_dword s1, s[2:3], 0xfc
3598   s_load_dwordx8 s[8:15], s[2:3], s4
3599   s_load_dwordx16 s[88:103], s[2:3], s4
3600   s_dcache_inv_vol
3601   s_memtime s[4:5]
3602
3603 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3604
3605 SOP1
3606 ++++
3607
3608 .. code-block:: nasm
3609
3610   s_mov_b32 s1, s2
3611   s_mov_b64 s[0:1], 0x80000000
3612   s_cmov_b32 s1, 200
3613   s_wqm_b64 s[2:3], s[4:5]
3614   s_bcnt0_i32_b64 s1, s[2:3]
3615   s_swappc_b64 s[2:3], s[4:5]
3616   s_cbranch_join s[4:5]
3617
3618 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3619
3620 SOP2
3621 ++++
3622
3623 .. code-block:: nasm
3624
3625   s_add_u32 s1, s2, s3
3626   s_and_b64 s[2:3], s[4:5], s[6:7]
3627   s_cselect_b32 s1, s2, s3
3628   s_andn2_b32 s2, s4, s6
3629   s_lshr_b64 s[2:3], s[4:5], s6
3630   s_ashr_i32 s2, s4, s6
3631   s_bfm_b64 s[2:3], s4, s6
3632   s_bfe_i64 s[2:3], s[4:5], s6
3633   s_cbranch_g_fork s[4:5], s[6:7]
3634
3635 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3636
3637 SOPC
3638 ++++
3639
3640 .. code-block:: nasm
3641
3642   s_cmp_eq_i32 s1, s2
3643   s_bitcmp1_b32 s1, s2
3644   s_bitcmp0_b64 s[2:3], s4
3645   s_setvskip s3, s5
3646
3647 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3648
3649 SOPP
3650 ++++
3651
3652 .. code-block:: nasm
3653
3654   s_barrier
3655   s_nop 2
3656   s_endpgm
3657   s_waitcnt 0 ; Wait for all counters to be 0
3658   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3659   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3660   s_sethalt 9
3661   s_sleep 10
3662   s_sendmsg 0x1
3663   s_sendmsg sendmsg(MSG_INTERRUPT)
3664   s_trap 1
3665
3666 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3667
3668 Unless otherwise mentioned, little verification is performed on the operands
3669 of SOPP Instructions, so it is up to the programmer to be familiar with the
3670 range or acceptable values.
3671
3672 VALU
3673 ++++
3674
3675 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3676 the assembler will automatically use optimal encoding based on its operands.
3677 To force specific encoding, one can add a suffix to the opcode of the instruction:
3678
3679 * _e32 for 32-bit VOP1/VOP2/VOPC
3680 * _e64 for 64-bit VOP3
3681 * _dpp for VOP_DPP
3682 * _sdwa for VOP_SDWA
3683
3684 VOP1/VOP2/VOP3/VOPC examples:
3685
3686 .. code-block:: nasm
3687
3688   v_mov_b32 v1, v2
3689   v_mov_b32_e32 v1, v2
3690   v_nop
3691   v_cvt_f64_i32_e32 v[1:2], v2
3692   v_floor_f32_e32 v1, v2
3693   v_bfrev_b32_e32 v1, v2
3694   v_add_f32_e32 v1, v2, v3
3695   v_mul_i32_i24_e64 v1, v2, 3
3696   v_mul_i32_i24_e32 v1, -3, v3
3697   v_mul_i32_i24_e32 v1, -100, v3
3698   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3699   v_max_f16_e32 v1, v2, v3
3700
3701 VOP_DPP examples:
3702
3703 .. code-block:: nasm
3704
3705   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3706   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3707   v_mov_b32 v0, v0 wave_shl:1
3708   v_mov_b32 v0, v0 row_mirror
3709   v_mov_b32 v0, v0 row_bcast:31
3710   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3711   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3712   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3713
3714 VOP_SDWA examples:
3715
3716 .. code-block:: nasm
3717
3718   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3719   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3720   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3721   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3722   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3723
3724 For full list of supported instructions, refer to "Vector ALU instructions".
3725
3726 HSA Code Object Directives
3727 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3728
3729 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3730 one can specify them with assembler directives.
3731
3732 .hsa_code_object_version major, minor
3733 +++++++++++++++++++++++++++++++++++++
3734
3735 *major* and *minor* are integers that specify the version of the HSA code
3736 object that will be generated by the assembler.
3737
3738 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3739 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3740
3741
3742 *major*, *minor*, and *stepping* are all integers that describe the instruction
3743 set architecture (ISA) version of the assembly program.
3744
3745 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
3746 "AMD" and *arch* should always be equal to "AMDGPU".
3747
3748 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3749 from the value of the -mcpu option that is passed to the assembler.
3750
3751 .amdgpu_hsa_kernel (name)
3752 +++++++++++++++++++++++++
3753
3754 This directives specifies that the symbol with given name is a kernel entry point
3755 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3756
3757 .amd_kernel_code_t
3758 ++++++++++++++++++
3759
3760 This directive marks the beginning of a list of key / value pairs that are used
3761 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3762 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
3763 any amd_kernel_code_t values that are unspecified a default value will be
3764 used.  The default value for all keys is 0, with the following exceptions:
3765
3766 - *kernel_code_version_major* defaults to 1.
3767 - *machine_kind* defaults to 1.
3768 - *machine_version_major*, *machine_version_minor*, and
3769   *machine_version_stepping* are derived from the value of the -mcpu option
3770   that is passed to the assembler.
3771 - *kernel_code_entry_byte_offset* defaults to 256.
3772 - *wavefront_size* defaults to 6.
3773 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3774   *private_segment_alignment* default to 4.  Note that alignments are specified
3775   as a power of two, so a value of **n** means an alignment of 2^ **n**.
3776
3777 The *.amd_kernel_code_t* directive must be placed immediately after the
3778 function label and before any instructions.
3779
3780 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3781 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3782
3783 Here is an example of a minimal amd_kernel_code_t specification:
3784
3785 .. code-block:: none
3786
3787    .hsa_code_object_version 1,0
3788    .hsa_code_object_isa
3789
3790    .hsatext
3791    .globl  hello_world
3792    .p2align 8
3793    .amdgpu_hsa_kernel hello_world
3794
3795    hello_world:
3796
3797       .amd_kernel_code_t
3798          enable_sgpr_kernarg_segment_ptr = 1
3799          is_ptr64 = 1
3800          compute_pgm_rsrc1_vgprs = 0
3801          compute_pgm_rsrc1_sgprs = 0
3802          compute_pgm_rsrc2_user_sgpr = 2
3803          kernarg_segment_byte_size = 8
3804          wavefront_sgpr_count = 2
3805          workitem_vgpr_count = 3
3806      .end_amd_kernel_code_t
3807
3808      s_load_dwordx2 s[0:1], s[0:1] 0x0
3809      v_mov_b32 v0, 3.14159
3810      s_waitcnt lgkmcnt(0)
3811      v_mov_b32 v1, s0
3812      v_mov_b32 v2, s1
3813      flat_store_dword v[1:2], v0
3814      s_endpgm
3815    .Lfunc_end0:
3816         .size   hello_world, .Lfunc_end0-hello_world
3817
3818 Additional Documentation
3819 ========================
3820
3821 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3822 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3823 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3824 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3825 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3826 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3827 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3828 .. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
3829 .. [AMD-OpenCL_Programming-Guide]  `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3830 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3831 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3832 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3833 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3834 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3835 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3836 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3837 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3838 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3839 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__