OSDN Git Service

Revert "AMDGPU: Add 32-bit constant address space"
[android-x86/external-llvm.git] / docs / AMDGPUUsage.rst
1 =============================
2 User Guide for AMDGPU Backend
3 =============================
4
5 .. contents::
6    :local:
7
8 Introduction
9 ============
10
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
14
15 LLVM
16 ====
17
18 .. _amdgpu-target-triples:
19
20 Target Triples
21 --------------
22
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
25
26   .. table:: AMDGPU Architectures
27      :name: amdgpu-architecture-table
28
29      ============ ==============================================================
30      Architecture Description
31      ============ ==============================================================
32      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34      ============ ==============================================================
35
36   .. table:: AMDGPU Vendors
37      :name: amdgpu-vendor-table
38
39      ============ ==============================================================
40      Vendor       Description
41      ============ ==============================================================
42      ``amd``      Can be used for all AMD GPU usage.
43      ``mesa3d``   Can be used if the OS is ``mesa3d``.
44      ============ ==============================================================
45
46   .. table:: AMDGPU Operating Systems
47      :name: amdgpu-os-table
48
49      ============== ============================================================
50      OS             Description
51      ============== ============================================================
52      *<empty>*      Defaults to the *unknown* OS.
53      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
54                     such as AMD's ROCm [AMD-ROCm]_.
55      ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
56                     runtime.
57      ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
58                     runtime.
59      ============== ============================================================
60
61   .. table:: AMDGPU Environments
62      :name: amdgpu-environment-table
63
64      ============ ==============================================================
65      Environment  Description
66      ============ ==============================================================
67      *<empty>*    Defaults to ``opencl``.
68      ``opencl``   OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69      ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70                   used (see :ref:`amdgpu-address-spaces`).
71      ``amdgiz``   Same as ``opencl`` except a different address space mapping is
72                   used (see :ref:`amdgpu-address-spaces`).
73      ``hcc``      AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74      ============ ==============================================================
75
76 .. _amdgpu-processors:
77
78 Processors
79 ----------
80
81 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82 names from both the *Processor* and *Alternative Processor* can be used.
83
84   .. table:: AMDGPU Processors
85      :name: amdgpu-processor-table
86
87      =========== =============== ============ ===== ========= ======= ==================
88      Processor   Alternative     Target       dGPU/ Target    ROCm    Example
89                  Processor       Triple       APU   Features  Support Products
90                                  Architecture       Supported
91                                                     [Default]
92      =========== =============== ============ ===== ========= ======= ==================
93      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
94      -----------------------------------------------------------------------------------
95      ``r600``                    ``r600``     dGPU
96      ``r630``                    ``r600``     dGPU
97      ``rs880``                   ``r600``     dGPU
98      ``rv670``                   ``r600``     dGPU
99      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
100      -----------------------------------------------------------------------------------
101      ``rv710``                   ``r600``     dGPU
102      ``rv730``                   ``r600``     dGPU
103      ``rv770``                   ``r600``     dGPU
104      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
105      -----------------------------------------------------------------------------------
106      ``cedar``                   ``r600``     dGPU
107      ``redwood``                 ``r600``     dGPU
108      ``sumo``                    ``r600``     dGPU
109      ``juniper``                 ``r600``     dGPU
110      ``cypress``                 ``r600``     dGPU
111      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
112      -----------------------------------------------------------------------------------
113      ``barts``                   ``r600``     dGPU
114      ``turks``                   ``r600``     dGPU
115      ``caicos``                  ``r600``     dGPU
116      ``cayman``                  ``r600``     dGPU
117      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
118      -----------------------------------------------------------------------------------
119      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
120      ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU
121                  - ``verde``
122                  - ``oland``
123                  - ``hainan``
124      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
125      -----------------------------------------------------------------------------------
126      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
127                                                                       - A6 Pro-7050B
128                                                                       - A8-7100
129                                                                       - A8 Pro-7150B
130                                                                       - A10-7300
131                                                                       - A10 Pro-7350B
132                                                                       - FX-7500
133                                                                       - A8-7200P
134                                                                       - A10-7400P
135                                                                       - FX-7600P
136      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
137                                                                       - FirePro W9100
138                                                                       - FirePro S9150
139                                                                       - FirePro S9170
140      ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
141                                                                       - Radeon R9 290x
142                                                                       - Radeon R390
143                                                                       - Radeon R390x
144      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
145                  - ``mullins``                                        - E1-2200
146                                                                       - E1-2500
147                                                                       - E2-3000
148                                                                       - E2-3800
149                                                                       - A4-5000
150                                                                       - A4-5100
151                                                                       - A6-5200
152                                                                       - A4 Pro-3340B
153      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
154                                                                       - Radeon HD 8770
155                                                                       - R7 260
156                                                                       - R7 260X
157      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
158      -----------------------------------------------------------------------------------
159      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
160                                                       [on]            - Pro A6-8500B
161                                                                       - A8-8600P
162                                                                       - Pro A8-8600B
163                                                                       - FX-8800P
164                                                                       - Pro A12-8800B
165      \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
166                                                       [on]            - Pro A10-8700B
167                                                                       - A10-8780P
168      \                           ``amdgcn``   APU   - xnack           - A10-9600P
169                                                       [on]            - A10-9630P
170                                                                       - A12-9700P
171                                                                       - A12-9730P
172                                                                       - FX-9800P
173                                                                       - FX-9830P
174      \                           ``amdgcn``   APU   - xnack           - E2-9010
175                                                       [on]            - A6-9210
176                                                                       - A9-9410
177      ``gfx802``  - ``tonga``     ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
178                  - ``iceland``                        [off]           - FirePro S7100
179                                                                       - FirePro W7100
180                                                                       - Radeon R285
181                                                                       - Radeon R9 380
182                                                                       - Radeon R9 385
183                                                                       - Mobile FirePro
184                                                                         M7170
185      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
186                                                       [off]           - Radeon R9 Fury
187                                                                       - Radeon R9 FuryX
188                                                                       - Radeon Pro Duo
189                                                                       - FirePro S9300x2
190                                                                       - Radeon Instinct MI8
191      \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
192                                                       [off]           - Radeon RX 480
193                                                                       - Radeon Instinct MI6
194      \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
195                                                       [off]
196      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
197                                                       [on]
198      **GCN GFX9** [AMD-GCN-GFX9]_
199      -----------------------------------------------------------------------------------
200      ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
201                                                       [off]             Frontier Edition
202                                                                       - Radeon RX Vega 56
203                                                                       - Radeon RX Vega 64
204                                                                       - Radeon RX Vega 64
205                                                                         Liquid
206                                                                       - Radeon Instinct MI25
207      ``gfx902``                  ``amdgcn``   APU   - xnack           *TBA*
208                                                       [on]
209                                                                       .. TODO
210                                                                          Add product
211                                                                          names.
212      =========== =============== ============ ===== ========= ======= ==================
213
214 .. _amdgpu-target-features:
215
216 Target Features
217 ---------------
218
219 Target features control how code is generated to support certain
220 processor specific features. Not all target features are supported by
221 all processors. The runtime must ensure that the features supported by
222 the device used to execute the code match the features enabled when
223 generating the code. A mismatch of features may result in incorrect
224 execution, or a reduction in performance.
225
226 The target features supported by each processor, and the default value
227 used if not specified explicitly, is listed in
228 :ref:`amdgpu-processor-table`.
229
230 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
231 target features.
232
233 For example:
234
235 ``-mxnack``
236   Enable the ``xnack`` feature.
237 ``-mno-xnack``
238   Disable the ``xnack`` feature.
239
240   .. table:: AMDGPU Target Features
241      :name: amdgpu-target-feature-table
242
243      ============== ==================================================
244      Target Feature Description
245      ============== ==================================================
246      -m[no-]xnack   Enable/disable generating code that has
247                     memory clauses that are compatible with
248                     having XNACK replay enabled.
249
250                     This is used for demand paging and page
251                     migration. If XNACK replay is enabled in
252                     the device, then if a page fault occurs
253                     the code may execute incorrectly if the
254                     ``xnack`` feature is not enabled. Executing
255                     code that has the feature enabled on a
256                     device that does not have XNACK replay
257                     enabled will execute correctly, but may
258                     be less performant than code with the
259                     feature disabled.
260      ============== ==================================================
261
262 .. _amdgpu-address-spaces:
263
264 Address Spaces
265 --------------
266
267 The AMDGPU backend uses the following address space mappings.
268
269 The memory space names used in the table, aside from the region memory space, is
270 from the OpenCL standard.
271
272 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
273
274   .. table:: Address Space Mapping
275      :name: amdgpu-address-space-mapping-table
276
277      ================== ================= ================= ================= =================
278      LLVM Address Space Memory Space
279      ------------------ -----------------------------------------------------------------------
280      \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
281      ================== ================= ================= ================= =================
282      0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
283      1                  Global            Global            Global            Global
284      2                  Constant          Constant          Constant          Region (GDS)
285      3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286      4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
287      5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
288      ================== ================= ================= ================= =================
289
290 Current Default
291   This is the current default address space mapping used for all languages
292   except hcc. This will shortly be deprecated.
293
294 amdgiz/amdgizcl
295   This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296   is specified as the target triple environment value.
297
298 hcc
299   This is the current address space mapping used when ``hcc`` is specified as
300   the target triple environment value.This will shortly be deprecated.
301
302 Future Default
303   This will shortly be the only address space mapping for all languages using
304   AMDGPU backend.
305
306 .. _amdgpu-memory-scopes:
307
308 Memory Scopes
309 -------------
310
311 This section provides LLVM memory synchronization scopes supported by the AMDGPU
312 backend memory model when the target triple OS is ``amdhsa`` (see
313 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
314
315 The memory model supported is based on the HSA memory model [HSA]_ which is
316 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317 relation is transitive over the synchonizes-with relation independent of scope,
318 and synchonizes-with allows the memory scope instances to be inclusive (see
319 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
320
321 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322 inclusion and requires the memory scopes to exactly match. However, this
323 is conservatively correct for OpenCL.
324
325   .. table:: AMDHSA LLVM Sync Scopes
326      :name: amdgpu-amdhsa-llvm-sync-scopes-table
327
328      ================ ==========================================================
329      LLVM Sync Scope  Description
330      ================ ==========================================================
331      *none*           The default: ``system``.
332
333                       Synchronizes with, and participates in modification and
334                       seq_cst total orderings with, other operations (except
335                       image operations) for all address spaces (except private,
336                       or generic that accesses private) provided the other
337                       operation's sync scope is:
338
339                       - ``system``.
340                       - ``agent`` and executed by a thread on the same agent.
341                       - ``workgroup`` and executed by a thread in the same
342                         workgroup.
343                       - ``wavefront`` and executed by a thread in the same
344                         wavefront.
345
346      ``agent``        Synchronizes with, and participates in modification and
347                       seq_cst total orderings with, other operations (except
348                       image operations) for all address spaces (except private,
349                       or generic that accesses private) provided the other
350                       operation's sync scope is:
351
352                       - ``system`` or ``agent`` and executed by a thread on the
353                         same agent.
354                       - ``workgroup`` and executed by a thread in the same
355                         workgroup.
356                       - ``wavefront`` and executed by a thread in the same
357                         wavefront.
358
359      ``workgroup``    Synchronizes with, and participates in modification and
360                       seq_cst total orderings with, other operations (except
361                       image operations) for all address spaces (except private,
362                       or generic that accesses private) provided the other
363                       operation's sync scope is:
364
365                       - ``system``, ``agent`` or ``workgroup`` and executed by a
366                         thread in the same workgroup.
367                       - ``wavefront`` and executed by a thread in the same
368                         wavefront.
369
370      ``wavefront``    Synchronizes with, and participates in modification and
371                       seq_cst total orderings with, other operations (except
372                       image operations) for all address spaces (except private,
373                       or generic that accesses private) provided the other
374                       operation's sync scope is:
375
376                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377                         and executed by a thread in the same wavefront.
378
379      ``singlethread`` Only synchronizes with, and participates in modification
380                       and seq_cst total orderings with, other operations (except
381                       image operations) running in the same thread for all
382                       address spaces (for example, in signal handlers).
383      ================ ==========================================================
384
385 AMDGPU Intrinsics
386 -----------------
387
388 The AMDGPU backend implements the following intrinsics.
389
390 *This section is WIP.*
391
392 .. TODO
393    List AMDGPU intrinsics
394
395 Code Object
396 ===========
397
398 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399 can be linked by ``lld`` to produce a standard ELF shared code object which can
400 be loaded and executed on an AMDGPU target.
401
402 Header
403 ------
404
405 The AMDGPU backend uses the following ELF header:
406
407   .. table:: AMDGPU ELF Header
408      :name: amdgpu-elf-header-table
409
410      ========================== ===============================
411      Field                      Value
412      ========================== ===============================
413      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
414      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
415      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
416                                 - ``ELFOSABI_AMDGPU_HSA``
417                                 - ``ELFOSABI_AMDGPU_PAL``
418                                 - ``ELFOSABI_AMDGPU_MESA3D``
419      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420                                 - ``ELFABIVERSION_AMDGPU_PAL``
421                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
422      ``e_type``                 - ``ET_REL``
423                                 - ``ET_DYN``
424      ``e_machine``              ``EM_AMDGPU``
425      ``e_entry``                0
426      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
427      ========================== ===============================
428
429 ..
430
431   .. table:: AMDGPU ELF Header Enumeration Values
432      :name: amdgpu-elf-header-enumeration-values-table
433
434      =============================== =====
435      Name                            Value
436      =============================== =====
437      ``EM_AMDGPU``                   224
438      ``ELFOSABI_NONE``               0
439      ``ELFOSABI_AMDGPU_HSA``         64
440      ``ELFOSABI_AMDGPU_PAL``         65
441      ``ELFOSABI_AMDGPU_MESA3D``      66
442      ``ELFABIVERSION_AMDGPU_HSA``    1
443      ``ELFABIVERSION_AMDGPU_PAL``    0
444      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445      =============================== =====
446
447 ``e_ident[EI_CLASS]``
448   The ELF class is:
449
450   * ``ELFCLASS32`` for ``r600`` architecture.
451
452   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
453     bit applications.
454
455 ``e_ident[EI_DATA]``
456   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
457
458 ``e_ident[EI_OSABI]``
459   One of the following AMD GPU architecture specific OS ABIs
460   (see :ref:`amdgpu-os-table`):
461
462   * ``ELFOSABI_NONE`` for *unknown* OS.
463
464   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
465
466   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
467
468   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
469
470 ``e_ident[EI_ABIVERSION]``
471   The ABI version of the AMD GPU architecture specific OS ABI to which the code
472   object conforms:
473
474   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
475     runtime ABI.
476
477   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
478     runtime ABI.
479
480   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
481     3D runtime ABI.
482
483 ``e_type``
484   Can be one of the following values:
485
486
487   ``ET_REL``
488     The type produced by the AMD GPU backend compiler as it is relocatable code
489     object.
490
491   ``ET_DYN``
492     The type produced by the linker as it is a shared code object.
493
494   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
495
496 ``e_machine``
497   The value ``EM_AMDGPU`` is used for the machine for all processors supported
498   by the ``r600`` and ``amdgcn`` architectures (see
499   :ref:`amdgpu-processor-table`). The specific processor is specified in the
500   ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501   :ref:`amdgpu-elf-header-e_flags-table`).
502
503 ``e_entry``
504   The entry point is 0 as the entry points for individual kernels must be
505   selected in order to invoke them through AQL packets.
506
507 ``e_flags``
508   The AMDGPU backend uses the following ELF header flags:
509
510   .. table:: AMDGPU ELF Header ``e_flags``
511      :name: amdgpu-elf-header-e_flags-table
512
513      ================================= ========== =============================
514      Name                              Value      Description
515      ================================= ========== =============================
516      **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
517      -------------------------------------------- -----------------------------
518      ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
519                                                   mask for
520                                                   ``EF_AMDGPU_MACH_xxx`` values
521                                                   defined in
522                                                   :ref:`amdgpu-ef-amdgpu-mach-table`.
523      ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
524                                                   target feature is
525                                                   enabled for all code
526                                                   contained in the code object.
527                                                   See
528                                                   :ref:`amdgpu-target-features`.
529      ================================= ========== =============================
530
531   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532      :name: amdgpu-ef-amdgpu-mach-table
533
534      ================================= ========== =============================
535      Name                              Value      Description (see
536                                                   :ref:`amdgpu-processor-table`)
537      ================================= ========== =============================
538      ``EF_AMDGPU_MACH_NONE``           0          *not specified*
539      ``EF_AMDGPU_MACH_R600_R600``      1          ``r600``
540      ``EF_AMDGPU_MACH_R600_R630``      2          ``r630``
541      ``EF_AMDGPU_MACH_R600_RS880``     3          ``rs880``
542      ``EF_AMDGPU_MACH_R600_RV670``     4          ``rv670``
543      ``EF_AMDGPU_MACH_R600_RV710``     5          ``rv710``
544      ``EF_AMDGPU_MACH_R600_RV730``     6          ``rv730``
545      ``EF_AMDGPU_MACH_R600_RV770``     7          ``rv770``
546      ``EF_AMDGPU_MACH_R600_CEDAR``     8          ``cedar``
547      ``EF_AMDGPU_MACH_R600_REDWOOD``   9          ``redwood``
548      ``EF_AMDGPU_MACH_R600_SUMO``      10         ``sumo``
549      ``EF_AMDGPU_MACH_R600_JUNIPER``   11         ``juniper``
550      ``EF_AMDGPU_MACH_R600_CYPRESS``   12         ``cypress``
551      ``EF_AMDGPU_MACH_R600_BARTS``     13         ``barts``
552      ``EF_AMDGPU_MACH_R600_TURKS``     14         ``turks``
553      ``EF_AMDGPU_MACH_R600_CAICOS``    15         ``caicos``
554      ``EF_AMDGPU_MACH_R600_CAYMAN``    16         ``cayman``
555      *reserved*                        17-31      Reserved for ``r600``
556                                                   architecture processors.
557      ``EF_AMDGPU_MACH_AMDGCN_GFX600``  32         ``gfx600``
558      ``EF_AMDGPU_MACH_AMDGCN_GFX601``  33         ``gfx601``
559      ``EF_AMDGPU_MACH_AMDGCN_GFX700``  34         ``gfx700``
560      ``EF_AMDGPU_MACH_AMDGCN_GFX701``  35         ``gfx701``
561      ``EF_AMDGPU_MACH_AMDGCN_GFX702``  36         ``gfx702``
562      ``EF_AMDGPU_MACH_AMDGCN_GFX703``  37         ``gfx703``
563      ``EF_AMDGPU_MACH_AMDGCN_GFX704``  38         ``gfx704``
564      *reserved*                        39         Reserved.
565      ``EF_AMDGPU_MACH_AMDGCN_GFX801``  40         ``gfx801``
566      ``EF_AMDGPU_MACH_AMDGCN_GFX802``  41         ``gfx802``
567      ``EF_AMDGPU_MACH_AMDGCN_GFX803``  42         ``gfx803``
568      ``EF_AMDGPU_MACH_AMDGCN_GFX810``  43         ``gfx810``
569      ``EF_AMDGPU_MACH_AMDGCN_GFX900``  44         ``gfx900``
570      ``EF_AMDGPU_MACH_AMDGCN_GFX902``  45         ``gfx902``
571      ================================= ========== =============================
572
573 Sections
574 --------
575
576 An AMDGPU target ELF code object has the standard ELF sections which include:
577
578   .. table:: AMDGPU ELF Sections
579      :name: amdgpu-elf-sections-table
580
581      ================== ================ =================================
582      Name               Type             Attributes
583      ================== ================ =================================
584      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
585      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
586      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
587      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
588      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
589      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
590      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
591      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
592      ``.note``          ``SHT_NOTE``     *none*
593      ``.rela``\ *name*  ``SHT_RELA``     *none*
594      ``.rela.dyn``      ``SHT_RELA``     *none*
595      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
596      ``.shstrtab``      ``SHT_STRTAB``   *none*
597      ``.strtab``        ``SHT_STRTAB``   *none*
598      ``.symtab``        ``SHT_SYMTAB``   *none*
599      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
600      ================== ================ =================================
601
602 These sections have their standard meanings (see [ELF]_) and are only generated
603 if needed.
604
605 ``.debug``\ *\**
606   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
607   DWARF produced by the AMDGPU backend.
608
609 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
610   The standard sections used by a dynamic loader.
611
612 ``.note``
613   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
614   backend.
615
616 ``.rela``\ *name*, ``.rela.dyn``
617   For relocatable code objects, *name* is the name of the section that the
618   relocation records apply. For example, ``.rela.text`` is the section name for
619   relocation records associated with the ``.text`` section.
620
621   For linked shared code objects, ``.rela.dyn`` contains all the relocation
622   records from each of the relocatable code object's ``.rela``\ *name* sections.
623
624   See :ref:`amdgpu-relocation-records` for the relocation records supported by
625   the AMDGPU backend.
626
627 ``.text``
628   The executable machine code for the kernels and functions they call. Generated
629   as position independent code. See :ref:`amdgpu-code-conventions` for
630   information on conventions used in the isa generation.
631
632 .. _amdgpu-note-records:
633
634 Note Records
635 ------------
636
637 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
638 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
639 aligned. In addition, minimal zero byte padding must be generated to ensure the
640 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
641 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
642
643 The AMDGPU backend code object uses the following ELF note records in the
644 ``.note`` section. The *Description* column specifies the layout of the note
645 record's ``desc`` field. All fields are consecutive bytes. Note records with
646 variable size strings have a corresponding ``*_size`` field that specifies the
647 number of bytes, including the terminating null character, in the string. The
648 string(s) come immediately after the preceding fields.
649
650 Additional note records can be present.
651
652   .. table:: AMDGPU ELF Note Records
653      :name: amdgpu-elf-note-records-table
654
655      ===== ============================== ======================================
656      Name  Type                           Description
657      ===== ============================== ======================================
658      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
659      ===== ============================== ======================================
660
661 ..
662
663   .. table:: AMDGPU ELF Note Record Enumeration Values
664      :name: amdgpu-elf-note-record-enumeration-values-table
665
666      ============================== =====
667      Name                           Value
668      ============================== =====
669      *reserved*                       0-9
670      ``NT_AMD_AMDGPU_HSA_METADATA``    10
671      *reserved*                        11
672      ============================== =====
673
674 ``NT_AMD_AMDGPU_HSA_METADATA``
675   Specifies extensible metadata associated with the code objects executed on HSA
676   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
677   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
678   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
679   object metadata string.
680
681 .. _amdgpu-symbols:
682
683 Symbols
684 -------
685
686 Symbols include the following:
687
688   .. table:: AMDGPU ELF Symbols
689      :name: amdgpu-elf-symbols-table
690
691      ===================== ============== ============= ==================
692      Name                  Type           Section       Description
693      ===================== ============== ============= ==================
694      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
695                                           - ``.rodata``
696                                           - ``.bss``
697      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
698      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
699      ===================== ============== ============= ==================
700
701 Global variable
702   Global variables both used and defined by the compilation unit.
703
704   If the symbol is defined in the compilation unit then it is allocated in the
705   appropriate section according to if it has initialized data or is readonly.
706
707   If the symbol is external then its section is ``STN_UNDEF`` and the loader
708   will resolve relocations using the definition provided by another code object
709   or explicitly defined by the runtime.
710
711   All global symbols, whether defined in the compilation unit or external, are
712   accessed by the machine code indirectly through a GOT table entry. This
713   allows them to be preemptable. The GOT table is only supported when the target
714   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
715
716   .. TODO
717      Add description of linked shared object symbols. Seems undefined symbols
718      are marked as STT_NOTYPE.
719
720 Kernel descriptor
721   Every HSA kernel has an associated kernel descriptor. It is the address of the
722   kernel descriptor that is used in the AQL dispatch packet used to invoke the
723   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
724   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
725
726 Kernel entry point
727   Every HSA kernel also has a symbol for its machine code entry point.
728
729 .. _amdgpu-relocation-records:
730
731 Relocation Records
732 ------------------
733
734 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
735 relocatable fields are:
736
737 ``word32``
738   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
739   alignment. These values use the same byte order as other word values in the
740   AMD GPU architecture.
741
742 ``word64``
743   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
744   alignment. These values use the same byte order as other word values in the
745   AMD GPU architecture.
746
747 Following notations are used for specifying relocation calculations:
748
749 **A**
750   Represents the addend used to compute the value of the relocatable field.
751
752 **G**
753   Represents the offset into the global offset table at which the relocation
754   entry's symbol will reside during execution.
755
756 **GOT**
757   Represents the address of the global offset table.
758
759 **P**
760   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
761   of the storage unit being relocated (computed using ``r_offset``).
762
763 **S**
764   Represents the value of the symbol whose index resides in the relocation
765   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
766
767 **B**
768   Represents the base address of a loaded executable or shared object which is
769   the difference between the ELF address and the actual load address. Relocations
770   using this are only valid in executable or shared objects.
771
772 The following relocation types are supported:
773
774   .. table:: AMDGPU ELF Relocation Records
775      :name: amdgpu-elf-relocation-records-table
776
777      ========================== ======= =====  ==========  ==============================
778      Relocation Type            Kind    Value  Field       Calculation
779      ========================== ======= =====  ==========  ==============================
780      ``R_AMDGPU_NONE``                  0      *none*      *none*
781      ``R_AMDGPU_ABS32_LO``      Dynamic 1      ``word32``  (S + A) & 0xFFFFFFFF
782      ``R_AMDGPU_ABS32_HI``      Dynamic 2      ``word32``  (S + A) >> 32
783      ``R_AMDGPU_ABS64``         Dynamic 3      ``word64``  S + A
784      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
785      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
786      ``R_AMDGPU_ABS32``         Static  6      ``word32``  S + A
787      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
788      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
789      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
790      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
791      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
792      *reserved*                         12
793      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
794      ========================== ======= =====  ==========  ==============================
795
796 .. _amdgpu-dwarf:
797
798 DWARF
799 -----
800
801 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
802 information that maps the code object executable code and data to the source
803 language constructs. It can be used by tools such as debuggers and profilers.
804
805 Address Space Mapping
806 ~~~~~~~~~~~~~~~~~~~~~
807
808 The following address space mapping is used:
809
810   .. table:: AMDGPU DWARF Address Space Mapping
811      :name: amdgpu-dwarf-address-space-mapping-table
812
813      =================== =================
814      DWARF Address Space Memory Space
815      =================== =================
816      1                   Private (Scratch)
817      2                   Local (group/LDS)
818      *omitted*           Global
819      *omitted*           Constant
820      *omitted*           Generic (Flat)
821      *not supported*     Region (GDS)
822      =================== =================
823
824 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
825 used in the table.
826
827 An ``address_class`` attribute is generated on pointer type DIEs to specify the
828 DWARF address space of the value of the pointer when it is in the *private* or
829 *local* address space. Otherwise the attribute is omitted.
830
831 An ``XDEREF`` operation is generated in location list expressions for variables
832 that are allocated in the *private* and *local* address space. Otherwise no
833 ``XDREF`` is omitted.
834
835 Register Mapping
836 ~~~~~~~~~~~~~~~~
837
838 *This section is WIP.*
839
840 .. TODO
841    Define DWARF register enumeration.
842
843    If want to present a wavefront state then should expose vector registers as
844    64 wide (rather than per work-item view that LLVM uses). Either as separate
845    registers, or a 64x4 byte single register. In either case use a new LANE op
846    (akin to XDREF) to select the current lane usage in a location
847    expression. This would also allow scalar register spilling to vector register
848    lanes to be expressed (currently no debug information is being generated for
849    spilling). If choose a wide single register approach then use LANE in
850    conjunction with PIECE operation to select the dword part of the register for
851    the current lane. If the separate register approach then use LANE to select
852    the register.
853
854 Source Text
855 ~~~~~~~~~~~
856
857 *This section is WIP.*
858
859 .. TODO
860    DWARF extension to include runtime generated source text.
861
862 .. _amdgpu-code-conventions:
863
864 Code Conventions
865 ================
866
867 This section provides code conventions used for each supported target triple OS
868 (see :ref:`amdgpu-target-triples`).
869
870 AMDHSA
871 ------
872
873 This section provides code conventions used when the target triple OS is
874 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
875
876 .. _amdgpu-amdhsa-hsa-code-object-metadata:
877
878 Code Object Metadata
879 ~~~~~~~~~~~~~~~~~~~~
880
881 The code object metadata specifies extensible metadata associated with the code
882 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
883 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
884 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
885 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
886 information necessary to support the ROCM kernel queries. For example, the
887 segment sizes needed in a dispatch packet. In addition, a high level language
888 runtime may require other information to be included. For example, the AMD
889 OpenCL runtime records kernel argument information.
890
891 The metadata is specified as a YAML formatted string (see [YAML]_ and
892 :doc:`YamlIO`).
893
894 .. TODO
895    Is the string null terminated? It probably should not if YAML allows it to
896    contain null characters, otherwise it should be.
897
898 The metadata is represented as a single YAML document comprised of the mapping
899 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
900 referenced tables.
901
902 For boolean values, the string values of ``false`` and ``true`` are used for
903 false and true respectively.
904
905 Additional information can be added to the mappings. To avoid conflicts, any
906 non-AMD key names should be prefixed by "*vendor-name*.".
907
908   .. table:: AMDHSA Code Object Metadata Mapping
909      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
910
911      ========== ============== ========= =======================================
912      String Key Value Type     Required? Description
913      ========== ============== ========= =======================================
914      "Version"  sequence of    Required  - The first integer is the major
915                 2 integers                 version. Currently 1.
916                                          - The second integer is the minor
917                                            version. Currently 0.
918      "Printf"   sequence of              Each string is encoded information
919                 strings                  about a printf function call. The
920                                          encoded information is organized as
921                                          fields separated by colon (':'):
922
923                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
924
925                                          where:
926
927                                          ``ID``
928                                            A 32 bit integer as a unique id for
929                                            each printf function call
930
931                                          ``N``
932                                            A 32 bit integer equal to the number
933                                            of arguments of printf function call
934                                            minus 1
935
936                                          ``S[i]`` (where i = 0, 1, ... , N-1)
937                                            32 bit integers for the size in bytes
938                                            of the i-th FormatString argument of
939                                            the printf function call
940
941                                          FormatString
942                                            The format string passed to the
943                                            printf function call.
944      "Kernels"  sequence of    Required  Sequence of the mappings for each
945                 mapping                  kernel in the code object. See
946                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
947                                          for the definition of the mapping.
948      ========== ============== ========= =======================================
949
950 ..
951
952   .. table:: AMDHSA Code Object Kernel Metadata Mapping
953      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
954
955      ================= ============== ========= ================================
956      String Key        Value Type     Required? Description
957      ================= ============== ========= ================================
958      "Name"            string         Required  Source name of the kernel.
959      "SymbolName"      string         Required  Name of the kernel
960                                                 descriptor ELF symbol.
961      "Language"        string                   Source language of the kernel.
962                                                 Values include:
963
964                                                 - "OpenCL C"
965                                                 - "OpenCL C++"
966                                                 - "HCC"
967                                                 - "OpenMP"
968
969      "LanguageVersion" sequence of              - The first integer is the major
970                        2 integers                 version.
971                                                 - The second integer is the
972                                                   minor version.
973      "Attrs"           mapping                  Mapping of kernel attributes.
974                                                 See
975                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
976                                                 for the mapping definition.
977      "Args"            sequence of              Sequence of mappings of the
978                        mapping                  kernel arguments. See
979                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
980                                                 for the definition of the mapping.
981      "CodeProps"       mapping                  Mapping of properties related to
982                                                 the kernel code. See
983                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
984                                                 for the mapping definition.
985      ================= ============== ========= ================================
986
987 ..
988
989   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
990      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
991
992      =================== ============== ========= ==============================
993      String Key          Value Type     Required? Description
994      =================== ============== ========= ==============================
995      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
996                          3 integers               must be >=1 and the dispatch
997                                                   work-group size X, Y, Z must
998                                                   correspond to the specified
999                                                   values. Defaults to 0, 0, 0.
1000
1001                                                   Corresponds to the OpenCL
1002                                                   ``reqd_work_group_size``
1003                                                   attribute.
1004      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1005                          3 integers               X, Y, Z is likely to be the
1006                                                   specified values.
1007
1008                                                   Corresponds to the OpenCL
1009                                                   ``work_group_size_hint``
1010                                                   attribute.
1011      "VecTypeHint"       string                   The name of a scalar or vector
1012                                                   type.
1013
1014                                                   Corresponds to the OpenCL
1015                                                   ``vec_type_hint`` attribute.
1016
1017      "RuntimeHandle"     string                   The external symbol name
1018                                                   associated with a kernel.
1019                                                   OpenCL runtime allocates a
1020                                                   global buffer for the symbol
1021                                                   and saves the kernel's address
1022                                                   to it, which is used for
1023                                                   device side enqueueing. Only
1024                                                   available for device side
1025                                                   enqueued kernels.
1026      =================== ============== ========= ==============================
1027
1028 ..
1029
1030   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1031      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1032
1033      ================= ============== ========= ================================
1034      String Key        Value Type     Required? Description
1035      ================= ============== ========= ================================
1036      "Name"            string                   Kernel argument name.
1037      "TypeName"        string                   Kernel argument type name.
1038      "Size"            integer        Required  Kernel argument size in bytes.
1039      "Align"           integer        Required  Kernel argument alignment in
1040                                                 bytes. Must be a power of two.
1041      "ValueKind"       string         Required  Kernel argument kind that
1042                                                 specifies how to set up the
1043                                                 corresponding argument.
1044                                                 Values include:
1045
1046                                                 "ByValue"
1047                                                   The argument is copied
1048                                                   directly into the kernarg.
1049
1050                                                 "GlobalBuffer"
1051                                                   A global address space pointer
1052                                                   to the buffer data is passed
1053                                                   in the kernarg.
1054
1055                                                 "DynamicSharedPointer"
1056                                                   A group address space pointer
1057                                                   to dynamically allocated LDS
1058                                                   is passed in the kernarg.
1059
1060                                                 "Sampler"
1061                                                   A global address space
1062                                                   pointer to a S# is passed in
1063                                                   the kernarg.
1064
1065                                                 "Image"
1066                                                   A global address space
1067                                                   pointer to a T# is passed in
1068                                                   the kernarg.
1069
1070                                                 "Pipe"
1071                                                   A global address space pointer
1072                                                   to an OpenCL pipe is passed in
1073                                                   the kernarg.
1074
1075                                                 "Queue"
1076                                                   A global address space pointer
1077                                                   to an OpenCL device enqueue
1078                                                   queue is passed in the
1079                                                   kernarg.
1080
1081                                                 "HiddenGlobalOffsetX"
1082                                                   The OpenCL grid dispatch
1083                                                   global offset for the X
1084                                                   dimension is passed in the
1085                                                   kernarg.
1086
1087                                                 "HiddenGlobalOffsetY"
1088                                                   The OpenCL grid dispatch
1089                                                   global offset for the Y
1090                                                   dimension is passed in the
1091                                                   kernarg.
1092
1093                                                 "HiddenGlobalOffsetZ"
1094                                                   The OpenCL grid dispatch
1095                                                   global offset for the Z
1096                                                   dimension is passed in the
1097                                                   kernarg.
1098
1099                                                 "HiddenNone"
1100                                                   An argument that is not used
1101                                                   by the kernel. Space needs to
1102                                                   be left for it, but it does
1103                                                   not need to be set up.
1104
1105                                                 "HiddenPrintfBuffer"
1106                                                   A global address space pointer
1107                                                   to the runtime printf buffer
1108                                                   is passed in kernarg.
1109
1110                                                 "HiddenDefaultQueue"
1111                                                   A global address space pointer
1112                                                   to the OpenCL device enqueue
1113                                                   queue that should be used by
1114                                                   the kernel by default is
1115                                                   passed in the kernarg.
1116
1117                                                 "HiddenCompletionAction"
1118                                                   A global address space pointer
1119                                                   to help link enqueued kernels into
1120                                                   the ancestor tree for determining
1121                                                   when the parent kernel has finished.
1122
1123      "ValueType"       string         Required  Kernel argument value type. Only
1124                                                 present if "ValueKind" is
1125                                                 "ByValue". For vector data
1126                                                 types, the value is for the
1127                                                 element type. Values include:
1128
1129                                                 - "Struct"
1130                                                 - "I8"
1131                                                 - "U8"
1132                                                 - "I16"
1133                                                 - "U16"
1134                                                 - "F16"
1135                                                 - "I32"
1136                                                 - "U32"
1137                                                 - "F32"
1138                                                 - "I64"
1139                                                 - "U64"
1140                                                 - "F64"
1141
1142                                                 .. TODO
1143                                                    How can it be determined if a
1144                                                    vector type, and what size
1145                                                    vector?
1146      "PointeeAlign"    integer                  Alignment in bytes of pointee
1147                                                 type for pointer type kernel
1148                                                 argument. Must be a power
1149                                                 of 2. Only present if
1150                                                 "ValueKind" is
1151                                                 "DynamicSharedPointer".
1152      "AddrSpaceQual"   string                   Kernel argument address space
1153                                                 qualifier. Only present if
1154                                                 "ValueKind" is "GlobalBuffer" or
1155                                                 "DynamicSharedPointer". Values
1156                                                 are:
1157
1158                                                 - "Private"
1159                                                 - "Global"
1160                                                 - "Constant"
1161                                                 - "Local"
1162                                                 - "Generic"
1163                                                 - "Region"
1164
1165                                                 .. TODO
1166                                                    Is GlobalBuffer only Global
1167                                                    or Constant? Is
1168                                                    DynamicSharedPointer always
1169                                                    Local? Can HCC allow Generic?
1170                                                    How can Private or Region
1171                                                    ever happen?
1172      "AccQual"         string                   Kernel argument access
1173                                                 qualifier. Only present if
1174                                                 "ValueKind" is "Image" or
1175                                                 "Pipe". Values
1176                                                 are:
1177
1178                                                 - "ReadOnly"
1179                                                 - "WriteOnly"
1180                                                 - "ReadWrite"
1181
1182                                                 .. TODO
1183                                                    Does this apply to
1184                                                    GlobalBuffer?
1185      "ActualAccQual"   string                   The actual memory accesses
1186                                                 performed by the kernel on the
1187                                                 kernel argument. Only present if
1188                                                 "ValueKind" is "GlobalBuffer",
1189                                                 "Image", or "Pipe". This may be
1190                                                 more restrictive than indicated
1191                                                 by "AccQual" to reflect what the
1192                                                 kernel actual does. If not
1193                                                 present then the runtime must
1194                                                 assume what is implied by
1195                                                 "AccQual" and "IsConst". Values
1196                                                 are:
1197
1198                                                 - "ReadOnly"
1199                                                 - "WriteOnly"
1200                                                 - "ReadWrite"
1201
1202      "IsConst"         boolean                  Indicates if the kernel argument
1203                                                 is const qualified. Only present
1204                                                 if "ValueKind" is
1205                                                 "GlobalBuffer".
1206
1207      "IsRestrict"      boolean                  Indicates if the kernel argument
1208                                                 is restrict qualified. Only
1209                                                 present if "ValueKind" is
1210                                                 "GlobalBuffer".
1211
1212      "IsVolatile"      boolean                  Indicates if the kernel argument
1213                                                 is volatile qualified. Only
1214                                                 present if "ValueKind" is
1215                                                 "GlobalBuffer".
1216
1217      "IsPipe"          boolean                  Indicates if the kernel argument
1218                                                 is pipe qualified. Only present
1219                                                 if "ValueKind" is "Pipe".
1220
1221                                                 .. TODO
1222                                                    Can GlobalBuffer be pipe
1223                                                    qualified?
1224      ================= ============== ========= ================================
1225
1226 ..
1227
1228   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1229      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1230
1231      ============================ ============== ========= =====================
1232      String Key                   Value Type     Required? Description
1233      ============================ ============== ========= =====================
1234      "KernargSegmentSize"         integer        Required  The size in bytes of
1235                                                            the kernarg segment
1236                                                            that holds the values
1237                                                            of the arguments to
1238                                                            the kernel.
1239      "GroupSegmentFixedSize"      integer        Required  The amount of group
1240                                                            segment memory
1241                                                            required by a
1242                                                            work-group in
1243                                                            bytes. This does not
1244                                                            include any
1245                                                            dynamically allocated
1246                                                            group segment memory
1247                                                            that may be added
1248                                                            when the kernel is
1249                                                            dispatched.
1250      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1251                                                            private address space
1252                                                            memory required for a
1253                                                            work-item in
1254                                                            bytes. If the kernel
1255                                                            uses a dynamic call
1256                                                            stack then additional
1257                                                            space must be added
1258                                                            to this value for the
1259                                                            call stack.
1260      "KernargSegmentAlign"        integer        Required  The maximum byte
1261                                                            alignment of
1262                                                            arguments in the
1263                                                            kernarg segment. Must
1264                                                            be a power of 2.
1265      "WavefrontSize"              integer        Required  Wavefront size. Must
1266                                                            be a power of 2.
1267      "NumSGPRs"                   integer        Required  Number of scalar
1268                                                            registers used by a
1269                                                            wavefront for
1270                                                            GFX6-GFX9. This
1271                                                            includes the special
1272                                                            SGPRs for VCC, Flat
1273                                                            Scratch (GFX7-GFX9)
1274                                                            and XNACK (for
1275                                                            GFX8-GFX9). It does
1276                                                            not include the 16
1277                                                            SGPR added if a trap
1278                                                            handler is
1279                                                            enabled. It is not
1280                                                            rounded up to the
1281                                                            allocation
1282                                                            granularity.
1283      "NumVGPRs"                   integer        Required  Number of vector
1284                                                            registers used by
1285                                                            each work-item for
1286                                                            GFX6-GFX9
1287      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1288                                                            work-group size
1289                                                            supported by the
1290                                                            kernel in work-items.
1291                                                            Must be >=1 and
1292                                                            consistent with
1293                                                            ReqdWorkGroupSize if
1294                                                            not 0, 0, 0.
1295      "NumSpilledSGPRs"            integer                  Number of stores from
1296                                                            a scalar register to
1297                                                            a register allocator
1298                                                            created spill
1299                                                            location.
1300      "NumSpilledVGPRs"            integer                  Number of stores from
1301                                                            a vector register to
1302                                                            a register allocator
1303                                                            created spill
1304                                                            location.
1305      ============================ ============== ========= =====================
1306
1307 ..
1308
1309 Kernel Dispatch
1310 ~~~~~~~~~~~~~~~
1311
1312 The HSA architected queuing language (AQL) defines a user space memory interface
1313 that can be used to control the dispatch of kernels, in an agent independent
1314 way. An agent can have zero or more AQL queues created for it using the ROCm
1315 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1316 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1317 mechanics and packet layouts.
1318
1319 The packet processor of a kernel agent is responsible for detecting and
1320 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1321 packet processor is implemented by the hardware command processor (CP),
1322 asynchronous dispatch controller (ADC) and shader processor input controller
1323 (SPI).
1324
1325 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1326 mode driver to initialize and register the AQL queue with CP.
1327
1328 To dispatch a kernel the following actions are performed. This can occur in the
1329 CPU host program, or from an HSA kernel executing on a GPU.
1330
1331 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1332    executed is obtained.
1333 2. A pointer to the kernel descriptor (see
1334    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1335    obtained. It must be for a kernel that is contained in a code object that that
1336    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1337    associated.
1338 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1339    for a memory region with the kernarg property for the kernel agent that will
1340    execute the kernel. It must be at least 16 byte aligned.
1341 4. Kernel argument values are assigned to the kernel argument memory
1342    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1343    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1344    memory in the same way constant memory is accessed. (Note that the HSA
1345    specification allows an implementation to copy the kernel argument contents to
1346    another location that is accessed by the kernel.)
1347 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1348    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1349    packet. The packet must be set up, and the final write must use an atomic
1350    store release to set the packet kind to ensure the packet contents are
1351    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1352    notify the kernel agent that the AQL queue has been updated. These rules, and
1353    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1354    System Architecture Specification* [HSA]_.
1355 6. A kernel dispatch packet includes information about the actual dispatch,
1356    such as grid and work-group size, together with information from the code
1357    object about the kernel, such as segment sizes. The ROCm runtime queries on
1358    the kernel symbol can be used to obtain the code object values which are
1359    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1360 7. CP executes micro-code and is responsible for detecting and setting up the
1361    GPU to execute the wavefronts of a kernel dispatch.
1362 8. CP ensures that when the a wavefront starts executing the kernel machine
1363    code, the scalar general purpose registers (SGPR) and vector general purpose
1364    registers (VGPR) are set up as required by the machine code. The required
1365    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1366    register state is defined in
1367    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1368 9. The prolog of the kernel machine code (see
1369    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1370    before continuing executing the machine code that corresponds to the kernel.
1371 10. When the kernel dispatch has completed execution, CP signals the completion
1372     signal specified in the kernel dispatch packet if not 0.
1373
1374 .. _amdgpu-amdhsa-memory-spaces:
1375
1376 Memory Spaces
1377 ~~~~~~~~~~~~~
1378
1379 The memory space properties are:
1380
1381   .. table:: AMDHSA Memory Spaces
1382      :name: amdgpu-amdhsa-memory-spaces-table
1383
1384      ================= =========== ======== ======= ==================
1385      Memory Space Name HSA Segment Hardware Address NULL Value
1386                        Name        Name     Size
1387      ================= =========== ======== ======= ==================
1388      Private           private     scratch  32      0x00000000
1389      Local             group       LDS      32      0xFFFFFFFF
1390      Global            global      global   64      0x0000000000000000
1391      Constant          constant    *same as 64      0x0000000000000000
1392                                    global*
1393      Generic           flat        flat     64      0x0000000000000000
1394      Region            N/A         GDS      32      *not implemented
1395                                                     for AMDHSA*
1396      ================= =========== ======== ======= ==================
1397
1398 The global and constant memory spaces both use global virtual addresses, which
1399 are the same virtual address space used by the CPU. However, some virtual
1400 addresses may only be accessible to the CPU, some only accessible by the GPU,
1401 and some by both.
1402
1403 Using the constant memory space indicates that the data will not change during
1404 the execution of the kernel. This allows scalar read instructions to be
1405 used. The vector and scalar L1 caches are invalidated of volatile data before
1406 each kernel dispatch execution to allow constant memory to change values between
1407 kernel dispatches.
1408
1409 The local memory space uses the hardware Local Data Store (LDS) which is
1410 automatically allocated when the hardware creates work-groups of wavefronts, and
1411 freed when all the wavefronts of a work-group have terminated. The data store
1412 (DS) instructions can be used to access it.
1413
1414 The private memory space uses the hardware scratch memory support. If the kernel
1415 uses scratch, then the hardware allocates memory that is accessed using
1416 wavefront lane dword (4 byte) interleaving. The mapping used from private
1417 address to physical address is:
1418
1419   ``wavefront-scratch-base +
1420   (private-address * wavefront-size * 4) +
1421   (wavefront-lane-id * 4)``
1422
1423 There are different ways that the wavefront scratch base address is determined
1424 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1425 memory can be accessed in an interleaved manner using buffer instruction with
1426 the scratch buffer descriptor and per wave scratch offset, by the scratch
1427 instructions, or by flat instructions. If each lane of a wavefront accesses the
1428 same private address, the interleaving results in adjacent dwords being accessed
1429 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1430 supported except by flat and scratch instructions in GFX9.
1431
1432 The generic address space uses the hardware flat address support available in
1433 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1434 local appertures), that are outside the range of addressible global memory, to
1435 map from a flat address to a private or local address.
1436
1437 FLAT instructions can take a flat address and access global, private (scratch)
1438 and group (LDS) memory depending in if the address is within one of the
1439 apperture ranges. Flat access to scratch requires hardware aperture setup and
1440 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1441 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1442 (see :ref:`amdgpu-amdhsa-m0`).
1443
1444 To convert between a segment address and a flat address the base address of the
1445 appertures address can be used. For GFX7-GFX8 these are available in the
1446 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1447 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1448 GFX9 the appature base addresses are directly available as inline constant
1449 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1450 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1451 which makes it easier to convert from flat to segment or segment to flat.
1452
1453 Image and Samplers
1454 ~~~~~~~~~~~~~~~~~~
1455
1456 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1457 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1458 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1459 enumeration values for the queries that are not trivially deducible from the S#
1460 representation.
1461
1462 HSA Signals
1463 ~~~~~~~~~~~
1464
1465 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1466 structure allocated in memory accessible from both the CPU and GPU. The
1467 structure is defined by the ROCm runtime and subject to change between releases
1468 (see [AMD-ROCm-github]_).
1469
1470 .. _amdgpu-amdhsa-hsa-aql-queue:
1471
1472 HSA AQL Queue
1473 ~~~~~~~~~~~~~
1474
1475 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1476 between releases (see [AMD-ROCm-github]_). For some processors it contains
1477 fields needed to implement certain language features such as the flat address
1478 aperture bases. It also contains fields used by CP such as managing the
1479 allocation of scratch memory.
1480
1481 .. _amdgpu-amdhsa-kernel-descriptor:
1482
1483 Kernel Descriptor
1484 ~~~~~~~~~~~~~~~~~
1485
1486 A kernel descriptor consists of the information needed by CP to initiate the
1487 execution of a kernel, including the entry point address of the machine code
1488 that implements the kernel.
1489
1490 Kernel Descriptor for GFX6-GFX9
1491 +++++++++++++++++++++++++++++++
1492
1493 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1494
1495   .. table:: Kernel Descriptor for GFX6-GFX9
1496      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1497
1498      ======= ======= =============================== ============================
1499      Bits    Size    Field Name                      Description
1500      ======= ======= =============================== ============================
1501      31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1502                                                      address space memory
1503                                                      required for a work-group
1504                                                      in bytes. This does not
1505                                                      include any dynamically
1506                                                      allocated local address
1507                                                      space memory that may be
1508                                                      added when the kernel is
1509                                                      dispatched.
1510      63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1511                                                      private address space
1512                                                      memory required for a
1513                                                      work-item in bytes. If
1514                                                      is_dynamic_callstack is 1
1515                                                      then additional space must
1516                                                      be added to this value for
1517                                                      the call stack.
1518      127:64  8 bytes                                 Reserved, must be 0.
1519      191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1520                                                      negative) from base
1521                                                      address of kernel
1522                                                      descriptor to kernel's
1523                                                      entry point instruction
1524                                                      which must be 256 byte
1525                                                      aligned.
1526      383:192 24                                      Reserved, must be 0.
1527              bytes
1528      415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1529                                                      program settings used by
1530                                                      CP to set up
1531                                                      ``COMPUTE_PGM_RSRC1``
1532                                                      configuration
1533                                                      register. See
1534                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1535      447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1536                                                      program settings used by
1537                                                      CP to set up
1538                                                      ``COMPUTE_PGM_RSRC2``
1539                                                      configuration
1540                                                      register. See
1541                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1542      448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1543                                                      SGPR user data registers
1544                                                      (see
1545                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1546
1547                                                      The total number of SGPR
1548                                                      user data registers
1549                                                      requested must not exceed
1550                                                      16 and match value in
1551                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1552                                                      Any requests beyond 16
1553                                                      will be ignored.
1554      449     1 bit   EnableSGPRDispatchPtr           *see above*
1555      450     1 bit   EnableSGPRQueuePtr              *see above*
1556      451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1557      452     1 bit   EnableSGPRDispatchID            *see above*
1558      453     1 bit   EnableSGPRFlatScratchInit       *see above*
1559      454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1560      455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1561                                                      should always be 0.
1562      456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1563                                                      should always be 0.
1564      457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1565                                                      should always be 0.
1566      463:458 6 bits                                  Reserved, must be 0.
1567      511:464 6                                       Reserved, must be 0.
1568              bytes
1569      512     **Total size 64 bytes.**
1570      ======= ====================================================================
1571
1572 ..
1573
1574   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1575      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1576
1577      ======= ======= =============================== ===========================================================================
1578      Bits    Size    Field Name                      Description
1579      ======= ======= =============================== ===========================================================================
1580      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1581                                                      used by each work-item,
1582                                                      granularity is device
1583                                                      specific:
1584
1585                                                      GFX6-GFX9
1586                                                        - max_vgpr 1..256
1587                                                        - roundup((max_vgpg + 1)
1588                                                          / 4) - 1
1589
1590                                                      Used by CP to set up
1591                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1592      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1593                                                      used by a wavefront,
1594                                                      granularity is device
1595                                                      specific:
1596
1597                                                      GFX6-GFX8
1598                                                        - max_sgpr 1..112
1599                                                        - roundup((max_sgpg + 1)
1600                                                          / 8) - 1
1601                                                      GFX9
1602                                                        - max_sgpr 1..112
1603                                                        - roundup((max_sgpg + 1)
1604                                                          / 16) - 1
1605
1606                                                      Includes the special SGPRs
1607                                                      for VCC, Flat Scratch (for
1608                                                      GFX7 onwards) and XNACK
1609                                                      (for GFX8 onwards). It does
1610                                                      not include the 16 SGPR
1611                                                      added if a trap handler is
1612                                                      enabled.
1613
1614                                                      Used by CP to set up
1615                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1616      11:10   2 bits  PRIORITY                        Must be 0.
1617
1618                                                      Start executing wavefront
1619                                                      at the specified priority.
1620
1621                                                      CP is responsible for
1622                                                      filling in
1623                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1624      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1625                                                      with specified rounding
1626                                                      mode for single (32
1627                                                      bit) floating point
1628                                                      precision floating point
1629                                                      operations.
1630
1631                                                      Floating point rounding
1632                                                      mode values are defined in
1633                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1634
1635                                                      Used by CP to set up
1636                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1637      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1638                                                      with specified rounding
1639                                                      denorm mode for half/double (16
1640                                                      and 64 bit) floating point
1641                                                      precision floating point
1642                                                      operations.
1643
1644                                                      Floating point rounding
1645                                                      mode values are defined in
1646                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1647
1648                                                      Used by CP to set up
1649                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1650      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1651                                                      with specified denorm mode
1652                                                      for single (32
1653                                                      bit)  floating point
1654                                                      precision floating point
1655                                                      operations.
1656
1657                                                      Floating point denorm mode
1658                                                      values are defined in
1659                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1660
1661                                                      Used by CP to set up
1662                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1663      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1664                                                      with specified denorm mode
1665                                                      for half/double (16
1666                                                      and 64 bit) floating point
1667                                                      precision floating point
1668                                                      operations.
1669
1670                                                      Floating point denorm mode
1671                                                      values are defined in
1672                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1673
1674                                                      Used by CP to set up
1675                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1676      20      1 bit   PRIV                            Must be 0.
1677
1678                                                      Start executing wavefront
1679                                                      in privilege trap handler
1680                                                      mode.
1681
1682                                                      CP is responsible for
1683                                                      filling in
1684                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1685      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1686                                                      with DX10 clamp mode
1687                                                      enabled. Used by the vector
1688                                                      ALU to force DX10 style
1689                                                      treatment of NaN's (when
1690                                                      set, clamp NaN to zero,
1691                                                      otherwise pass NaN
1692                                                      through).
1693
1694                                                      Used by CP to set up
1695                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1696      22      1 bit   DEBUG_MODE                      Must be 0.
1697
1698                                                      Start executing wavefront
1699                                                      in single step mode.
1700
1701                                                      CP is responsible for
1702                                                      filling in
1703                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1704      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1705                                                      with IEEE mode
1706                                                      enabled. Floating point
1707                                                      opcodes that support
1708                                                      exception flag gathering
1709                                                      will quiet and propagate
1710                                                      signaling-NaN inputs per
1711                                                      IEEE 754-2008. Min_dx10 and
1712                                                      max_dx10 become IEEE
1713                                                      754-2008 compliant due to
1714                                                      signaling-NaN propagation
1715                                                      and quieting.
1716
1717                                                      Used by CP to set up
1718                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1719      24      1 bit   BULKY                           Must be 0.
1720
1721                                                      Only one work-group allowed
1722                                                      to execute on a compute
1723                                                      unit.
1724
1725                                                      CP is responsible for
1726                                                      filling in
1727                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1728      25      1 bit   CDBG_USER                       Must be 0.
1729
1730                                                      Flag that can be used to
1731                                                      control debugging code.
1732
1733                                                      CP is responsible for
1734                                                      filling in
1735                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1736      26      1 bit   FP16_OVFL                       GFX6-GFX8
1737                                                        Reserved, must be 0.
1738                                                      GFX9
1739                                                        Wavefront starts execution
1740                                                        with specified fp16 overflow
1741                                                        mode.
1742
1743                                                        - If 0, fp16 overflow generates
1744                                                          +/-INF values.
1745                                                        - If 1, fp16 overflow that is the
1746                                                          result of an +/-INF input value
1747                                                          or divide by 0 produces a +/-INF,
1748                                                          otherwise clamps computed
1749                                                          overflow to +/-MAX_FP16 as
1750                                                          appropriate.
1751
1752                                                        Used by CP to set up
1753                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1754      31:27   5 bits                                  Reserved, must be 0.
1755      32      **Total size 4 bytes**
1756      ======= ===================================================================================================================
1757
1758 ..
1759
1760   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1761      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1762
1763      ======= ======= =============================== ===========================================================================
1764      Bits    Size    Field Name                      Description
1765      ======= ======= =============================== ===========================================================================
1766      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1767                      _WAVE_OFFSET                    SGPR wave scratch offset
1768                                                      system register (see
1769                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1770
1771                                                      Used by CP to set up
1772                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1773      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1774                                                      user data registers
1775                                                      requested. This number must
1776                                                      match the number of user
1777                                                      data registers enabled.
1778
1779                                                      Used by CP to set up
1780                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1781      6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1782                                                      TRAP instruction which
1783                                                      requires a trap handler to
1784                                                      be enabled.
1785
1786                                                      CP sets
1787                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1788                                                      if the runtime has
1789                                                      installed a trap handler
1790                                                      regardless of the setting
1791                                                      of this field.
1792      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1793                                                      system SGPR register for
1794                                                      the work-group id in the X
1795                                                      dimension (see
1796                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1797
1798                                                      Used by CP to set up
1799                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1800      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1801                                                      system SGPR register for
1802                                                      the work-group id in the Y
1803                                                      dimension (see
1804                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1805
1806                                                      Used by CP to set up
1807                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1808      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1809                                                      system SGPR register for
1810                                                      the work-group id in the Z
1811                                                      dimension (see
1812                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1813
1814                                                      Used by CP to set up
1815                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1816      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1817                                                      system SGPR register for
1818                                                      work-group information (see
1819                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1820
1821                                                      Used by CP to set up
1822                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1823      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1824                                                      VGPR system registers used
1825                                                      for the work-item ID.
1826                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1827                                                      defines the values.
1828
1829                                                      Used by CP to set up
1830                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1831      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1832
1833                                                      Wavefront starts execution
1834                                                      with address watch
1835                                                      exceptions enabled which
1836                                                      are generated when L1 has
1837                                                      witnessed a thread access
1838                                                      an *address of
1839                                                      interest*.
1840
1841                                                      CP is responsible for
1842                                                      filling in the address
1843                                                      watch bit in
1844                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1845                                                      according to what the
1846                                                      runtime requests.
1847      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1848
1849                                                      Wavefront starts execution
1850                                                      with memory violation
1851                                                      exceptions exceptions
1852                                                      enabled which are generated
1853                                                      when a memory violation has
1854                                                      occurred for this wave from
1855                                                      L1 or LDS
1856                                                      (write-to-read-only-memory,
1857                                                      mis-aligned atomic, LDS
1858                                                      address out of range,
1859                                                      illegal address, etc.).
1860
1861                                                      CP sets the memory
1862                                                      violation bit in
1863                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1864                                                      according to what the
1865                                                      runtime requests.
1866      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1867
1868                                                      CP uses the rounded value
1869                                                      from the dispatch packet,
1870                                                      not this value, as the
1871                                                      dispatch may contain
1872                                                      dynamically allocated group
1873                                                      segment memory. CP writes
1874                                                      directly to
1875                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1876
1877                                                      Amount of group segment
1878                                                      (LDS) to allocate for each
1879                                                      work-group. Granularity is
1880                                                      device specific:
1881
1882                                                      GFX6:
1883                                                        roundup(lds-size / (64 * 4))
1884                                                      GFX7-GFX9:
1885                                                        roundup(lds-size / (128 * 4))
1886
1887      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1888                      _INVALID_OPERATION              with specified exceptions
1889                                                      enabled.
1890
1891                                                      Used by CP to set up
1892                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1893                                                      (set from bits 0..6).
1894
1895                                                      IEEE 754 FP Invalid
1896                                                      Operation
1897      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
1898                      _SOURCE                         input operands is a
1899                                                      denormal number
1900      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1901                      _DIVISION_BY_ZERO               Zero
1902      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
1903                      _OVERFLOW
1904      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1905                      _UNDERFLOW
1906      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1907                      _INEXACT
1908      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1909                      _ZERO                           (rcp_iflag_f32 instruction
1910                                                      only)
1911      31      1 bit                                   Reserved, must be 0.
1912      32      **Total size 4 bytes.**
1913      ======= ===================================================================================================================
1914
1915 ..
1916
1917   .. table:: Floating Point Rounding Mode Enumeration Values
1918      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1919
1920      ====================================== ===== ==============================
1921      Enumeration Name                       Value Description
1922      ====================================== ===== ==============================
1923      AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
1924      AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
1925      AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
1926      AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
1927      ====================================== ===== ==============================
1928
1929 ..
1930
1931   .. table:: Floating Point Denorm Mode Enumeration Values
1932      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1933
1934      ====================================== ===== ==============================
1935      Enumeration Name                       Value Description
1936      ====================================== ===== ==============================
1937      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
1938                                                   Denorms
1939      AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
1940      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
1941      AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
1942      ====================================== ===== ==============================
1943
1944 ..
1945
1946   .. table:: System VGPR Work-Item ID Enumeration Values
1947      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1948
1949      ======================================== ===== ============================
1950      Enumeration Name                         Value Description
1951      ======================================== ===== ============================
1952      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
1953                                                     ID.
1954      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1955                                                     dimensions ID.
1956      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1957                                                     dimensions ID.
1958      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1959      ======================================== ===== ============================
1960
1961 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1962
1963 Initial Kernel Execution State
1964 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1965
1966 This section defines the register state that will be set up by the packet
1967 processor prior to the start of execution of every wavefront. This is limited by
1968 the constraints of the hardware controllers of CP/ADC/SPI.
1969
1970 The order of the SGPR registers is defined, but the compiler can specify which
1971 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1972 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1973 for enabled registers are dense starting at SGPR0: the first enabled register is
1974 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1975 an SGPR number.
1976
1977 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1978 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1979 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1980 initialized. These are then immediately followed by the System SGPRs that are
1981 set up by ADC/SPI and can have different values for each wave of the grid
1982 dispatch.
1983
1984 SGPR register initial state is defined in
1985 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1986
1987   .. table:: SGPR Register Set Up Order
1988      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1989
1990      ========== ========================== ====== ==============================
1991      SGPR Order Name                       Number Description
1992                 (kernel descriptor enable  of
1993                 field)                     SGPRs
1994      ========== ========================== ====== ==============================
1995      First      Private Segment Buffer     4      V# that can be used, together
1996                 (enable_sgpr_private              with Scratch Wave Offset as an
1997                 _segment_buffer)                  offset, to access the private
1998                                                   memory space using a segment
1999                                                   address.
2000
2001                                                   CP uses the value provided by
2002                                                   the runtime.
2003      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2004                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2005                                                   actually executing.
2006      then       Queue Ptr                  2      64 bit address of amd_queue_t
2007                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2008                                                   the dispatch packet was
2009                                                   queued.
2010      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2011                 (enable_sgpr_kernarg              segment. This is directly
2012                 _segment_ptr)                     copied from the
2013                                                   kernarg_address in the kernel
2014                                                   dispatch packet.
2015
2016                                                   Having CP load it once avoids
2017                                                   loading it at the beginning of
2018                                                   every wavefront.
2019      then       Dispatch Id                2      64 bit Dispatch ID of the
2020                 (enable_sgpr_dispatch_id)         dispatch packet being
2021                                                   executed.
2022      then       Flat Scratch Init          2      This is 2 SGPRs:
2023                 (enable_sgpr_flat_scratch
2024                 _init)                            GFX6
2025                                                     Not supported.
2026                                                   GFX7-GFX8
2027                                                     The first SGPR is a 32 bit
2028                                                     byte offset from
2029                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2030                                                     to per SPI base of memory
2031                                                     for scratch for the queue
2032                                                     executing the kernel
2033                                                     dispatch. CP obtains this
2034                                                     from the runtime. (The
2035                                                     Scratch Segment Buffer base
2036                                                     address is
2037                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2038                                                     plus this offset.) The value
2039                                                     of Scratch Wave Offset must
2040                                                     be added to this offset by
2041                                                     the kernel machine code,
2042                                                     right shifted by 8, and
2043                                                     moved to the FLAT_SCRATCH_HI
2044                                                     SGPR register.
2045                                                     FLAT_SCRATCH_HI corresponds
2046                                                     to SGPRn-4 on GFX7, and
2047                                                     SGPRn-6 on GFX8 (where SGPRn
2048                                                     is the highest numbered SGPR
2049                                                     allocated to the wave).
2050                                                     FLAT_SCRATCH_HI is
2051                                                     multiplied by 256 (as it is
2052                                                     in units of 256 bytes) and
2053                                                     added to
2054                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2055                                                     to calculate the per wave
2056                                                     FLAT SCRATCH BASE in flat
2057                                                     memory instructions that
2058                                                     access the scratch
2059                                                     apperture.
2060
2061                                                     The second SGPR is 32 bit
2062                                                     byte size of a single
2063                                                     work-item's scratch memory
2064                                                     usage. CP obtains this from
2065                                                     the runtime, and it is
2066                                                     always a multiple of DWORD.
2067                                                     CP checks that the value in
2068                                                     the kernel dispatch packet
2069                                                     Private Segment Byte Size is
2070                                                     not larger, and requests the
2071                                                     runtime to increase the
2072                                                     queue's scratch size if
2073                                                     necessary. The kernel code
2074                                                     must move it to
2075                                                     FLAT_SCRATCH_LO which is
2076                                                     SGPRn-3 on GFX7 and SGPRn-5
2077                                                     on GFX8. FLAT_SCRATCH_LO is
2078                                                     used as the FLAT SCRATCH
2079                                                     SIZE in flat memory
2080                                                     instructions. Having CP load
2081                                                     it once avoids loading it at
2082                                                     the beginning of every
2083                                                     wavefront.
2084                                                   GFX9
2085                                                     This is the
2086                                                     64 bit base address of the
2087                                                     per SPI scratch backing
2088                                                     memory managed by SPI for
2089                                                     the queue executing the
2090                                                     kernel dispatch. CP obtains
2091                                                     this from the runtime (and
2092                                                     divides it if there are
2093                                                     multiple Shader Arrays each
2094                                                     with its own SPI). The value
2095                                                     of Scratch Wave Offset must
2096                                                     be added by the kernel
2097                                                     machine code and the result
2098                                                     moved to the FLAT_SCRATCH
2099                                                     SGPR which is SGPRn-6 and
2100                                                     SGPRn-5. It is used as the
2101                                                     FLAT SCRATCH BASE in flat
2102                                                     memory instructions.
2103      then       Private Segment Size       1      The 32 bit byte size of a
2104                                                   (enable_sgpr_private single
2105                                                   work-item's
2106                                                   scratch_segment_size) memory
2107                                                   allocation. This is the
2108                                                   value from the kernel
2109                                                   dispatch packet Private
2110                                                   Segment Byte Size rounded up
2111                                                   by CP to a multiple of
2112                                                   DWORD.
2113
2114                                                   Having CP load it once avoids
2115                                                   loading it at the beginning of
2116                                                   every wavefront.
2117
2118                                                   This is not used for
2119                                                   GFX7-GFX8 since it is the same
2120                                                   value as the second SGPR of
2121                                                   Flat Scratch Init. However, it
2122                                                   may be needed for GFX9 which
2123                                                   changes the meaning of the
2124                                                   Flat Scratch Init value.
2125      then       Grid Work-Group Count X    1      32 bit count of the number of
2126                 (enable_sgpr_grid                 work-groups in the X dimension
2127                 _workgroup_count_X)               for the grid being
2128                                                   executed. Computed from the
2129                                                   fields in the kernel dispatch
2130                                                   packet as ((grid_size.x +
2131                                                   workgroup_size.x - 1) /
2132                                                   workgroup_size.x).
2133      then       Grid Work-Group Count Y    1      32 bit count of the number of
2134                 (enable_sgpr_grid                 work-groups in the Y dimension
2135                 _workgroup_count_Y &&             for the grid being
2136                 less than 16 previous             executed. Computed from the
2137                 SGPRs)                            fields in the kernel dispatch
2138                                                   packet as ((grid_size.y +
2139                                                   workgroup_size.y - 1) /
2140                                                   workgroupSize.y).
2141
2142                                                   Only initialized if <16
2143                                                   previous SGPRs initialized.
2144      then       Grid Work-Group Count Z    1      32 bit count of the number of
2145                 (enable_sgpr_grid                 work-groups in the Z dimension
2146                 _workgroup_count_Z &&             for the grid being
2147                 less than 16 previous             executed. Computed from the
2148                 SGPRs)                            fields in the kernel dispatch
2149                                                   packet as ((grid_size.z +
2150                                                   workgroup_size.z - 1) /
2151                                                   workgroupSize.z).
2152
2153                                                   Only initialized if <16
2154                                                   previous SGPRs initialized.
2155      then       Work-Group Id X            1      32 bit work-group id in X
2156                 (enable_sgpr_workgroup_id         dimension of grid for
2157                 _X)                               wavefront.
2158      then       Work-Group Id Y            1      32 bit work-group id in Y
2159                 (enable_sgpr_workgroup_id         dimension of grid for
2160                 _Y)                               wavefront.
2161      then       Work-Group Id Z            1      32 bit work-group id in Z
2162                 (enable_sgpr_workgroup_id         dimension of grid for
2163                 _Z)                               wavefront.
2164      then       Work-Group Info            1      {first_wave, 14'b0000,
2165                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2166                 _info)                            threadgroup_size_in_waves[5:0]}
2167      then       Scratch Wave Offset        1      32 bit byte offset from base
2168                 (enable_sgpr_private              of scratch base of queue
2169                 _segment_wave_offset)             executing the kernel
2170                                                   dispatch. Must be used as an
2171                                                   offset with Private
2172                                                   segment address when using
2173                                                   Scratch Segment Buffer. It
2174                                                   must be used to set up FLAT
2175                                                   SCRATCH for flat addressing
2176                                                   (see
2177                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2178      ========== ========================== ====== ==============================
2179
2180 The order of the VGPR registers is defined, but the compiler can specify which
2181 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2182 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2183 for enabled registers are dense starting at VGPR0: the first enabled register is
2184 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2185 VGPR number.
2186
2187 VGPR register initial state is defined in
2188 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2189
2190   .. table:: VGPR Register Set Up Order
2191      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2192
2193      ========== ========================== ====== ==============================
2194      VGPR Order Name                       Number Description
2195                 (kernel descriptor enable  of
2196                 field)                     VGPRs
2197      ========== ========================== ====== ==============================
2198      First      Work-Item Id X             1      32 bit work item id in X
2199                 (Always initialized)              dimension of work-group for
2200                                                   wavefront lane.
2201      then       Work-Item Id Y             1      32 bit work item id in Y
2202                 (enable_vgpr_workitem_id          dimension of work-group for
2203                 > 0)                              wavefront lane.
2204      then       Work-Item Id Z             1      32 bit work item id in Z
2205                 (enable_vgpr_workitem_id          dimension of work-group for
2206                 > 1)                              wavefront lane.
2207      ========== ========================== ====== ==============================
2208
2209 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2210
2211 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2212    registers.
2213 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2214    combination including none.
2215 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2216    cannot included with the flat scratch init value which is per queue.
2217 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2218    or (X, Y, Z).
2219
2220 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2221 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2222
2223 The global segment can be accessed either using buffer instructions (GFX6 which
2224 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2225 instructions (GFX9).
2226
2227 If buffer operations are used then the compiler can generate a V# with the
2228 following properties:
2229
2230 * base address of 0
2231 * no swizzle
2232 * ATC: 1 if IOMMU present (such as APU)
2233 * ptr64: 1
2234 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2235   APU and NC for dGPU).
2236
2237 .. _amdgpu-amdhsa-kernel-prolog:
2238
2239 Kernel Prolog
2240 ~~~~~~~~~~~~~
2241
2242 .. _amdgpu-amdhsa-m0:
2243
2244 M0
2245 ++
2246
2247 GFX6-GFX8
2248   The M0 register must be initialized with a value at least the total LDS size
2249   if the kernel may access LDS via DS or flat operations. Total LDS size is
2250   available in dispatch packet. For M0, it is also possible to use maximum
2251   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2252   GFX7-GFX8).
2253 GFX9
2254   The M0 register is not used for range checking LDS accesses and so does not
2255   need to be initialized in the prolog.
2256
2257 .. _amdgpu-amdhsa-flat-scratch:
2258
2259 Flat Scratch
2260 ++++++++++++
2261
2262 If the kernel may use flat operations to access scratch memory, the prolog code
2263 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2264 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2265 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2266
2267 GFX6
2268   Flat scratch is not supported.
2269
2270 GFX7-GFX8
2271   1. The low word of Flat Scratch Init is 32 bit byte offset from
2272      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2273      being managed by SPI for the queue executing the kernel dispatch. This is
2274      the same value used in the Scratch Segment Buffer V# base address. The
2275      prolog must add the value of Scratch Wave Offset to get the wave's byte
2276      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2277      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2278      by 8 before moving into FLAT_SCRATCH_LO.
2279   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2280      work-items scratch memory usage. This is directly loaded from the kernel
2281      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2282      DWORD. Having CP load it once avoids loading it at the beginning of every
2283      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2284      SIZE.
2285
2286 GFX9
2287   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2288   memory being managed by SPI for the queue executing the kernel dispatch. The
2289   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2290   pair for use as the flat scratch base in flat memory instructions.
2291
2292 .. _amdgpu-amdhsa-memory-model:
2293
2294 Memory Model
2295 ~~~~~~~~~~~~
2296
2297 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2298 (see :ref:`memmodel`). *The implementation is WIP.*
2299
2300 .. TODO
2301    Update when implementation complete.
2302
2303 The AMDGPU backend supports the memory synchronization scopes specified in
2304 :ref:`amdgpu-memory-scopes`.
2305
2306 The code sequences used to implement the memory model are defined in table
2307 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2308
2309 The sequences specify the order of instructions that a single thread must
2310 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2311 to other memory instructions executed by the same thread. This allows them to be
2312 moved earlier or later which can allow them to be combined with other instances
2313 of the same instruction, or hoisted/sunk out of loops to improve
2314 performance. Only the instructions related to the memory model are given;
2315 additional ``s_waitcnt`` instructions are required to ensure registers are
2316 defined before being used. These may be able to be combined with the memory
2317 model ``s_waitcnt`` instructions as described above.
2318
2319 The AMDGPU backend supports the following memory models:
2320
2321   HSA Memory Model [HSA]_
2322     The HSA memory model uses a single happens-before relation for all address
2323     spaces (see :ref:`amdgpu-address-spaces`).
2324   OpenCL Memory Model [OpenCL]_
2325     The OpenCL memory model which has separate happens-before relations for the
2326     global and local address spaces. Only a fence specifying both global and
2327     local address space, and seq_cst instructions join the relationships. Since
2328     the LLVM ``memfence`` instruction does not allow an address space to be
2329     specified the OpenCL fence has to convervatively assume both local and
2330     global address space was specified. However, optimizations can often be
2331     done to eliminate the additional ``s_waitcnt`` instructions when there are
2332     no intervening memory instructions which access the corresponding address
2333     space. The code sequences in the table indicate what can be omitted for the
2334     OpenCL memory. The target triple environment is used to determine if the
2335     source language is OpenCL (see :ref:`amdgpu-opencl`).
2336
2337 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2338 operations.
2339
2340 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2341 termed vector memory operations.
2342
2343 For GFX6-GFX9:
2344
2345 * Each agent has multiple compute units (CU).
2346 * Each CU has multiple SIMDs that execute wavefronts.
2347 * The wavefronts for a single work-group are executed in the same CU but may be
2348   executed by different SIMDs.
2349 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2350   executing on it.
2351 * All LDS operations of a CU are performed as wavefront wide operations in a
2352   global order and involve no caching. Completion is reported to a wavefront in
2353   execution order.
2354 * The LDS memory has multiple request queues shared by the SIMDs of a
2355   CU. Therefore, the LDS operations performed by different waves of a work-group
2356   can be reordered relative to each other, which can result in reordering the
2357   visibility of vector memory operations with respect to LDS operations of other
2358   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2359   ensure synchronization between LDS operations and vector memory operations
2360   between waves of a work-group, but not between operations performed by the
2361   same wavefront.
2362 * The vector memory operations are performed as wavefront wide operations and
2363   completion is reported to a wavefront in execution order. The exception is
2364   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2365   vector memory order if they access LDS memory, and out of LDS operation order
2366   if they access global memory.
2367 * The vector memory operations access a single vector L1 cache shared by all
2368   SIMDs a CU. Therefore, no special action is required for coherence between the
2369   lanes of a single wavefront, or for coherence between wavefronts in the same
2370   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2371   executing in different work-groups as they may be executing on different CUs.
2372 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2373   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2374   scalar operations are used in a restricted way so do not impact the memory
2375   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2376 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2377   the same agent.
2378 * The L2 cache has independent channels to service disjoint ranges of virtual
2379   addresses.
2380 * Each CU has a separate request queue per channel. Therefore, the vector and
2381   scalar memory operations performed by waves executing in different work-groups
2382   (which may be executing on different CUs) of an agent can be reordered
2383   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2384   synchronization between vector memory operations of different CUs. It ensures a
2385   previous vector memory operation has completed before executing a subsequent
2386   vector memory or LDS operation and so can be used to meet the requirements of
2387   acquire and release.
2388 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2389   of virtual addresses can be set up to bypass it to ensure system coherence.
2390
2391 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2392 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2393 memory, atomic memory orderings are not meaningful and all accesses are treated
2394 as non-atomic.
2395
2396 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2397 scalar memory instructions). Since the constant address space contents do not
2398 change during the execution of a kernel dispatch it is not legal to perform
2399 stores, and atomic memory orderings are not meaningful and all access are
2400 treated as non-atomic.
2401
2402 A memory synchronization scope wider than work-group is not meaningful for the
2403 group (LDS) address space and is treated as work-group.
2404
2405 The memory model does not support the region address space which is treated as
2406 non-atomic.
2407
2408 Acquire memory ordering is not meaningful on store atomic instructions and is
2409 treated as non-atomic.
2410
2411 Release memory ordering is not meaningful on load atomic instructions and is
2412 treated a non-atomic.
2413
2414 Acquire-release memory ordering is not meaningful on load or store atomic
2415 instructions and is treated as acquire and release respectively.
2416
2417 AMDGPU backend only uses scalar memory operations to access memory that is
2418 proven to not change during the execution of the kernel dispatch. This includes
2419 constant address space and global address space for program scope const
2420 variables. Therefore the kernel machine code does not have to maintain the
2421 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2422 and vector L1 caches are invalidated between kernel dispatches by CP since
2423 constant address space data may change between kernel dispatch executions. See
2424 :ref:`amdgpu-amdhsa-memory-spaces`.
2425
2426 The one execption is if scalar writes are used to spill SGPR registers. In this
2427 case the AMDGPU backend ensures the memory location used to spill is never
2428 accessed by vector memory operations at the same time. If scalar writes are used
2429 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2430 return since the locations may be used for vector memory instructions by a
2431 future wave that uses the same scratch area, or a function call that creates a
2432 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2433 as all scalar writes are write-before-read in the same thread.
2434
2435 Scratch backing memory (which is used for the private address space)
2436 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2437 address space is only accessed by a single thread, and is always
2438 write-before-read, there is never a need to invalidate these entries from the L1
2439 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2440 volatile cache lines.
2441
2442 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2443 to invalidate the L2 cache. This also causes it to be treated as
2444 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2445 (cache coherent) and so the L2 cache will coherent with the CPU and other
2446 agents.
2447
2448   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2449      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2450
2451      ============ ============ ============== ========== ===============================
2452      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2453                   Ordering     Sync Scope     Address
2454                                               Space
2455      ============ ============ ============== ========== ===============================
2456      **Non-Atomic**
2457      -----------------------------------------------------------------------------------
2458      load         *none*       *none*         - global   - !volatile & !nontemporal
2459                                               - generic
2460                                               - private    1. buffer/global/flat_load
2461                                               - constant
2462                                                          - volatile & !nontemporal
2463
2464                                                            1. buffer/global/flat_load
2465                                                               glc=1
2466
2467                                                          - nontemporal
2468
2469                                                            1. buffer/global/flat_load
2470                                                               glc=1 slc=1
2471
2472      load         *none*       *none*         - local    1. ds_load
2473      store        *none*       *none*         - global   - !nontemporal
2474                                               - generic
2475                                               - private    1. buffer/global/flat_store
2476                                               - constant
2477                                                          - nontemporal
2478
2479                                                            1. buffer/global/flat_stote
2480                                                               glc=1 slc=1
2481
2482      store        *none*       *none*         - local    1. ds_store
2483      **Unordered Atomic**
2484      -----------------------------------------------------------------------------------
2485      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2486      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2487      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2488                                                          atomic*.
2489      **Monotonic Atomic**
2490      -----------------------------------------------------------------------------------
2491      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2492                                - wavefront    - generic
2493                                - workgroup
2494      load atomic  monotonic    - singlethread - local    1. ds_load
2495                                - wavefront
2496                                - workgroup
2497      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2498                                - system       - generic     glc=1
2499      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2500                                - wavefront    - generic
2501                                - workgroup
2502                                - agent
2503                                - system
2504      store atomic monotonic    - singlethread - local    1. ds_store
2505                                - wavefront
2506                                - workgroup
2507      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2508                                - wavefront    - generic
2509                                - workgroup
2510                                - agent
2511                                - system
2512      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2513                                - wavefront
2514                                - workgroup
2515      **Acquire Atomic**
2516      -----------------------------------------------------------------------------------
2517      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2518                                - wavefront    - local
2519                                               - generic
2520      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2521      load atomic  acquire      - workgroup    - local    1. ds_load
2522                                                          2. s_waitcnt lgkmcnt(0)
2523
2524                                                            - If OpenCL, omit.
2525                                                            - Must happen before
2526                                                              any following
2527                                                              global/generic
2528                                                              load/load
2529                                                              atomic/store/store
2530                                                              atomic/atomicrmw.
2531                                                            - Ensures any
2532                                                              following global
2533                                                              data read is no
2534                                                              older than the load
2535                                                              atomic value being
2536                                                              acquired.
2537      load atomic  acquire      - workgroup    - generic  1. flat_load
2538                                                          2. s_waitcnt lgkmcnt(0)
2539
2540                                                            - If OpenCL, omit.
2541                                                            - Must happen before
2542                                                              any following
2543                                                              global/generic
2544                                                              load/load
2545                                                              atomic/store/store
2546                                                              atomic/atomicrmw.
2547                                                            - Ensures any
2548                                                              following global
2549                                                              data read is no
2550                                                              older than the load
2551                                                              atomic value being
2552                                                              acquired.
2553      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2554                                - system                     glc=1
2555                                                          2. s_waitcnt vmcnt(0)
2556
2557                                                            - Must happen before
2558                                                              following
2559                                                              buffer_wbinvl1_vol.
2560                                                            - Ensures the load
2561                                                              has completed
2562                                                              before invalidating
2563                                                              the cache.
2564
2565                                                          3. buffer_wbinvl1_vol
2566
2567                                                            - Must happen before
2568                                                              any following
2569                                                              global/generic
2570                                                              load/load
2571                                                              atomic/atomicrmw.
2572                                                            - Ensures that
2573                                                              following
2574                                                              loads will not see
2575                                                              stale global data.
2576
2577      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2578                                - system                  2. s_waitcnt vmcnt(0) &
2579                                                             lgkmcnt(0)
2580
2581                                                            - If OpenCL omit
2582                                                              lgkmcnt(0).
2583                                                            - Must happen before
2584                                                              following
2585                                                              buffer_wbinvl1_vol.
2586                                                            - Ensures the flat_load
2587                                                              has completed
2588                                                              before invalidating
2589                                                              the cache.
2590
2591                                                          3. buffer_wbinvl1_vol
2592
2593                                                            - Must happen before
2594                                                              any following
2595                                                              global/generic
2596                                                              load/load
2597                                                              atomic/atomicrmw.
2598                                                            - Ensures that
2599                                                              following loads
2600                                                              will not see stale
2601                                                              global data.
2602
2603      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2604                                - wavefront    - local
2605                                               - generic
2606      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2607      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2608                                                          2. waitcnt lgkmcnt(0)
2609
2610                                                            - If OpenCL, omit.
2611                                                            - Must happen before
2612                                                              any following
2613                                                              global/generic
2614                                                              load/load
2615                                                              atomic/store/store
2616                                                              atomic/atomicrmw.
2617                                                            - Ensures any
2618                                                              following global
2619                                                              data read is no
2620                                                              older than the
2621                                                              atomicrmw value
2622                                                              being acquired.
2623
2624      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2625                                                          2. waitcnt lgkmcnt(0)
2626
2627                                                            - If OpenCL, omit.
2628                                                            - Must happen before
2629                                                              any following
2630                                                              global/generic
2631                                                              load/load
2632                                                              atomic/store/store
2633                                                              atomic/atomicrmw.
2634                                                            - Ensures any
2635                                                              following global
2636                                                              data read is no
2637                                                              older than the
2638                                                              atomicrmw value
2639                                                              being acquired.
2640
2641      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2642                                - system                  2. s_waitcnt vmcnt(0)
2643
2644                                                            - Must happen before
2645                                                              following
2646                                                              buffer_wbinvl1_vol.
2647                                                            - Ensures the
2648                                                              atomicrmw has
2649                                                              completed before
2650                                                              invalidating the
2651                                                              cache.
2652
2653                                                          3. buffer_wbinvl1_vol
2654
2655                                                            - Must happen before
2656                                                              any following
2657                                                              global/generic
2658                                                              load/load
2659                                                              atomic/atomicrmw.
2660                                                            - Ensures that
2661                                                              following loads
2662                                                              will not see stale
2663                                                              global data.
2664
2665      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2666                                - system                  2. s_waitcnt vmcnt(0) &
2667                                                             lgkmcnt(0)
2668
2669                                                            - If OpenCL, omit
2670                                                              lgkmcnt(0).
2671                                                            - Must happen before
2672                                                              following
2673                                                              buffer_wbinvl1_vol.
2674                                                            - Ensures the
2675                                                              atomicrmw has
2676                                                              completed before
2677                                                              invalidating the
2678                                                              cache.
2679
2680                                                          3. buffer_wbinvl1_vol
2681
2682                                                            - Must happen before
2683                                                              any following
2684                                                              global/generic
2685                                                              load/load
2686                                                              atomic/atomicrmw.
2687                                                            - Ensures that
2688                                                              following loads
2689                                                              will not see stale
2690                                                              global data.
2691
2692      fence        acquire      - singlethread *none*     *none*
2693                                - wavefront
2694      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2695
2696                                                            - If OpenCL and
2697                                                              address space is
2698                                                              not generic, omit.
2699                                                            - However, since LLVM
2700                                                              currently has no
2701                                                              address space on
2702                                                              the fence need to
2703                                                              conservatively
2704                                                              always generate. If
2705                                                              fence had an
2706                                                              address space then
2707                                                              set to address
2708                                                              space of OpenCL
2709                                                              fence flag, or to
2710                                                              generic if both
2711                                                              local and global
2712                                                              flags are
2713                                                              specified.
2714                                                            - Must happen after
2715                                                              any preceding
2716                                                              local/generic load
2717                                                              atomic/atomicrmw
2718                                                              with an equal or
2719                                                              wider sync scope
2720                                                              and memory ordering
2721                                                              stronger than
2722                                                              unordered (this is
2723                                                              termed the
2724                                                              fence-paired-atomic).
2725                                                            - Must happen before
2726                                                              any following
2727                                                              global/generic
2728                                                              load/load
2729                                                              atomic/store/store
2730                                                              atomic/atomicrmw.
2731                                                            - Ensures any
2732                                                              following global
2733                                                              data read is no
2734                                                              older than the
2735                                                              value read by the
2736                                                              fence-paired-atomic.
2737
2738      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2739                                - system                     vmcnt(0)
2740
2741                                                            - If OpenCL and
2742                                                              address space is
2743                                                              not generic, omit
2744                                                              lgkmcnt(0).
2745                                                            - However, since LLVM
2746                                                              currently has no
2747                                                              address space on
2748                                                              the fence need to
2749                                                              conservatively
2750                                                              always generate
2751                                                              (see comment for
2752                                                              previous fence).
2753                                                            - Could be split into
2754                                                              separate s_waitcnt
2755                                                              vmcnt(0) and
2756                                                              s_waitcnt
2757                                                              lgkmcnt(0) to allow
2758                                                              them to be
2759                                                              independently moved
2760                                                              according to the
2761                                                              following rules.
2762                                                            - s_waitcnt vmcnt(0)
2763                                                              must happen after
2764                                                              any preceding
2765                                                              global/generic load
2766                                                              atomic/atomicrmw
2767                                                              with an equal or
2768                                                              wider sync scope
2769                                                              and memory ordering
2770                                                              stronger than
2771                                                              unordered (this is
2772                                                              termed the
2773                                                              fence-paired-atomic).
2774                                                            - s_waitcnt lgkmcnt(0)
2775                                                              must happen after
2776                                                              any preceding
2777                                                              local/generic load
2778                                                              atomic/atomicrmw
2779                                                              with an equal or
2780                                                              wider sync scope
2781                                                              and memory ordering
2782                                                              stronger than
2783                                                              unordered (this is
2784                                                              termed the
2785                                                              fence-paired-atomic).
2786                                                            - Must happen before
2787                                                              the following
2788                                                              buffer_wbinvl1_vol.
2789                                                            - Ensures that the
2790                                                              fence-paired atomic
2791                                                              has completed
2792                                                              before invalidating
2793                                                              the
2794                                                              cache. Therefore
2795                                                              any following
2796                                                              locations read must
2797                                                              be no older than
2798                                                              the value read by
2799                                                              the
2800                                                              fence-paired-atomic.
2801
2802                                                          2. buffer_wbinvl1_vol
2803
2804                                                            - Must happen before any
2805                                                              following global/generic
2806                                                              load/load
2807                                                              atomic/store/store
2808                                                              atomic/atomicrmw.
2809                                                            - Ensures that
2810                                                              following loads
2811                                                              will not see stale
2812                                                              global data.
2813
2814      **Release Atomic**
2815      -----------------------------------------------------------------------------------
2816      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2817                                - wavefront    - local
2818                                               - generic
2819      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2820
2821                                                            - If OpenCL, omit.
2822                                                            - Must happen after
2823                                                              any preceding
2824                                                              local/generic
2825                                                              load/store/load
2826                                                              atomic/store
2827                                                              atomic/atomicrmw.
2828                                                            - Must happen before
2829                                                              the following
2830                                                              store.
2831                                                            - Ensures that all
2832                                                              memory operations
2833                                                              to local have
2834                                                              completed before
2835                                                              performing the
2836                                                              store that is being
2837                                                              released.
2838
2839                                                          2. buffer/global/flat_store
2840      store atomic release      - workgroup    - local    1. ds_store
2841      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2842
2843                                                            - If OpenCL, omit.
2844                                                            - Must happen after
2845                                                              any preceding
2846                                                              local/generic
2847                                                              load/store/load
2848                                                              atomic/store
2849                                                              atomic/atomicrmw.
2850                                                            - Must happen before
2851                                                              the following
2852                                                              store.
2853                                                            - Ensures that all
2854                                                              memory operations
2855                                                              to local have
2856                                                              completed before
2857                                                              performing the
2858                                                              store that is being
2859                                                              released.
2860
2861                                                          2. flat_store
2862      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2863                                - system       - generic     vmcnt(0)
2864
2865                                                            - If OpenCL, omit
2866                                                              lgkmcnt(0).
2867                                                            - Could be split into
2868                                                              separate s_waitcnt
2869                                                              vmcnt(0) and
2870                                                              s_waitcnt
2871                                                              lgkmcnt(0) to allow
2872                                                              them to be
2873                                                              independently moved
2874                                                              according to the
2875                                                              following rules.
2876                                                            - s_waitcnt vmcnt(0)
2877                                                              must happen after
2878                                                              any preceding
2879                                                              global/generic
2880                                                              load/store/load
2881                                                              atomic/store
2882                                                              atomic/atomicrmw.
2883                                                            - s_waitcnt lgkmcnt(0)
2884                                                              must happen after
2885                                                              any preceding
2886                                                              local/generic
2887                                                              load/store/load
2888                                                              atomic/store
2889                                                              atomic/atomicrmw.
2890                                                            - Must happen before
2891                                                              the following
2892                                                              store.
2893                                                            - Ensures that all
2894                                                              memory operations
2895                                                              to memory have
2896                                                              completed before
2897                                                              performing the
2898                                                              store that is being
2899                                                              released.
2900
2901                                                          2. buffer/global/ds/flat_store
2902      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2903                                - wavefront    - local
2904                                               - generic
2905      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2906
2907                                                            - If OpenCL, omit.
2908                                                            - Must happen after
2909                                                              any preceding
2910                                                              local/generic
2911                                                              load/store/load
2912                                                              atomic/store
2913                                                              atomic/atomicrmw.
2914                                                            - Must happen before
2915                                                              the following
2916                                                              atomicrmw.
2917                                                            - Ensures that all
2918                                                              memory operations
2919                                                              to local have
2920                                                              completed before
2921                                                              performing the
2922                                                              atomicrmw that is
2923                                                              being released.
2924
2925                                                          2. buffer/global/flat_atomic
2926      atomicrmw    release      - workgroup    - local    1. ds_atomic
2927      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2928
2929                                                            - If OpenCL, omit.
2930                                                            - Must happen after
2931                                                              any preceding
2932                                                              local/generic
2933                                                              load/store/load
2934                                                              atomic/store
2935                                                              atomic/atomicrmw.
2936                                                            - Must happen before
2937                                                              the following
2938                                                              atomicrmw.
2939                                                            - Ensures that all
2940                                                              memory operations
2941                                                              to local have
2942                                                              completed before
2943                                                              performing the
2944                                                              atomicrmw that is
2945                                                              being released.
2946
2947                                                          2. flat_atomic
2948      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2949                                - system       - generic     vmcnt(0)
2950
2951                                                            - If OpenCL, omit
2952                                                              lgkmcnt(0).
2953                                                            - Could be split into
2954                                                              separate s_waitcnt
2955                                                              vmcnt(0) and
2956                                                              s_waitcnt
2957                                                              lgkmcnt(0) to allow
2958                                                              them to be
2959                                                              independently moved
2960                                                              according to the
2961                                                              following rules.
2962                                                            - s_waitcnt vmcnt(0)
2963                                                              must happen after
2964                                                              any preceding
2965                                                              global/generic
2966                                                              load/store/load
2967                                                              atomic/store
2968                                                              atomic/atomicrmw.
2969                                                            - s_waitcnt lgkmcnt(0)
2970                                                              must happen after
2971                                                              any preceding
2972                                                              local/generic
2973                                                              load/store/load
2974                                                              atomic/store
2975                                                              atomic/atomicrmw.
2976                                                            - Must happen before
2977                                                              the following
2978                                                              atomicrmw.
2979                                                            - Ensures that all
2980                                                              memory operations
2981                                                              to global and local
2982                                                              have completed
2983                                                              before performing
2984                                                              the atomicrmw that
2985                                                              is being released.
2986
2987                                                          2. buffer/global/ds/flat_atomic
2988      fence        release      - singlethread *none*     *none*
2989                                - wavefront
2990      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2991
2992                                                            - If OpenCL and
2993                                                              address space is
2994                                                              not generic, omit.
2995                                                            - However, since LLVM
2996                                                              currently has no
2997                                                              address space on
2998                                                              the fence need to
2999                                                              conservatively
3000                                                              always generate. If
3001                                                              fence had an
3002                                                              address space then
3003                                                              set to address
3004                                                              space of OpenCL
3005                                                              fence flag, or to
3006                                                              generic if both
3007                                                              local and global
3008                                                              flags are
3009                                                              specified.
3010                                                            - Must happen after
3011                                                              any preceding
3012                                                              local/generic
3013                                                              load/load
3014                                                              atomic/store/store
3015                                                              atomic/atomicrmw.
3016                                                            - Must happen before
3017                                                              any following store
3018                                                              atomic/atomicrmw
3019                                                              with an equal or
3020                                                              wider sync scope
3021                                                              and memory ordering
3022                                                              stronger than
3023                                                              unordered (this is
3024                                                              termed the
3025                                                              fence-paired-atomic).
3026                                                            - Ensures that all
3027                                                              memory operations
3028                                                              to local have
3029                                                              completed before
3030                                                              performing the
3031                                                              following
3032                                                              fence-paired-atomic.
3033
3034      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3035                                - system                     vmcnt(0)
3036
3037                                                            - If OpenCL and
3038                                                              address space is
3039                                                              not generic, omit
3040                                                              lgkmcnt(0).
3041                                                            - If OpenCL and
3042                                                              address space is
3043                                                              local, omit
3044                                                              vmcnt(0).
3045                                                            - However, since LLVM
3046                                                              currently has no
3047                                                              address space on
3048                                                              the fence need to
3049                                                              conservatively
3050                                                              always generate. If
3051                                                              fence had an
3052                                                              address space then
3053                                                              set to address
3054                                                              space of OpenCL
3055                                                              fence flag, or to
3056                                                              generic if both
3057                                                              local and global
3058                                                              flags are
3059                                                              specified.
3060                                                            - Could be split into
3061                                                              separate s_waitcnt
3062                                                              vmcnt(0) and
3063                                                              s_waitcnt
3064                                                              lgkmcnt(0) to allow
3065                                                              them to be
3066                                                              independently moved
3067                                                              according to the
3068                                                              following rules.
3069                                                            - s_waitcnt vmcnt(0)
3070                                                              must happen after
3071                                                              any preceding
3072                                                              global/generic
3073                                                              load/store/load
3074                                                              atomic/store
3075                                                              atomic/atomicrmw.
3076                                                            - s_waitcnt lgkmcnt(0)
3077                                                              must happen after
3078                                                              any preceding
3079                                                              local/generic
3080                                                              load/store/load
3081                                                              atomic/store
3082                                                              atomic/atomicrmw.
3083                                                            - Must happen before
3084                                                              any following store
3085                                                              atomic/atomicrmw
3086                                                              with an equal or
3087                                                              wider sync scope
3088                                                              and memory ordering
3089                                                              stronger than
3090                                                              unordered (this is
3091                                                              termed the
3092                                                              fence-paired-atomic).
3093                                                            - Ensures that all
3094                                                              memory operations
3095                                                              have
3096                                                              completed before
3097                                                              performing the
3098                                                              following
3099                                                              fence-paired-atomic.
3100
3101      **Acquire-Release Atomic**
3102      -----------------------------------------------------------------------------------
3103      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3104                                - wavefront    - local
3105                                               - generic
3106      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3107
3108                                                            - If OpenCL, omit.
3109                                                            - Must happen after
3110                                                              any preceding
3111                                                              local/generic
3112                                                              load/store/load
3113                                                              atomic/store
3114                                                              atomic/atomicrmw.
3115                                                            - Must happen before
3116                                                              the following
3117                                                              atomicrmw.
3118                                                            - Ensures that all
3119                                                              memory operations
3120                                                              to local have
3121                                                              completed before
3122                                                              performing the
3123                                                              atomicrmw that is
3124                                                              being released.
3125
3126                                                          2. buffer/global/flat_atomic
3127      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3128                                                          2. s_waitcnt lgkmcnt(0)
3129
3130                                                            - If OpenCL, omit.
3131                                                            - Must happen before
3132                                                              any following
3133                                                              global/generic
3134                                                              load/load
3135                                                              atomic/store/store
3136                                                              atomic/atomicrmw.
3137                                                            - Ensures any
3138                                                              following global
3139                                                              data read is no
3140                                                              older than the load
3141                                                              atomic value being
3142                                                              acquired.
3143
3144      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3145
3146                                                            - If OpenCL, omit.
3147                                                            - Must happen after
3148                                                              any preceding
3149                                                              local/generic
3150                                                              load/store/load
3151                                                              atomic/store
3152                                                              atomic/atomicrmw.
3153                                                            - Must happen before
3154                                                              the following
3155                                                              atomicrmw.
3156                                                            - Ensures that all
3157                                                              memory operations
3158                                                              to local have
3159                                                              completed before
3160                                                              performing the
3161                                                              atomicrmw that is
3162                                                              being released.
3163
3164                                                          2. flat_atomic
3165                                                          3. s_waitcnt lgkmcnt(0)
3166
3167                                                            - If OpenCL, omit.
3168                                                            - Must happen before
3169                                                              any following
3170                                                              global/generic
3171                                                              load/load
3172                                                              atomic/store/store
3173                                                              atomic/atomicrmw.
3174                                                            - Ensures any
3175                                                              following global
3176                                                              data read is no
3177                                                              older than the load
3178                                                              atomic value being
3179                                                              acquired.
3180
3181      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3182                                - system                     vmcnt(0)
3183
3184                                                            - If OpenCL, omit
3185                                                              lgkmcnt(0).
3186                                                            - Could be split into
3187                                                              separate s_waitcnt
3188                                                              vmcnt(0) and
3189                                                              s_waitcnt
3190                                                              lgkmcnt(0) to allow
3191                                                              them to be
3192                                                              independently moved
3193                                                              according to the
3194                                                              following rules.
3195                                                            - s_waitcnt vmcnt(0)
3196                                                              must happen after
3197                                                              any preceding
3198                                                              global/generic
3199                                                              load/store/load
3200                                                              atomic/store
3201                                                              atomic/atomicrmw.
3202                                                            - s_waitcnt lgkmcnt(0)
3203                                                              must happen after
3204                                                              any preceding
3205                                                              local/generic
3206                                                              load/store/load
3207                                                              atomic/store
3208                                                              atomic/atomicrmw.
3209                                                            - Must happen before
3210                                                              the following
3211                                                              atomicrmw.
3212                                                            - Ensures that all
3213                                                              memory operations
3214                                                              to global have
3215                                                              completed before
3216                                                              performing the
3217                                                              atomicrmw that is
3218                                                              being released.
3219
3220                                                          2. buffer/global/flat_atomic
3221                                                          3. s_waitcnt vmcnt(0)
3222
3223                                                            - Must happen before
3224                                                              following
3225                                                              buffer_wbinvl1_vol.
3226                                                            - Ensures the
3227                                                              atomicrmw has
3228                                                              completed before
3229                                                              invalidating the
3230                                                              cache.
3231
3232                                                          4. buffer_wbinvl1_vol
3233
3234                                                            - Must happen before
3235                                                              any following
3236                                                              global/generic
3237                                                              load/load
3238                                                              atomic/atomicrmw.
3239                                                            - Ensures that
3240                                                              following loads
3241                                                              will not see stale
3242                                                              global data.
3243
3244      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3245                                - system                     vmcnt(0)
3246
3247                                                            - If OpenCL, omit
3248                                                              lgkmcnt(0).
3249                                                            - Could be split into
3250                                                              separate s_waitcnt
3251                                                              vmcnt(0) and
3252                                                              s_waitcnt
3253                                                              lgkmcnt(0) to allow
3254                                                              them to be
3255                                                              independently moved
3256                                                              according to the
3257                                                              following rules.
3258                                                            - s_waitcnt vmcnt(0)
3259                                                              must happen after
3260                                                              any preceding
3261                                                              global/generic
3262                                                              load/store/load
3263                                                              atomic/store
3264                                                              atomic/atomicrmw.
3265                                                            - s_waitcnt lgkmcnt(0)
3266                                                              must happen after
3267                                                              any preceding
3268                                                              local/generic
3269                                                              load/store/load
3270                                                              atomic/store
3271                                                              atomic/atomicrmw.
3272                                                            - Must happen before
3273                                                              the following
3274                                                              atomicrmw.
3275                                                            - Ensures that all
3276                                                              memory operations
3277                                                              to global have
3278                                                              completed before
3279                                                              performing the
3280                                                              atomicrmw that is
3281                                                              being released.
3282
3283                                                          2. flat_atomic
3284                                                          3. s_waitcnt vmcnt(0) &
3285                                                             lgkmcnt(0)
3286
3287                                                            - If OpenCL, omit
3288                                                              lgkmcnt(0).
3289                                                            - Must happen before
3290                                                              following
3291                                                              buffer_wbinvl1_vol.
3292                                                            - Ensures the
3293                                                              atomicrmw has
3294                                                              completed before
3295                                                              invalidating the
3296                                                              cache.
3297
3298                                                          4. buffer_wbinvl1_vol
3299
3300                                                            - Must happen before
3301                                                              any following
3302                                                              global/generic
3303                                                              load/load
3304                                                              atomic/atomicrmw.
3305                                                            - Ensures that
3306                                                              following loads
3307                                                              will not see stale
3308                                                              global data.
3309
3310      fence        acq_rel      - singlethread *none*     *none*
3311                                - wavefront
3312      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3313
3314                                                            - If OpenCL and
3315                                                              address space is
3316                                                              not generic, omit.
3317                                                            - However,
3318                                                              since LLVM
3319                                                              currently has no
3320                                                              address space on
3321                                                              the fence need to
3322                                                              conservatively
3323                                                              always generate
3324                                                              (see comment for
3325                                                              previous fence).
3326                                                            - Must happen after
3327                                                              any preceding
3328                                                              local/generic
3329                                                              load/load
3330                                                              atomic/store/store
3331                                                              atomic/atomicrmw.
3332                                                            - Must happen before
3333                                                              any following
3334                                                              global/generic
3335                                                              load/load
3336                                                              atomic/store/store
3337                                                              atomic/atomicrmw.
3338                                                            - Ensures that all
3339                                                              memory operations
3340                                                              to local have
3341                                                              completed before
3342                                                              performing any
3343                                                              following global
3344                                                              memory operations.
3345                                                            - Ensures that the
3346                                                              preceding
3347                                                              local/generic load
3348                                                              atomic/atomicrmw
3349                                                              with an equal or
3350                                                              wider sync scope
3351                                                              and memory ordering
3352                                                              stronger than
3353                                                              unordered (this is
3354                                                              termed the
3355                                                              acquire-fence-paired-atomic
3356                                                              ) has completed
3357                                                              before following
3358                                                              global memory
3359                                                              operations. This
3360                                                              satisfies the
3361                                                              requirements of
3362                                                              acquire.
3363                                                            - Ensures that all
3364                                                              previous memory
3365                                                              operations have
3366                                                              completed before a
3367                                                              following
3368                                                              local/generic store
3369                                                              atomic/atomicrmw
3370                                                              with an equal or
3371                                                              wider sync scope
3372                                                              and memory ordering
3373                                                              stronger than
3374                                                              unordered (this is
3375                                                              termed the
3376                                                              release-fence-paired-atomic
3377                                                              ). This satisfies the
3378                                                              requirements of
3379                                                              release.
3380
3381      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3382                                - system                     vmcnt(0)
3383
3384                                                            - If OpenCL and
3385                                                              address space is
3386                                                              not generic, omit
3387                                                              lgkmcnt(0).
3388                                                            - However, since LLVM
3389                                                              currently has no
3390                                                              address space on
3391                                                              the fence need to
3392                                                              conservatively
3393                                                              always generate
3394                                                              (see comment for
3395                                                              previous fence).
3396                                                            - Could be split into
3397                                                              separate s_waitcnt
3398                                                              vmcnt(0) and
3399                                                              s_waitcnt
3400                                                              lgkmcnt(0) to allow
3401                                                              them to be
3402                                                              independently moved
3403                                                              according to the
3404                                                              following rules.
3405                                                            - s_waitcnt vmcnt(0)
3406                                                              must happen after
3407                                                              any preceding
3408                                                              global/generic
3409                                                              load/store/load
3410                                                              atomic/store
3411                                                              atomic/atomicrmw.
3412                                                            - s_waitcnt lgkmcnt(0)
3413                                                              must happen after
3414                                                              any preceding
3415                                                              local/generic
3416                                                              load/store/load
3417                                                              atomic/store
3418                                                              atomic/atomicrmw.
3419                                                            - Must happen before
3420                                                              the following
3421                                                              buffer_wbinvl1_vol.
3422                                                            - Ensures that the
3423                                                              preceding
3424                                                              global/local/generic
3425                                                              load
3426                                                              atomic/atomicrmw
3427                                                              with an equal or
3428                                                              wider sync scope
3429                                                              and memory ordering
3430                                                              stronger than
3431                                                              unordered (this is
3432                                                              termed the
3433                                                              acquire-fence-paired-atomic
3434                                                              ) has completed
3435                                                              before invalidating
3436                                                              the cache. This
3437                                                              satisfies the
3438                                                              requirements of
3439                                                              acquire.
3440                                                            - Ensures that all
3441                                                              previous memory
3442                                                              operations have
3443                                                              completed before a
3444                                                              following
3445                                                              global/local/generic
3446                                                              store
3447                                                              atomic/atomicrmw
3448                                                              with an equal or
3449                                                              wider sync scope
3450                                                              and memory ordering
3451                                                              stronger than
3452                                                              unordered (this is
3453                                                              termed the
3454                                                              release-fence-paired-atomic
3455                                                              ). This satisfies the
3456                                                              requirements of
3457                                                              release.
3458
3459                                                          2. buffer_wbinvl1_vol
3460
3461                                                            - Must happen before
3462                                                              any following
3463                                                              global/generic
3464                                                              load/load
3465                                                              atomic/store/store
3466                                                              atomic/atomicrmw.
3467                                                            - Ensures that
3468                                                              following loads
3469                                                              will not see stale
3470                                                              global data. This
3471                                                              satisfies the
3472                                                              requirements of
3473                                                              acquire.
3474
3475      **Sequential Consistent Atomic**
3476      -----------------------------------------------------------------------------------
3477      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3478                                - wavefront    - local    load atomic acquire,
3479                                               - generic  except must generated
3480                                                          all instructions even
3481                                                          for OpenCL.*
3482      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3483                                               - generic
3484                                                            - Must
3485                                                              happen after
3486                                                              preceding
3487                                                              global/generic load
3488                                                              atomic/store
3489                                                              atomic/atomicrmw
3490                                                              with memory
3491                                                              ordering of seq_cst
3492                                                              and with equal or
3493                                                              wider sync scope.
3494                                                              (Note that seq_cst
3495                                                              fences have their
3496                                                              own s_waitcnt
3497                                                              lgkmcnt(0) and so do
3498                                                              not need to be
3499                                                              considered.)
3500                                                            - Ensures any
3501                                                              preceding
3502                                                              sequential
3503                                                              consistent local
3504                                                              memory instructions
3505                                                              have completed
3506                                                              before executing
3507                                                              this sequentially
3508                                                              consistent
3509                                                              instruction. This
3510                                                              prevents reordering
3511                                                              a seq_cst store
3512                                                              followed by a
3513                                                              seq_cst load. (Note
3514                                                              that seq_cst is
3515                                                              stronger than
3516                                                              acquire/release as
3517                                                              the reordering of
3518                                                              load acquire
3519                                                              followed by a store
3520                                                              release is
3521                                                              prevented by the
3522                                                              waitcnt of
3523                                                              the release, but
3524                                                              there is nothing
3525                                                              preventing a store
3526                                                              release followed by
3527                                                              load acquire from
3528                                                              competing out of
3529                                                              order.)
3530
3531                                                          2. *Following
3532                                                             instructions same as
3533                                                             corresponding load
3534                                                             atomic acquire,
3535                                                             except must generated
3536                                                             all instructions even
3537                                                             for OpenCL.*
3538      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3539                                                          load atomic acquire,
3540                                                          except must generated
3541                                                          all instructions even
3542                                                          for OpenCL.*
3543      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3544                                - system       - generic     vmcnt(0)
3545
3546                                                            - Could be split into
3547                                                              separate s_waitcnt
3548                                                              vmcnt(0)
3549                                                              and s_waitcnt
3550                                                              lgkmcnt(0) to allow
3551                                                              them to be
3552                                                              independently moved
3553                                                              according to the
3554                                                              following rules.
3555                                                            - waitcnt lgkmcnt(0)
3556                                                              must happen after
3557                                                              preceding
3558                                                              global/generic load
3559                                                              atomic/store
3560                                                              atomic/atomicrmw
3561                                                              with memory
3562                                                              ordering of seq_cst
3563                                                              and with equal or
3564                                                              wider sync scope.
3565                                                              (Note that seq_cst
3566                                                              fences have their
3567                                                              own s_waitcnt
3568                                                              lgkmcnt(0) and so do
3569                                                              not need to be
3570                                                              considered.)
3571                                                            - waitcnt vmcnt(0)
3572                                                              must happen after
3573                                                              preceding
3574                                                              global/generic load
3575                                                              atomic/store
3576                                                              atomic/atomicrmw
3577                                                              with memory
3578                                                              ordering of seq_cst
3579                                                              and with equal or
3580                                                              wider sync scope.
3581                                                              (Note that seq_cst
3582                                                              fences have their
3583                                                              own s_waitcnt
3584                                                              vmcnt(0) and so do
3585                                                              not need to be
3586                                                              considered.)
3587                                                            - Ensures any
3588                                                              preceding
3589                                                              sequential
3590                                                              consistent global
3591                                                              memory instructions
3592                                                              have completed
3593                                                              before executing
3594                                                              this sequentially
3595                                                              consistent
3596                                                              instruction. This
3597                                                              prevents reordering
3598                                                              a seq_cst store
3599                                                              followed by a
3600                                                              seq_cst load. (Note
3601                                                              that seq_cst is
3602                                                              stronger than
3603                                                              acquire/release as
3604                                                              the reordering of
3605                                                              load acquire
3606                                                              followed by a store
3607                                                              release is
3608                                                              prevented by the
3609                                                              waitcnt of
3610                                                              the release, but
3611                                                              there is nothing
3612                                                              preventing a store
3613                                                              release followed by
3614                                                              load acquire from
3615                                                              competing out of
3616                                                              order.)
3617
3618                                                          2. *Following
3619                                                             instructions same as
3620                                                             corresponding load
3621                                                             atomic acquire,
3622                                                             except must generated
3623                                                             all instructions even
3624                                                             for OpenCL.*
3625      store atomic seq_cst      - singlethread - global   *Same as corresponding
3626                                - wavefront    - local    store atomic release,
3627                                - workgroup    - generic  except must generated
3628                                                          all instructions even
3629                                                          for OpenCL.*
3630      store atomic seq_cst      - agent        - global   *Same as corresponding
3631                                - system       - generic  store atomic release,
3632                                                          except must generated
3633                                                          all instructions even
3634                                                          for OpenCL.*
3635      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3636                                - wavefront    - local    atomicrmw acq_rel,
3637                                - workgroup    - generic  except must generated
3638                                                          all instructions even
3639                                                          for OpenCL.*
3640      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3641                                - system       - generic  atomicrmw acq_rel,
3642                                                          except must generated
3643                                                          all instructions even
3644                                                          for OpenCL.*
3645      fence        seq_cst      - singlethread *none*     *Same as corresponding
3646                                - wavefront               fence acq_rel,
3647                                - workgroup               except must generated
3648                                - agent                   all instructions even
3649                                - system                  for OpenCL.*
3650      ============ ============ ============== ========== ===============================
3651
3652 The memory order also adds the single thread optimization constrains defined in
3653 table
3654 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3655
3656   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3657      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3658
3659      ============ ==============================================================
3660      LLVM Memory  Optimization Constraints
3661      Ordering
3662      ============ ==============================================================
3663      unordered    *none*
3664      monotonic    *none*
3665      acquire      - If a load atomic/atomicrmw then no following load/load
3666                     atomic/store/ store atomic/atomicrmw/fence instruction can
3667                     be moved before the acquire.
3668                   - If a fence then same as load atomic, plus no preceding
3669                     associated fence-paired-atomic can be moved after the fence.
3670      release      - If a store atomic/atomicrmw then no preceding load/load
3671                     atomic/store/ store atomic/atomicrmw/fence instruction can
3672                     be moved after the release.
3673                   - If a fence then same as store atomic, plus no following
3674                     associated fence-paired-atomic can be moved before the
3675                     fence.
3676      acq_rel      Same constraints as both acquire and release.
3677      seq_cst      - If a load atomic then same constraints as acquire, plus no
3678                     preceding sequentially consistent load atomic/store
3679                     atomic/atomicrmw/fence instruction can be moved after the
3680                     seq_cst.
3681                   - If a store atomic then the same constraints as release, plus
3682                     no following sequentially consistent load atomic/store
3683                     atomic/atomicrmw/fence instruction can be moved before the
3684                     seq_cst.
3685                   - If an atomicrmw/fence then same constraints as acq_rel.
3686      ============ ==============================================================
3687
3688 Trap Handler ABI
3689 ~~~~~~~~~~~~~~~~
3690
3691 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3692 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3693 the ``s_trap`` instruction with the following usage:
3694
3695   .. table:: AMDGPU Trap Handler for AMDHSA OS
3696      :name: amdgpu-trap-handler-for-amdhsa-os-table
3697
3698      =================== =============== =============== =======================
3699      Usage               Code Sequence   Trap Handler    Description
3700                                          Inputs
3701      =================== =============== =============== =======================
3702      reserved            ``s_trap 0x00``                 Reserved by hardware.
3703      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3704                                            ``queue_ptr`` ``debugtrap``
3705                                          ``VGPR0``:      intrinsic (not
3706                                            ``arg``       implemented).
3707      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3708                                            ``queue_ptr`` terminated and its
3709                                                          associated queue put
3710                                                          into the error state.
3711      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3712                                            ``queue_ptr`` installed handled
3713                                                          same as ``llvm.trap``.
3714      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3715                                                          breakpoints.
3716      debugger            ``s_trap 0x08``                 Reserved for debugger.
3717      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3718      debugger            ``s_trap 0xff``                 Reserved for debugger.
3719      =================== =============== =============== =======================
3720
3721 Unspecified OS
3722 --------------
3723
3724 This section provides code conventions used when the target triple OS is
3725 empty (see :ref:`amdgpu-target-triples`).
3726
3727 Trap Handler ABI
3728 ~~~~~~~~~~~~~~~~
3729
3730 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3731 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3732 instructions are handled as follows:
3733
3734   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3735      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3736
3737      =============== =============== ===========================================
3738      Usage           Code Sequence   Description
3739      =============== =============== ===========================================
3740      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3741      llvm.debugtrap  *none*          Compiler warning given that there is no
3742                                      trap handler installed.
3743      =============== =============== ===========================================
3744
3745 Source Languages
3746 ================
3747
3748 .. _amdgpu-opencl:
3749
3750 OpenCL
3751 ------
3752
3753 When generating code for the OpenCL language the target triple environment
3754 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3755
3756 When the language is OpenCL the following differences occur:
3757
3758 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3759 2. The AMDGPU backend adds additional arguments to the kernel.
3760 3. Additional metadata is generated
3761    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3762
3763 .. TODO
3764    Specify what affect this has. Hidden arguments added. Additional metadata
3765    generated.
3766
3767 .. _amdgpu-hcc:
3768
3769 HCC
3770 ---
3771
3772 When generating code for the OpenCL language the target triple environment
3773 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3774
3775 When the language is OpenCL the following differences occur:
3776
3777 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3778
3779 .. TODO
3780    Specify what affect this has.
3781
3782 Assembler
3783 ---------
3784
3785 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3786 It supports AMDGCN GFX6-GFX9.
3787
3788 This section describes general syntax for instructions and operands. For more
3789 information about instructions, their semantics and supported combinations of
3790 operands, refer to one of instruction set architecture manuals
3791 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3792
3793 An instruction has the following syntax (register operands are normally
3794 comma-separated while extra operands are space-separated):
3795
3796 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3797
3798 Operands
3799 ~~~~~~~~
3800
3801 The following syntax for register operands is supported:
3802
3803 * SGPR registers: s0, ... or s[0], ...
3804 * VGPR registers: v0, ... or v[0], ...
3805 * TTMP registers: ttmp0, ... or ttmp[0], ...
3806 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3807 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3808 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3809 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3810 * Register index expressions: v[2*2], s[1-1:2-1]
3811 * 'off' indicates that an operand is not enabled
3812
3813 The following extra operands are supported:
3814
3815 * offset, offset0, offset1
3816 * idxen, offen bits
3817 * glc, slc, tfe bits
3818 * waitcnt: integer or combination of counter values
3819 * VOP3 modifiers:
3820
3821   - abs (\| \|), neg (\-)
3822
3823 * DPP modifiers:
3824
3825   - row_shl, row_shr, row_ror, row_rol
3826   - row_mirror, row_half_mirror, row_bcast
3827   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3828   - row_mask, bank_mask, bound_ctrl
3829
3830 * SDWA modifiers:
3831
3832   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3833   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3834   - abs, neg, sext
3835
3836 Instruction Examples
3837 ~~~~~~~~~~~~~~~~~~~~
3838
3839 DS
3840 ~~
3841
3842 .. code-block:: nasm
3843
3844   ds_add_u32 v2, v4 offset:16
3845   ds_write_src2_b64 v2 offset0:4 offset1:8
3846   ds_cmpst_f32 v2, v4, v6
3847   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3848
3849
3850 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3851
3852 FLAT
3853 ++++
3854
3855 .. code-block:: nasm
3856
3857   flat_load_dword v1, v[3:4]
3858   flat_store_dwordx3 v[3:4], v[5:7]
3859   flat_atomic_swap v1, v[3:4], v5 glc
3860   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3861   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3862
3863 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3864
3865 MUBUF
3866 +++++
3867
3868 .. code-block:: nasm
3869
3870   buffer_load_dword v1, off, s[4:7], s1
3871   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3872   buffer_store_format_xy v[1:2], off, s[4:7], s1
3873   buffer_wbinvl1
3874   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3875
3876 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3877
3878 SMRD/SMEM
3879 +++++++++
3880
3881 .. code-block:: nasm
3882
3883   s_load_dword s1, s[2:3], 0xfc
3884   s_load_dwordx8 s[8:15], s[2:3], s4
3885   s_load_dwordx16 s[88:103], s[2:3], s4
3886   s_dcache_inv_vol
3887   s_memtime s[4:5]
3888
3889 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3890
3891 SOP1
3892 ++++
3893
3894 .. code-block:: nasm
3895
3896   s_mov_b32 s1, s2
3897   s_mov_b64 s[0:1], 0x80000000
3898   s_cmov_b32 s1, 200
3899   s_wqm_b64 s[2:3], s[4:5]
3900   s_bcnt0_i32_b64 s1, s[2:3]
3901   s_swappc_b64 s[2:3], s[4:5]
3902   s_cbranch_join s[4:5]
3903
3904 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3905
3906 SOP2
3907 ++++
3908
3909 .. code-block:: nasm
3910
3911   s_add_u32 s1, s2, s3
3912   s_and_b64 s[2:3], s[4:5], s[6:7]
3913   s_cselect_b32 s1, s2, s3
3914   s_andn2_b32 s2, s4, s6
3915   s_lshr_b64 s[2:3], s[4:5], s6
3916   s_ashr_i32 s2, s4, s6
3917   s_bfm_b64 s[2:3], s4, s6
3918   s_bfe_i64 s[2:3], s[4:5], s6
3919   s_cbranch_g_fork s[4:5], s[6:7]
3920
3921 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3922
3923 SOPC
3924 ++++
3925
3926 .. code-block:: nasm
3927
3928   s_cmp_eq_i32 s1, s2
3929   s_bitcmp1_b32 s1, s2
3930   s_bitcmp0_b64 s[2:3], s4
3931   s_setvskip s3, s5
3932
3933 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3934
3935 SOPP
3936 ++++
3937
3938 .. code-block:: nasm
3939
3940   s_barrier
3941   s_nop 2
3942   s_endpgm
3943   s_waitcnt 0 ; Wait for all counters to be 0
3944   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3945   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3946   s_sethalt 9
3947   s_sleep 10
3948   s_sendmsg 0x1
3949   s_sendmsg sendmsg(MSG_INTERRUPT)
3950   s_trap 1
3951
3952 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3953
3954 Unless otherwise mentioned, little verification is performed on the operands
3955 of SOPP Instructions, so it is up to the programmer to be familiar with the
3956 range or acceptable values.
3957
3958 VALU
3959 ++++
3960
3961 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3962 the assembler will automatically use optimal encoding based on its operands.
3963 To force specific encoding, one can add a suffix to the opcode of the instruction:
3964
3965 * _e32 for 32-bit VOP1/VOP2/VOPC
3966 * _e64 for 64-bit VOP3
3967 * _dpp for VOP_DPP
3968 * _sdwa for VOP_SDWA
3969
3970 VOP1/VOP2/VOP3/VOPC examples:
3971
3972 .. code-block:: nasm
3973
3974   v_mov_b32 v1, v2
3975   v_mov_b32_e32 v1, v2
3976   v_nop
3977   v_cvt_f64_i32_e32 v[1:2], v2
3978   v_floor_f32_e32 v1, v2
3979   v_bfrev_b32_e32 v1, v2
3980   v_add_f32_e32 v1, v2, v3
3981   v_mul_i32_i24_e64 v1, v2, 3
3982   v_mul_i32_i24_e32 v1, -3, v3
3983   v_mul_i32_i24_e32 v1, -100, v3
3984   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3985   v_max_f16_e32 v1, v2, v3
3986
3987 VOP_DPP examples:
3988
3989 .. code-block:: nasm
3990
3991   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3992   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3993   v_mov_b32 v0, v0 wave_shl:1
3994   v_mov_b32 v0, v0 row_mirror
3995   v_mov_b32 v0, v0 row_bcast:31
3996   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3997   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3998   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3999
4000 VOP_SDWA examples:
4001
4002 .. code-block:: nasm
4003
4004   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4005   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4006   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4007   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4008   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4009
4010 For full list of supported instructions, refer to "Vector ALU instructions".
4011
4012 HSA Code Object Directives
4013 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4014
4015 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4016 one can specify them with assembler directives.
4017
4018 .hsa_code_object_version major, minor
4019 +++++++++++++++++++++++++++++++++++++
4020
4021 *major* and *minor* are integers that specify the version of the HSA code
4022 object that will be generated by the assembler.
4023
4024 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4025 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4026
4027
4028 *major*, *minor*, and *stepping* are all integers that describe the instruction
4029 set architecture (ISA) version of the assembly program.
4030
4031 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4032 "AMD" and *arch* should always be equal to "AMDGPU".
4033
4034 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4035 from the value of the -mcpu option that is passed to the assembler.
4036
4037 .amdgpu_hsa_kernel (name)
4038 +++++++++++++++++++++++++
4039
4040 This directives specifies that the symbol with given name is a kernel entry point
4041 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4042
4043 .amd_kernel_code_t
4044 ++++++++++++++++++
4045
4046 This directive marks the beginning of a list of key / value pairs that are used
4047 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4048 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
4049 any amd_kernel_code_t values that are unspecified a default value will be
4050 used.  The default value for all keys is 0, with the following exceptions:
4051
4052 - *kernel_code_version_major* defaults to 1.
4053 - *machine_kind* defaults to 1.
4054 - *machine_version_major*, *machine_version_minor*, and
4055   *machine_version_stepping* are derived from the value of the -mcpu option
4056   that is passed to the assembler.
4057 - *kernel_code_entry_byte_offset* defaults to 256.
4058 - *wavefront_size* defaults to 6.
4059 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4060   *private_segment_alignment* default to 4. Note that alignments are specified
4061   as a power of two, so a value of **n** means an alignment of 2^ **n**.
4062
4063 The *.amd_kernel_code_t* directive must be placed immediately after the
4064 function label and before any instructions.
4065
4066 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4067 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4068
4069 Here is an example of a minimal amd_kernel_code_t specification:
4070
4071 .. code-block:: none
4072
4073    .hsa_code_object_version 1,0
4074    .hsa_code_object_isa
4075
4076    .hsatext
4077    .globl  hello_world
4078    .p2align 8
4079    .amdgpu_hsa_kernel hello_world
4080
4081    hello_world:
4082
4083       .amd_kernel_code_t
4084          enable_sgpr_kernarg_segment_ptr = 1
4085          is_ptr64 = 1
4086          compute_pgm_rsrc1_vgprs = 0
4087          compute_pgm_rsrc1_sgprs = 0
4088          compute_pgm_rsrc2_user_sgpr = 2
4089          kernarg_segment_byte_size = 8
4090          wavefront_sgpr_count = 2
4091          workitem_vgpr_count = 3
4092      .end_amd_kernel_code_t
4093
4094      s_load_dwordx2 s[0:1], s[0:1] 0x0
4095      v_mov_b32 v0, 3.14159
4096      s_waitcnt lgkmcnt(0)
4097      v_mov_b32 v1, s0
4098      v_mov_b32 v2, s1
4099      flat_store_dword v[1:2], v0
4100      s_endpgm
4101    .Lfunc_end0:
4102         .size   hello_world, .Lfunc_end0-hello_world
4103
4104 Additional Documentation
4105 ========================
4106
4107 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4108 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4109 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4110 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4111 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4112 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4113 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4114 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4115 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4116 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4117 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4118 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4119 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4120 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4121 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4122 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__