docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Architectures
  27      :name: amdgpu-architecture-table
  28
  29      ============ ==============================================================
  30      Architecture Description
  31      ============ ==============================================================
  32      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  33      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  34      ============ ==============================================================
  35
  36   .. table:: AMDGPU Vendors
  37      :name: amdgpu-vendor-table
  38
  39      ============ ==============================================================
  40      Vendor       Description
  41      ============ ==============================================================
  42      ``amd``      Can be used for all AMD GPU usage.
  43      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  44      ============ ==============================================================
  45
  46   .. table:: AMDGPU Operating Systems
  47      :name: amdgpu-os-table
  48
  49      ============== ============================================================
  50      OS             Description
  51      ============== ============================================================
  52      *<empty>*      Defaults to the *unknown* OS.
  53      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  54                     such as AMD's ROCm [AMD-ROCm]_.
  55      ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
  56                     runtime.
  57      ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
  58                     runtime.
  59      ============== ============================================================
  60
  61   .. table:: AMDGPU Environments
  62      :name: amdgpu-environment-table
  63
  64      ============ ==============================================================
  65      Environment  Description
  66      ============ ==============================================================
  67      *<empty>*    Defaults to ``opencl``.
  68      ``opencl``   OpenCL compute kernel (see :ref:`amdgpu-opencl`).
  69      ``hcc``      AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
  70      ============ ==============================================================
  71
  72 .. _amdgpu-processors:
  73
  74 Processors
  75 ----------
  76
  77 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  78 names from both the *Processor* and *Alternative Processor* can be used.
  79
  80   .. table:: AMDGPU Processors
  81      :name: amdgpu-processor-table
  82
  83      =========== =============== ============ ===== ========= ======= ==================
  84      Processor   Alternative     Target       dGPU/ Target    ROCm    Example
  85                  Processor       Triple       APU   Features  Support Products
  86                                  Architecture       Supported
  87                                                     [Default]
  88      =========== =============== ============ ===== ========= ======= ==================
  89      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
  90      -----------------------------------------------------------------------------------
  91      ``r600``                    ``r600``     dGPU
  92      ``r630``                    ``r600``     dGPU
  93      ``rs880``                   ``r600``     dGPU
  94      ``rv670``                   ``r600``     dGPU
  95      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
  96      -----------------------------------------------------------------------------------
  97      ``rv710``                   ``r600``     dGPU
  98      ``rv730``                   ``r600``     dGPU
  99      ``rv770``                   ``r600``     dGPU
 100      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
 101      -----------------------------------------------------------------------------------
 102      ``cedar``                   ``r600``     dGPU
 103      ``cypress``                 ``r600``     dGPU
 104      ``juniper``                 ``r600``     dGPU
 105      ``redwood``                 ``r600``     dGPU
 106      ``sumo``                    ``r600``     dGPU
 107      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 108      -----------------------------------------------------------------------------------
 109      ``barts``                   ``r600``     dGPU
 110      ``caicos``                  ``r600``     dGPU
 111      ``cayman``                  ``r600``     dGPU
 112      ``turks``                   ``r600``     dGPU
 113      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 114      -----------------------------------------------------------------------------------
 115      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
 116      ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
 117                  - ``oland``
 118                  - ``pitcairn``
 119                  - ``verde``
 120      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 121      -----------------------------------------------------------------------------------
 122      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
 123                                                                       - A6 Pro-7050B
 124                                                                       - A8-7100
 125                                                                       - A8 Pro-7150B
 126                                                                       - A10-7300
 127                                                                       - A10 Pro-7350B
 128                                                                       - FX-7500
 129                                                                       - A8-7200P
 130                                                                       - A10-7400P
 131                                                                       - FX-7600P
 132      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
 133                                                                       - FirePro W9100
 134                                                                       - FirePro S9150
 135                                                                       - FirePro S9170
 136      ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
 137                                                                       - Radeon R9 290x
 138                                                                       - Radeon R390
 139                                                                       - Radeon R390x
 140      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
 141                  - ``mullins``                                        - E1-2200
 142                                                                       - E1-2500
 143                                                                       - E2-3000
 144                                                                       - E2-3800
 145                                                                       - A4-5000
 146                                                                       - A4-5100
 147                                                                       - A6-5200
 148                                                                       - A4 Pro-3340B
 149      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
 150                                                                       - Radeon HD 8770
 151                                                                       - R7 260
 152                                                                       - R7 260X
 153      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 154      -----------------------------------------------------------------------------------
 155      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
 156                                                       [on]            - Pro A6-8500B
 157                                                                       - A8-8600P
 158                                                                       - Pro A8-8600B
 159                                                                       - FX-8800P
 160                                                                       - Pro A12-8800B
 161      \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
 162                                                       [on]            - Pro A10-8700B
 163                                                                       - A10-8780P
 164      \                           ``amdgcn``   APU   - xnack           - A10-9600P
 165                                                       [on]            - A10-9630P
 166                                                                       - A12-9700P
 167                                                                       - A12-9730P
 168                                                                       - FX-9800P
 169                                                                       - FX-9830P
 170      \                           ``amdgcn``   APU   - xnack           - E2-9010
 171                                                       [on]            - A6-9210
 172                                                                       - A9-9410
 173      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
 174                  - ``tonga``                          [off]           - FirePro S7100
 175                                                                       - FirePro W7100
 176                                                                       - Radeon R285
 177                                                                       - Radeon R9 380
 178                                                                       - Radeon R9 385
 179                                                                       - Mobile FirePro
 180                                                                         M7170
 181      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
 182                                                       [off]           - Radeon R9 Fury
 183                                                                       - Radeon R9 FuryX
 184                                                                       - Radeon Pro Duo
 185                                                                       - FirePro S9300x2
 186                                                                       - Radeon Instinct MI8
 187      \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
 188                                                       [off]           - Radeon RX 480
 189                                                                       - Radeon Instinct MI6
 190      \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
 191                                                       [off]
 192      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
 193                                                       [on]
 194      **GCN GFX9** [AMD-GCN-GFX9]_
 195      -----------------------------------------------------------------------------------
 196      ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
 197                                                       [off]             Frontier Edition
 198                                                                       - Radeon RX Vega 56
 199                                                                       - Radeon RX Vega 64
 200                                                                       - Radeon RX Vega 64
 201                                                                         Liquid
 202                                                                       - Radeon Instinct MI25
 203      ``gfx902``                  ``amdgcn``   APU   - xnack           *TBA*
 204                                                       [on]
 205                                                                       .. TODO
 206                                                                          Add product
 207                                                                          names.
 208      =========== =============== ============ ===== ========= ======= ==================
 209
 210 .. _amdgpu-target-features:
 211
 212 Target Features
 213 ---------------
 214
 215 Target features control how code is generated to support certain
 216 processor specific features. Not all target features are supported by
 217 all processors. The runtime must ensure that the features supported by
 218 the device used to execute the code match the features enabled when
 219 generating the code. A mismatch of features may result in incorrect
 220 execution, or a reduction in performance.
 221
 222 The target features supported by each processor, and the default value
 223 used if not specified explicitly, is listed in
 224 :ref:`amdgpu-processor-table`.
 225
 226 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
 227 target features.
 228
 229 For example:
 230
 231 ``-mxnack``
 232   Enable the ``xnack`` feature.
 233 ``-mno-xnack``
 234   Disable the ``xnack`` feature.
 235
 236   .. table:: AMDGPU Target Features
 237      :name: amdgpu-target-feature-table
 238
 239      ============== ==================================================
 240      Target Feature Description
 241      ============== ==================================================
 242      -m[no-]xnack   Enable/disable generating code that has
 243                     memory clauses that are compatible with
 244                     having XNACK replay enabled.
 245
 246                     This is used for demand paging and page
 247                     migration. If XNACK replay is enabled in
 248                     the device, then if a page fault occurs
 249                     the code may execute incorrectly if the
 250                     ``xnack`` feature is not enabled. Executing
 251                     code that has the feature enabled on a
 252                     device that does not have XNACK replay
 253                     enabled will execute correctly, but may
 254                     be less performant than code with the
 255                     feature disabled.
 256      ============== ==================================================
 257
 258 .. _amdgpu-address-spaces:
 259
 260 Address Spaces
 261 --------------
 262
 263 The AMDGPU backend uses the following address space mappings.
 264
 265 The memory space names used in the table, aside from the region memory space, is
 266 from the OpenCL standard.
 267
 268 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 269
 270   .. table:: Address Space Mapping
 271      :name: amdgpu-address-space-mapping-table
 272
 273      ================== =================
 274      LLVM Address Space Memory Space
 275      ================== =================
 276      0                  Generic (Flat)
 277      1                  Global
 278      2                  Region (GDS)
 279      3                  Local (group/LDS)
 280      4                  Constant
 281      5                  Private (Scratch)
 282      6                  Constant 32-bit
 283      ================== =================
 284
 285 .. _amdgpu-memory-scopes:
 286
 287 Memory Scopes
 288 -------------
 289
 290 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 291 backend memory model when the target triple OS is ``amdhsa`` (see
 292 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 293
 294 The memory model supported is based on the HSA memory model [HSA]_ which is
 295 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 296 relation is transitive over the synchonizes-with relation independent of scope,
 297 and synchonizes-with allows the memory scope instances to be inclusive (see
 298 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 299
 300 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 301 inclusion and requires the memory scopes to exactly match. However, this
 302 is conservatively correct for OpenCL.
 303
 304   .. table:: AMDHSA LLVM Sync Scopes
 305      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 306
 307      ================ ==========================================================
 308      LLVM Sync Scope  Description
 309      ================ ==========================================================
 310      *none*           The default: ``system``.
 311
 312                       Synchronizes with, and participates in modification and
 313                       seq_cst total orderings with, other operations (except
 314                       image operations) for all address spaces (except private,
 315                       or generic that accesses private) provided the other
 316                       operation's sync scope is:
 317
 318                       - ``system``.
 319                       - ``agent`` and executed by a thread on the same agent.
 320                       - ``workgroup`` and executed by a thread in the same
 321                         workgroup.
 322                       - ``wavefront`` and executed by a thread in the same
 323                         wavefront.
 324
 325      ``agent``        Synchronizes with, and participates in modification and
 326                       seq_cst total orderings with, other operations (except
 327                       image operations) for all address spaces (except private,
 328                       or generic that accesses private) provided the other
 329                       operation's sync scope is:
 330
 331                       - ``system`` or ``agent`` and executed by a thread on the
 332                         same agent.
 333                       - ``workgroup`` and executed by a thread in the same
 334                         workgroup.
 335                       - ``wavefront`` and executed by a thread in the same
 336                         wavefront.
 337
 338      ``workgroup``    Synchronizes with, and participates in modification and
 339                       seq_cst total orderings with, other operations (except
 340                       image operations) for all address spaces (except private,
 341                       or generic that accesses private) provided the other
 342                       operation's sync scope is:
 343
 344                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 345                         thread in the same workgroup.
 346                       - ``wavefront`` and executed by a thread in the same
 347                         wavefront.
 348
 349      ``wavefront``    Synchronizes with, and participates in modification and
 350                       seq_cst total orderings with, other operations (except
 351                       image operations) for all address spaces (except private,
 352                       or generic that accesses private) provided the other
 353                       operation's sync scope is:
 354
 355                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 356                         and executed by a thread in the same wavefront.
 357
 358      ``singlethread`` Only synchronizes with, and participates in modification
 359                       and seq_cst total orderings with, other operations (except
 360                       image operations) running in the same thread for all
 361                       address spaces (for example, in signal handlers).
 362      ================ ==========================================================
 363
 364 AMDGPU Intrinsics
 365 -----------------
 366
 367 The AMDGPU backend implements the following intrinsics.
 368
 369 *This section is WIP.*
 370
 371 .. TODO
 372    List AMDGPU intrinsics
 373
 374 Code Object
 375 ===========
 376
 377 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 378 can be linked by ``lld`` to produce a standard ELF shared code object which can
 379 be loaded and executed on an AMDGPU target.
 380
 381 Header
 382 ------
 383
 384 The AMDGPU backend uses the following ELF header:
 385
 386   .. table:: AMDGPU ELF Header
 387      :name: amdgpu-elf-header-table
 388
 389      ========================== ===============================
 390      Field                      Value
 391      ========================== ===============================
 392      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 393      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 394      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 395                                 - ``ELFOSABI_AMDGPU_HSA``
 396                                 - ``ELFOSABI_AMDGPU_PAL``
 397                                 - ``ELFOSABI_AMDGPU_MESA3D``
 398      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
 399                                 - ``ELFABIVERSION_AMDGPU_PAL``
 400                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
 401      ``e_type``                 - ``ET_REL``
 402                                 - ``ET_DYN``
 403      ``e_machine``              ``EM_AMDGPU``
 404      ``e_entry``                0
 405      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
 406      ========================== ===============================
 407
 408 ..
 409
 410   .. table:: AMDGPU ELF Header Enumeration Values
 411      :name: amdgpu-elf-header-enumeration-values-table
 412
 413      =============================== =====
 414      Name                            Value
 415      =============================== =====
 416      ``EM_AMDGPU``                   224
 417      ``ELFOSABI_NONE``               0
 418      ``ELFOSABI_AMDGPU_HSA``         64
 419      ``ELFOSABI_AMDGPU_PAL``         65
 420      ``ELFOSABI_AMDGPU_MESA3D``      66
 421      ``ELFABIVERSION_AMDGPU_HSA``    1
 422      ``ELFABIVERSION_AMDGPU_PAL``    0
 423      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 424      =============================== =====
 425
 426 ``e_ident[EI_CLASS]``
 427   The ELF class is:
 428
 429   * ``ELFCLASS32`` for ``r600`` architecture.
 430
 431   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
 432     bit applications.
 433
 434 ``e_ident[EI_DATA]``
 435   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 436
 437 ``e_ident[EI_OSABI]``
 438   One of the following AMD GPU architecture specific OS ABIs
 439   (see :ref:`amdgpu-os-table`):
 440
 441   * ``ELFOSABI_NONE`` for *unknown* OS.
 442
 443   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
 444
 445   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
 446
 447   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
 448
 449 ``e_ident[EI_ABIVERSION]``
 450   The ABI version of the AMD GPU architecture specific OS ABI to which the code
 451   object conforms:
 452
 453   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 454     runtime ABI.
 455
 456   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 457     runtime ABI.
 458
 459   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
 460     3D runtime ABI.
 461
 462 ``e_type``
 463   Can be one of the following values:
 464
 465
 466   ``ET_REL``
 467     The type produced by the AMD GPU backend compiler as it is relocatable code
 468     object.
 469
 470   ``ET_DYN``
 471     The type produced by the linker as it is a shared code object.
 472
 473   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 474
 475 ``e_machine``
 476   The value ``EM_AMDGPU`` is used for the machine for all processors supported
 477   by the ``r600`` and ``amdgcn`` architectures (see
 478   :ref:`amdgpu-processor-table`). The specific processor is specified in the
 479   ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
 480   :ref:`amdgpu-elf-header-e_flags-table`).
 481
 482 ``e_entry``
 483   The entry point is 0 as the entry points for individual kernels must be
 484   selected in order to invoke them through AQL packets.
 485
 486 ``e_flags``
 487   The AMDGPU backend uses the following ELF header flags:
 488
 489   .. table:: AMDGPU ELF Header ``e_flags``
 490      :name: amdgpu-elf-header-e_flags-table
 491
 492      ================================= ========== =============================
 493      Name                              Value      Description
 494      ================================= ========== =============================
 495      **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
 496      -------------------------------------------- -----------------------------
 497      ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
 498                                                   mask for
 499                                                   ``EF_AMDGPU_MACH_xxx`` values
 500                                                   defined in
 501                                                   :ref:`amdgpu-ef-amdgpu-mach-table`.
 502      ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
 503                                                   target feature is
 504                                                   enabled for all code
 505                                                   contained in the code object.
 506                                                   If the processor
 507                                                   does not support the
 508                                                   ``xnack`` target
 509                                                   feature then must
 510                                                   be 0.
 511                                                   See
 512                                                   :ref:`amdgpu-target-features`.
 513      ================================= ========== =============================
 514
 515   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
 516      :name: amdgpu-ef-amdgpu-mach-table
 517
 518      ================================= ========== =============================
 519      Name                              Value      Description (see
 520                                                   :ref:`amdgpu-processor-table`)
 521      ================================= ========== =============================
 522      ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
 523      ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
 524      ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
 525      ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
 526      ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
 527      ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
 528      ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
 529      ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
 530      ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
 531      ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
 532      ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
 533      ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
 534      ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
 535      ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
 536      ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
 537      ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
 538      ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
 539      *reserved*                        0x011 -    Reserved for ``r600``
 540                                        0x01f      architecture processors.
 541      ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
 542      ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
 543      ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
 544      ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
 545      ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
 546      ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
 547      ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
 548      *reserved*                        0x027      Reserved.
 549      ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
 550      ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
 551      ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
 552      ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
 553      ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
 554      ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
 555      *reserved*                        0x02e      Reserved.
 556      *reserved*                        0x02f      Reserved.
 557      *reserved*                        0x030      Reserved.
 558      ================================= ========== =============================
 559
 560 Sections
 561 --------
 562
 563 An AMDGPU target ELF code object has the standard ELF sections which include:
 564
 565   .. table:: AMDGPU ELF Sections
 566      :name: amdgpu-elf-sections-table
 567
 568      ================== ================ =================================
 569      Name               Type             Attributes
 570      ================== ================ =================================
 571      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 572      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 573      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 574      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 575      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 576      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 577      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 578      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 579      ``.note``          ``SHT_NOTE``     *none*
 580      ``.rela``\ *name*  ``SHT_RELA``     *none*
 581      ``.rela.dyn``      ``SHT_RELA``     *none*
 582      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 583      ``.shstrtab``      ``SHT_STRTAB``   *none*
 584      ``.strtab``        ``SHT_STRTAB``   *none*
 585      ``.symtab``        ``SHT_SYMTAB``   *none*
 586      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 587      ================== ================ =================================
 588
 589 These sections have their standard meanings (see [ELF]_) and are only generated
 590 if needed.
 591
 592 ``.debug``\ *\**
 593   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 594   DWARF produced by the AMDGPU backend.
 595
 596 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 597   The standard sections used by a dynamic loader.
 598
 599 ``.note``
 600   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 601   backend.
 602
 603 ``.rela``\ *name*, ``.rela.dyn``
 604   For relocatable code objects, *name* is the name of the section that the
 605   relocation records apply. For example, ``.rela.text`` is the section name for
 606   relocation records associated with the ``.text`` section.
 607
 608   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 609   records from each of the relocatable code object's ``.rela``\ *name* sections.
 610
 611   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 612   the AMDGPU backend.
 613
 614 ``.text``
 615   The executable machine code for the kernels and functions they call. Generated
 616   as position independent code. See :ref:`amdgpu-code-conventions` for
 617   information on conventions used in the isa generation.
 618
 619 .. _amdgpu-note-records:
 620
 621 Note Records
 622 ------------
 623
 624 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
 625 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
 626 aligned. In addition, minimal zero byte padding must be generated to ensure the
 627 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
 628 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
 629
 630 The AMDGPU backend code object uses the following ELF note records in the
 631 ``.note`` section. The *Description* column specifies the layout of the note
 632 record's ``desc`` field. All fields are consecutive bytes. Note records with
 633 variable size strings have a corresponding ``*_size`` field that specifies the
 634 number of bytes, including the terminating null character, in the string. The
 635 string(s) come immediately after the preceding fields.
 636
 637 Additional note records can be present.
 638
 639   .. table:: AMDGPU ELF Note Records
 640      :name: amdgpu-elf-note-records-table
 641
 642      ===== ============================== ======================================
 643      Name  Type                           Description
 644      ===== ============================== ======================================
 645      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 646      ===== ============================== ======================================
 647
 648 ..
 649
 650   .. table:: AMDGPU ELF Note Record Enumeration Values
 651      :name: amdgpu-elf-note-record-enumeration-values-table
 652
 653      ============================== =====
 654      Name                           Value
 655      ============================== =====
 656      *reserved*                       0-9
 657      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 658      *reserved*                        11
 659      ============================== =====
 660
 661 ``NT_AMD_AMDGPU_HSA_METADATA``
 662   Specifies extensible metadata associated with the code objects executed on HSA
 663   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 664   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 665   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
 666   object metadata string.
 667
 668 .. _amdgpu-symbols:
 669
 670 Symbols
 671 -------
 672
 673 Symbols include the following:
 674
 675   .. table:: AMDGPU ELF Symbols
 676      :name: amdgpu-elf-symbols-table
 677
 678      ===================== ============== ============= ==================
 679      Name                  Type           Section       Description
 680      ===================== ============== ============= ==================
 681      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
 682                                           - ``.rodata``
 683                                           - ``.bss``
 684      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
 685      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
 686      ===================== ============== ============= ==================
 687
 688 Global variable
 689   Global variables both used and defined by the compilation unit.
 690
 691   If the symbol is defined in the compilation unit then it is allocated in the
 692   appropriate section according to if it has initialized data or is readonly.
 693
 694   If the symbol is external then its section is ``STN_UNDEF`` and the loader
 695   will resolve relocations using the definition provided by another code object
 696   or explicitly defined by the runtime.
 697
 698   All global symbols, whether defined in the compilation unit or external, are
 699   accessed by the machine code indirectly through a GOT table entry. This
 700   allows them to be preemptable. The GOT table is only supported when the target
 701   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 702
 703   .. TODO
 704      Add description of linked shared object symbols. Seems undefined symbols
 705      are marked as STT_NOTYPE.
 706
 707 Kernel descriptor
 708   Every HSA kernel has an associated kernel descriptor. It is the address of the
 709   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 710   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 711   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 712
 713 Kernel entry point
 714   Every HSA kernel also has a symbol for its machine code entry point.
 715
 716 .. _amdgpu-relocation-records:
 717
 718 Relocation Records
 719 ------------------
 720
 721 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 722 relocatable fields are:
 723
 724 ``word32``
 725   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 726   alignment. These values use the same byte order as other word values in the
 727   AMD GPU architecture.
 728
 729 ``word64``
 730   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 731   alignment. These values use the same byte order as other word values in the
 732   AMD GPU architecture.
 733
 734 Following notations are used for specifying relocation calculations:
 735
 736 **A**
 737   Represents the addend used to compute the value of the relocatable field.
 738
 739 **G**
 740   Represents the offset into the global offset table at which the relocation
 741   entry's symbol will reside during execution.
 742
 743 **GOT**
 744   Represents the address of the global offset table.
 745
 746 **P**
 747   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
 748   of the storage unit being relocated (computed using ``r_offset``).
 749
 750 **S**
 751   Represents the value of the symbol whose index resides in the relocation
 752   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
 753
 754 **B**
 755   Represents the base address of a loaded executable or shared object which is
 756   the difference between the ELF address and the actual load address. Relocations
 757   using this are only valid in executable or shared objects.
 758
 759 The following relocation types are supported:
 760
 761   .. table:: AMDGPU ELF Relocation Records
 762      :name: amdgpu-elf-relocation-records-table
 763
 764      ========================== ======= =====  ==========  ==============================
 765      Relocation Type            Kind    Value  Field       Calculation
 766      ========================== ======= =====  ==========  ==============================
 767      ``R_AMDGPU_NONE``                  0      *none*      *none*
 768      ``R_AMDGPU_ABS32_LO``      Dynamic 1      ``word32``  (S + A) & 0xFFFFFFFF
 769      ``R_AMDGPU_ABS32_HI``      Dynamic 2      ``word32``  (S + A) >> 32
 770      ``R_AMDGPU_ABS64``         Dynamic 3      ``word64``  S + A
 771      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
 772      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
 773      ``R_AMDGPU_ABS32``         Static  6      ``word32``  S + A
 774      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
 775      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 776      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
 777      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
 778      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
 779      *reserved*                         12
 780      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
 781      ========================== ======= =====  ==========  ==============================
 782
 783 .. _amdgpu-dwarf:
 784
 785 DWARF
 786 -----
 787
 788 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
 789 information that maps the code object executable code and data to the source
 790 language constructs. It can be used by tools such as debuggers and profilers.
 791
 792 Address Space Mapping
 793 ~~~~~~~~~~~~~~~~~~~~~
 794
 795 The following address space mapping is used:
 796
 797   .. table:: AMDGPU DWARF Address Space Mapping
 798      :name: amdgpu-dwarf-address-space-mapping-table
 799
 800      =================== =================
 801      DWARF Address Space Memory Space
 802      =================== =================
 803      1                   Private (Scratch)
 804      2                   Local (group/LDS)
 805      *omitted*           Global
 806      *omitted*           Constant
 807      *omitted*           Generic (Flat)
 808      *not supported*     Region (GDS)
 809      =================== =================
 810
 811 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
 812 used in the table.
 813
 814 An ``address_class`` attribute is generated on pointer type DIEs to specify the
 815 DWARF address space of the value of the pointer when it is in the *private* or
 816 *local* address space. Otherwise the attribute is omitted.
 817
 818 An ``XDEREF`` operation is generated in location list expressions for variables
 819 that are allocated in the *private* and *local* address space. Otherwise no
 820 ``XDREF`` is omitted.
 821
 822 Register Mapping
 823 ~~~~~~~~~~~~~~~~
 824
 825 *This section is WIP.*
 826
 827 .. TODO
 828    Define DWARF register enumeration.
 829
 830    If want to present a wavefront state then should expose vector registers as
 831    64 wide (rather than per work-item view that LLVM uses). Either as separate
 832    registers, or a 64x4 byte single register. In either case use a new LANE op
 833    (akin to XDREF) to select the current lane usage in a location
 834    expression. This would also allow scalar register spilling to vector register
 835    lanes to be expressed (currently no debug information is being generated for
 836    spilling). If choose a wide single register approach then use LANE in
 837    conjunction with PIECE operation to select the dword part of the register for
 838    the current lane. If the separate register approach then use LANE to select
 839    the register.
 840
 841 Source Text
 842 ~~~~~~~~~~~
 843
 844 Source text for online-compiled programs (e.g. those compiled by the OpenCL
 845 runtime) may be embedded into the DWARF v5 line table using the ``clang
 846 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
 847
 848 For example:
 849
 850 ``-gembed-source``
 851   Enable the embedded source DWARF v5 extension.
 852 ``-gno-embed-source``
 853   Disable the embedded source DWARF v5 extension.
 854
 855   .. table:: AMDGPU Debug Options
 856      :name: amdgpu-debug-options
 857
 858      ==================== ==================================================
 859      Debug Flag           Description
 860      ==================== ==================================================
 861      -g[no-]embed-source  Enable/disable embedding source text in DWARF
 862                           debug sections. Useful for environments where
 863                           source cannot be written to disk, such as
 864                           when performing online compilation.
 865      ==================== ==================================================
 866
 867 This option enables one extended content types in the DWARF v5 Line Number
 868 Program Header, which is used to encode embedded source.
 869
 870   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
 871      :name: amdgpu-dwarf-extended-content-types
 872
 873      ============================  ======================
 874      Content Type                  Form
 875      ============================  ======================
 876      ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
 877      ============================  ======================
 878
 879 The source field will contain the UTF-8 encoded, null-terminated source text
 880 with ``'\n'`` line endings. When the source field is present, consumers can use
 881 the embedded source instead of attempting to discover the source on disk. When
 882 the source field is absent, consumers can access the file to get the source
 883 text.
 884
 885 The above content type appears in the ``file_name_entry_format`` field of the
 886 line table prologue, and its corresponding value appear in the ``file_names``
 887 field. The current encoding of the content type is documented in table
 888 :ref:`amdgpu-dwarf-extended-content-types-encoding`
 889
 890   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
 891      :name: amdgpu-dwarf-extended-content-types-encoding
 892
 893      ============================  ====================
 894      Content Type                  Value
 895      ============================  ====================
 896      ``DW_LNCT_LLVM_source``       0x2001
 897      ============================  ====================
 898
 899 .. _amdgpu-code-conventions:
 900
 901 Code Conventions
 902 ================
 903
 904 This section provides code conventions used for each supported target triple OS
 905 (see :ref:`amdgpu-target-triples`).
 906
 907 AMDHSA
 908 ------
 909
 910 This section provides code conventions used when the target triple OS is
 911 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 912
 913 .. _amdgpu-amdhsa-hsa-code-object-metadata:
 914
 915 Code Object Metadata
 916 ~~~~~~~~~~~~~~~~~~~~
 917
 918 The code object metadata specifies extensible metadata associated with the code
 919 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
 920 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
 921 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
 922 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
 923 information necessary to support the ROCM kernel queries. For example, the
 924 segment sizes needed in a dispatch packet. In addition, a high level language
 925 runtime may require other information to be included. For example, the AMD
 926 OpenCL runtime records kernel argument information.
 927
 928 The metadata is specified as a YAML formatted string (see [YAML]_ and
 929 :doc:`YamlIO`).
 930
 931 .. TODO
 932    Is the string null terminated? It probably should not if YAML allows it to
 933    contain null characters, otherwise it should be.
 934
 935 The metadata is represented as a single YAML document comprised of the mapping
 936 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 937 referenced tables.
 938
 939 For boolean values, the string values of ``false`` and ``true`` are used for
 940 false and true respectively.
 941
 942 Additional information can be added to the mappings. To avoid conflicts, any
 943 non-AMD key names should be prefixed by "*vendor-name*.".
 944
 945   .. table:: AMDHSA Code Object Metadata Mapping
 946      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 947
 948      ========== ============== ========= =======================================
 949      String Key Value Type     Required? Description
 950      ========== ============== ========= =======================================
 951      "Version"  sequence of    Required  - The first integer is the major
 952                 2 integers                 version. Currently 1.
 953                                          - The second integer is the minor
 954                                            version. Currently 0.
 955      "Printf"   sequence of              Each string is encoded information
 956                 strings                  about a printf function call. The
 957                                          encoded information is organized as
 958                                          fields separated by colon (':'):
 959
 960                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
 961
 962                                          where:
 963
 964                                          ``ID``
 965                                            A 32 bit integer as a unique id for
 966                                            each printf function call
 967
 968                                          ``N``
 969                                            A 32 bit integer equal to the number
 970                                            of arguments of printf function call
 971                                            minus 1
 972
 973                                          ``S[i]`` (where i = 0, 1, ... , N-1)
 974                                            32 bit integers for the size in bytes
 975                                            of the i-th FormatString argument of
 976                                            the printf function call
 977
 978                                          FormatString
 979                                            The format string passed to the
 980                                            printf function call.
 981      "Kernels"  sequence of    Required  Sequence of the mappings for each
 982                 mapping                  kernel in the code object. See
 983                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
 984                                          for the definition of the mapping.
 985      ========== ============== ========= =======================================
 986
 987 ..
 988
 989   .. table:: AMDHSA Code Object Kernel Metadata Mapping
 990      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
 991
 992      ================= ============== ========= ================================
 993      String Key        Value Type     Required? Description
 994      ================= ============== ========= ================================
 995      "Name"            string         Required  Source name of the kernel.
 996      "SymbolName"      string         Required  Name of the kernel
 997                                                 descriptor ELF symbol.
 998      "Language"        string                   Source language of the kernel.
 999                                                 Values include:
1000
1001                                                 - "OpenCL C"
1002                                                 - "OpenCL C++"
1003                                                 - "HCC"
1004                                                 - "OpenMP"
1005
1006      "LanguageVersion" sequence of              - The first integer is the major
1007                        2 integers                 version.
1008                                                 - The second integer is the
1009                                                   minor version.
1010      "Attrs"           mapping                  Mapping of kernel attributes.
1011                                                 See
1012                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1013                                                 for the mapping definition.
1014      "Args"            sequence of              Sequence of mappings of the
1015                        mapping                  kernel arguments. See
1016                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1017                                                 for the definition of the mapping.
1018      "CodeProps"       mapping                  Mapping of properties related to
1019                                                 the kernel code. See
1020                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1021                                                 for the mapping definition.
1022      ================= ============== ========= ================================
1023
1024 ..
1025
1026   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1027      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1028
1029      =================== ============== ========= ==============================
1030      String Key          Value Type     Required? Description
1031      =================== ============== ========= ==============================
1032      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1033                          3 integers               must be >=1 and the dispatch
1034                                                   work-group size X, Y, Z must
1035                                                   correspond to the specified
1036                                                   values. Defaults to 0, 0, 0.
1037
1038                                                   Corresponds to the OpenCL
1039                                                   ``reqd_work_group_size``
1040                                                   attribute.
1041      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1042                          3 integers               X, Y, Z is likely to be the
1043                                                   specified values.
1044
1045                                                   Corresponds to the OpenCL
1046                                                   ``work_group_size_hint``
1047                                                   attribute.
1048      "VecTypeHint"       string                   The name of a scalar or vector
1049                                                   type.
1050
1051                                                   Corresponds to the OpenCL
1052                                                   ``vec_type_hint`` attribute.
1053
1054      "RuntimeHandle"     string                   The external symbol name
1055                                                   associated with a kernel.
1056                                                   OpenCL runtime allocates a
1057                                                   global buffer for the symbol
1058                                                   and saves the kernel's address
1059                                                   to it, which is used for
1060                                                   device side enqueueing. Only
1061                                                   available for device side
1062                                                   enqueued kernels.
1063      =================== ============== ========= ==============================
1064
1065 ..
1066
1067   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1068      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1069
1070      ================= ============== ========= ================================
1071      String Key        Value Type     Required? Description
1072      ================= ============== ========= ================================
1073      "Name"            string                   Kernel argument name.
1074      "TypeName"        string                   Kernel argument type name.
1075      "Size"            integer        Required  Kernel argument size in bytes.
1076      "Align"           integer        Required  Kernel argument alignment in
1077                                                 bytes. Must be a power of two.
1078      "ValueKind"       string         Required  Kernel argument kind that
1079                                                 specifies how to set up the
1080                                                 corresponding argument.
1081                                                 Values include:
1082
1083                                                 "ByValue"
1084                                                   The argument is copied
1085                                                   directly into the kernarg.
1086
1087                                                 "GlobalBuffer"
1088                                                   A global address space pointer
1089                                                   to the buffer data is passed
1090                                                   in the kernarg.
1091
1092                                                 "DynamicSharedPointer"
1093                                                   A group address space pointer
1094                                                   to dynamically allocated LDS
1095                                                   is passed in the kernarg.
1096
1097                                                 "Sampler"
1098                                                   A global address space
1099                                                   pointer to a S# is passed in
1100                                                   the kernarg.
1101
1102                                                 "Image"
1103                                                   A global address space
1104                                                   pointer to a T# is passed in
1105                                                   the kernarg.
1106
1107                                                 "Pipe"
1108                                                   A global address space pointer
1109                                                   to an OpenCL pipe is passed in
1110                                                   the kernarg.
1111
1112                                                 "Queue"
1113                                                   A global address space pointer
1114                                                   to an OpenCL device enqueue
1115                                                   queue is passed in the
1116                                                   kernarg.
1117
1118                                                 "HiddenGlobalOffsetX"
1119                                                   The OpenCL grid dispatch
1120                                                   global offset for the X
1121                                                   dimension is passed in the
1122                                                   kernarg.
1123
1124                                                 "HiddenGlobalOffsetY"
1125                                                   The OpenCL grid dispatch
1126                                                   global offset for the Y
1127                                                   dimension is passed in the
1128                                                   kernarg.
1129
1130                                                 "HiddenGlobalOffsetZ"
1131                                                   The OpenCL grid dispatch
1132                                                   global offset for the Z
1133                                                   dimension is passed in the
1134                                                   kernarg.
1135
1136                                                 "HiddenNone"
1137                                                   An argument that is not used
1138                                                   by the kernel. Space needs to
1139                                                   be left for it, but it does
1140                                                   not need to be set up.
1141
1142                                                 "HiddenPrintfBuffer"
1143                                                   A global address space pointer
1144                                                   to the runtime printf buffer
1145                                                   is passed in kernarg.
1146
1147                                                 "HiddenDefaultQueue"
1148                                                   A global address space pointer
1149                                                   to the OpenCL device enqueue
1150                                                   queue that should be used by
1151                                                   the kernel by default is
1152                                                   passed in the kernarg.
1153
1154                                                 "HiddenCompletionAction"
1155                                                   A global address space pointer
1156                                                   to help link enqueued kernels into
1157                                                   the ancestor tree for determining
1158                                                   when the parent kernel has finished.
1159
1160      "ValueType"       string         Required  Kernel argument value type. Only
1161                                                 present if "ValueKind" is
1162                                                 "ByValue". For vector data
1163                                                 types, the value is for the
1164                                                 element type. Values include:
1165
1166                                                 - "Struct"
1167                                                 - "I8"
1168                                                 - "U8"
1169                                                 - "I16"
1170                                                 - "U16"
1171                                                 - "F16"
1172                                                 - "I32"
1173                                                 - "U32"
1174                                                 - "F32"
1175                                                 - "I64"
1176                                                 - "U64"
1177                                                 - "F64"
1178
1179                                                 .. TODO
1180                                                    How can it be determined if a
1181                                                    vector type, and what size
1182                                                    vector?
1183      "PointeeAlign"    integer                  Alignment in bytes of pointee
1184                                                 type for pointer type kernel
1185                                                 argument. Must be a power
1186                                                 of 2. Only present if
1187                                                 "ValueKind" is
1188                                                 "DynamicSharedPointer".
1189      "AddrSpaceQual"   string                   Kernel argument address space
1190                                                 qualifier. Only present if
1191                                                 "ValueKind" is "GlobalBuffer" or
1192                                                 "DynamicSharedPointer". Values
1193                                                 are:
1194
1195                                                 - "Private"
1196                                                 - "Global"
1197                                                 - "Constant"
1198                                                 - "Local"
1199                                                 - "Generic"
1200                                                 - "Region"
1201
1202                                                 .. TODO
1203                                                    Is GlobalBuffer only Global
1204                                                    or Constant? Is
1205                                                    DynamicSharedPointer always
1206                                                    Local? Can HCC allow Generic?
1207                                                    How can Private or Region
1208                                                    ever happen?
1209      "AccQual"         string                   Kernel argument access
1210                                                 qualifier. Only present if
1211                                                 "ValueKind" is "Image" or
1212                                                 "Pipe". Values
1213                                                 are:
1214
1215                                                 - "ReadOnly"
1216                                                 - "WriteOnly"
1217                                                 - "ReadWrite"
1218
1219                                                 .. TODO
1220                                                    Does this apply to
1221                                                    GlobalBuffer?
1222      "ActualAccQual"   string                   The actual memory accesses
1223                                                 performed by the kernel on the
1224                                                 kernel argument. Only present if
1225                                                 "ValueKind" is "GlobalBuffer",
1226                                                 "Image", or "Pipe". This may be
1227                                                 more restrictive than indicated
1228                                                 by "AccQual" to reflect what the
1229                                                 kernel actual does. If not
1230                                                 present then the runtime must
1231                                                 assume what is implied by
1232                                                 "AccQual" and "IsConst". Values
1233                                                 are:
1234
1235                                                 - "ReadOnly"
1236                                                 - "WriteOnly"
1237                                                 - "ReadWrite"
1238
1239      "IsConst"         boolean                  Indicates if the kernel argument
1240                                                 is const qualified. Only present
1241                                                 if "ValueKind" is
1242                                                 "GlobalBuffer".
1243
1244      "IsRestrict"      boolean                  Indicates if the kernel argument
1245                                                 is restrict qualified. Only
1246                                                 present if "ValueKind" is
1247                                                 "GlobalBuffer".
1248
1249      "IsVolatile"      boolean                  Indicates if the kernel argument
1250                                                 is volatile qualified. Only
1251                                                 present if "ValueKind" is
1252                                                 "GlobalBuffer".
1253
1254      "IsPipe"          boolean                  Indicates if the kernel argument
1255                                                 is pipe qualified. Only present
1256                                                 if "ValueKind" is "Pipe".
1257
1258                                                 .. TODO
1259                                                    Can GlobalBuffer be pipe
1260                                                    qualified?
1261      ================= ============== ========= ================================
1262
1263 ..
1264
1265   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1266      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1267
1268      ============================ ============== ========= =====================
1269      String Key                   Value Type     Required? Description
1270      ============================ ============== ========= =====================
1271      "KernargSegmentSize"         integer        Required  The size in bytes of
1272                                                            the kernarg segment
1273                                                            that holds the values
1274                                                            of the arguments to
1275                                                            the kernel.
1276      "GroupSegmentFixedSize"      integer        Required  The amount of group
1277                                                            segment memory
1278                                                            required by a
1279                                                            work-group in
1280                                                            bytes. This does not
1281                                                            include any
1282                                                            dynamically allocated
1283                                                            group segment memory
1284                                                            that may be added
1285                                                            when the kernel is
1286                                                            dispatched.
1287      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1288                                                            private address space
1289                                                            memory required for a
1290                                                            work-item in
1291                                                            bytes. If the kernel
1292                                                            uses a dynamic call
1293                                                            stack then additional
1294                                                            space must be added
1295                                                            to this value for the
1296                                                            call stack.
1297      "KernargSegmentAlign"        integer        Required  The maximum byte
1298                                                            alignment of
1299                                                            arguments in the
1300                                                            kernarg segment. Must
1301                                                            be a power of 2.
1302      "WavefrontSize"              integer        Required  Wavefront size. Must
1303                                                            be a power of 2.
1304      "NumSGPRs"                   integer        Required  Number of scalar
1305                                                            registers used by a
1306                                                            wavefront for
1307                                                            GFX6-GFX9. This
1308                                                            includes the special
1309                                                            SGPRs for VCC, Flat
1310                                                            Scratch (GFX7-GFX9)
1311                                                            and XNACK (for
1312                                                            GFX8-GFX9). It does
1313                                                            not include the 16
1314                                                            SGPR added if a trap
1315                                                            handler is
1316                                                            enabled. It is not
1317                                                            rounded up to the
1318                                                            allocation
1319                                                            granularity.
1320      "NumVGPRs"                   integer        Required  Number of vector
1321                                                            registers used by
1322                                                            each work-item for
1323                                                            GFX6-GFX9
1324      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1325                                                            work-group size
1326                                                            supported by the
1327                                                            kernel in work-items.
1328                                                            Must be >=1 and
1329                                                            consistent with
1330                                                            ReqdWorkGroupSize if
1331                                                            not 0, 0, 0.
1332      "NumSpilledSGPRs"            integer                  Number of stores from
1333                                                            a scalar register to
1334                                                            a register allocator
1335                                                            created spill
1336                                                            location.
1337      "NumSpilledVGPRs"            integer                  Number of stores from
1338                                                            a vector register to
1339                                                            a register allocator
1340                                                            created spill
1341                                                            location.
1342      ============================ ============== ========= =====================
1343
1344 ..
1345
1346 Kernel Dispatch
1347 ~~~~~~~~~~~~~~~
1348
1349 The HSA architected queuing language (AQL) defines a user space memory interface
1350 that can be used to control the dispatch of kernels, in an agent independent
1351 way. An agent can have zero or more AQL queues created for it using the ROCm
1352 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1353 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1354 mechanics and packet layouts.
1355
1356 The packet processor of a kernel agent is responsible for detecting and
1357 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1358 packet processor is implemented by the hardware command processor (CP),
1359 asynchronous dispatch controller (ADC) and shader processor input controller
1360 (SPI).
1361
1362 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1363 mode driver to initialize and register the AQL queue with CP.
1364
1365 To dispatch a kernel the following actions are performed. This can occur in the
1366 CPU host program, or from an HSA kernel executing on a GPU.
1367
1368 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1369    executed is obtained.
1370 2. A pointer to the kernel descriptor (see
1371    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1372    obtained. It must be for a kernel that is contained in a code object that that
1373    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1374    associated.
1375 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1376    for a memory region with the kernarg property for the kernel agent that will
1377    execute the kernel. It must be at least 16 byte aligned.
1378 4. Kernel argument values are assigned to the kernel argument memory
1379    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1380    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1381    memory in the same way constant memory is accessed. (Note that the HSA
1382    specification allows an implementation to copy the kernel argument contents to
1383    another location that is accessed by the kernel.)
1384 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1385    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1386    packet. The packet must be set up, and the final write must use an atomic
1387    store release to set the packet kind to ensure the packet contents are
1388    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1389    notify the kernel agent that the AQL queue has been updated. These rules, and
1390    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1391    System Architecture Specification* [HSA]_.
1392 6. A kernel dispatch packet includes information about the actual dispatch,
1393    such as grid and work-group size, together with information from the code
1394    object about the kernel, such as segment sizes. The ROCm runtime queries on
1395    the kernel symbol can be used to obtain the code object values which are
1396    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1397 7. CP executes micro-code and is responsible for detecting and setting up the
1398    GPU to execute the wavefronts of a kernel dispatch.
1399 8. CP ensures that when the a wavefront starts executing the kernel machine
1400    code, the scalar general purpose registers (SGPR) and vector general purpose
1401    registers (VGPR) are set up as required by the machine code. The required
1402    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1403    register state is defined in
1404    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1405 9. The prolog of the kernel machine code (see
1406    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1407    before continuing executing the machine code that corresponds to the kernel.
1408 10. When the kernel dispatch has completed execution, CP signals the completion
1409     signal specified in the kernel dispatch packet if not 0.
1410
1411 .. _amdgpu-amdhsa-memory-spaces:
1412
1413 Memory Spaces
1414 ~~~~~~~~~~~~~
1415
1416 The memory space properties are:
1417
1418   .. table:: AMDHSA Memory Spaces
1419      :name: amdgpu-amdhsa-memory-spaces-table
1420
1421      ================= =========== ======== ======= ==================
1422      Memory Space Name HSA Segment Hardware Address NULL Value
1423                        Name        Name     Size
1424      ================= =========== ======== ======= ==================
1425      Private           private     scratch  32      0x00000000
1426      Local             group       LDS      32      0xFFFFFFFF
1427      Global            global      global   64      0x0000000000000000
1428      Constant          constant    *same as 64      0x0000000000000000
1429                                    global*
1430      Generic           flat        flat     64      0x0000000000000000
1431      Region            N/A         GDS      32      *not implemented
1432                                                     for AMDHSA*
1433      ================= =========== ======== ======= ==================
1434
1435 The global and constant memory spaces both use global virtual addresses, which
1436 are the same virtual address space used by the CPU. However, some virtual
1437 addresses may only be accessible to the CPU, some only accessible by the GPU,
1438 and some by both.
1439
1440 Using the constant memory space indicates that the data will not change during
1441 the execution of the kernel. This allows scalar read instructions to be
1442 used. The vector and scalar L1 caches are invalidated of volatile data before
1443 each kernel dispatch execution to allow constant memory to change values between
1444 kernel dispatches.
1445
1446 The local memory space uses the hardware Local Data Store (LDS) which is
1447 automatically allocated when the hardware creates work-groups of wavefronts, and
1448 freed when all the wavefronts of a work-group have terminated. The data store
1449 (DS) instructions can be used to access it.
1450
1451 The private memory space uses the hardware scratch memory support. If the kernel
1452 uses scratch, then the hardware allocates memory that is accessed using
1453 wavefront lane dword (4 byte) interleaving. The mapping used from private
1454 address to physical address is:
1455
1456   ``wavefront-scratch-base +
1457   (private-address * wavefront-size * 4) +
1458   (wavefront-lane-id * 4)``
1459
1460 There are different ways that the wavefront scratch base address is determined
1461 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1462 memory can be accessed in an interleaved manner using buffer instruction with
1463 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1464 instructions, or by flat instructions. If each lane of a wavefront accesses the
1465 same private address, the interleaving results in adjacent dwords being accessed
1466 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1467 supported except by flat and scratch instructions in GFX9.
1468
1469 The generic address space uses the hardware flat address support available in
1470 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1471 local appertures), that are outside the range of addressible global memory, to
1472 map from a flat address to a private or local address.
1473
1474 FLAT instructions can take a flat address and access global, private (scratch)
1475 and group (LDS) memory depending in if the address is within one of the
1476 apperture ranges. Flat access to scratch requires hardware aperture setup and
1477 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1478 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1479 (see :ref:`amdgpu-amdhsa-m0`).
1480
1481 To convert between a segment address and a flat address the base address of the
1482 appertures address can be used. For GFX7-GFX8 these are available in the
1483 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1484 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1485 GFX9 the appature base addresses are directly available as inline constant
1486 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1487 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1488 which makes it easier to convert from flat to segment or segment to flat.
1489
1490 Image and Samplers
1491 ~~~~~~~~~~~~~~~~~~
1492
1493 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1494 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1495 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1496 enumeration values for the queries that are not trivially deducible from the S#
1497 representation.
1498
1499 HSA Signals
1500 ~~~~~~~~~~~
1501
1502 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1503 structure allocated in memory accessible from both the CPU and GPU. The
1504 structure is defined by the ROCm runtime and subject to change between releases
1505 (see [AMD-ROCm-github]_).
1506
1507 .. _amdgpu-amdhsa-hsa-aql-queue:
1508
1509 HSA AQL Queue
1510 ~~~~~~~~~~~~~
1511
1512 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1513 between releases (see [AMD-ROCm-github]_). For some processors it contains
1514 fields needed to implement certain language features such as the flat address
1515 aperture bases. It also contains fields used by CP such as managing the
1516 allocation of scratch memory.
1517
1518 .. _amdgpu-amdhsa-kernel-descriptor:
1519
1520 Kernel Descriptor
1521 ~~~~~~~~~~~~~~~~~
1522
1523 A kernel descriptor consists of the information needed by CP to initiate the
1524 execution of a kernel, including the entry point address of the machine code
1525 that implements the kernel.
1526
1527 Kernel Descriptor for GFX6-GFX9
1528 +++++++++++++++++++++++++++++++
1529
1530 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1531
1532   .. table:: Kernel Descriptor for GFX6-GFX9
1533      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1534
1535      ======= ======= =============================== ============================
1536      Bits    Size    Field Name                      Description
1537      ======= ======= =============================== ============================
1538      31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1539                                                      address space memory
1540                                                      required for a work-group
1541                                                      in bytes. This does not
1542                                                      include any dynamically
1543                                                      allocated local address
1544                                                      space memory that may be
1545                                                      added when the kernel is
1546                                                      dispatched.
1547      63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1548                                                      private address space
1549                                                      memory required for a
1550                                                      work-item in bytes. If
1551                                                      is_dynamic_callstack is 1
1552                                                      then additional space must
1553                                                      be added to this value for
1554                                                      the call stack.
1555      127:64  8 bytes                                 Reserved, must be 0.
1556      191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1557                                                      negative) from base
1558                                                      address of kernel
1559                                                      descriptor to kernel's
1560                                                      entry point instruction
1561                                                      which must be 256 byte
1562                                                      aligned.
1563      383:192 24                                      Reserved, must be 0.
1564              bytes
1565      415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1566                                                      program settings used by
1567                                                      CP to set up
1568                                                      ``COMPUTE_PGM_RSRC1``
1569                                                      configuration
1570                                                      register. See
1571                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1572      447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1573                                                      program settings used by
1574                                                      CP to set up
1575                                                      ``COMPUTE_PGM_RSRC2``
1576                                                      configuration
1577                                                      register. See
1578                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1579      448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1580                                                      SGPR user data registers
1581                                                      (see
1582                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1583
1584                                                      The total number of SGPR
1585                                                      user data registers
1586                                                      requested must not exceed
1587                                                      16 and match value in
1588                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1589                                                      Any requests beyond 16
1590                                                      will be ignored.
1591      449     1 bit   EnableSGPRDispatchPtr           *see above*
1592      450     1 bit   EnableSGPRQueuePtr              *see above*
1593      451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1594      452     1 bit   EnableSGPRDispatchID            *see above*
1595      453     1 bit   EnableSGPRFlatScratchInit       *see above*
1596      454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1597      455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1598                                                      should always be 0.
1599      456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1600                                                      should always be 0.
1601      457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1602                                                      should always be 0.
1603      463:458 6 bits                                  Reserved, must be 0.
1604      511:464 6                                       Reserved, must be 0.
1605              bytes
1606      512     **Total size 64 bytes.**
1607      ======= ====================================================================
1608
1609 ..
1610
1611   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1612      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1613
1614      ======= ======= =============================== ===========================================================================
1615      Bits    Size    Field Name                      Description
1616      ======= ======= =============================== ===========================================================================
1617      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1618                                                      used by each work-item,
1619                                                      granularity is device
1620                                                      specific:
1621
1622                                                      GFX6-GFX9
1623                                                        - max_vgpr 1..256
1624                                                        - roundup((max_vgpg + 1)
1625                                                          / 4) - 1
1626
1627                                                      Used by CP to set up
1628                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1629      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1630                                                      used by a wavefront,
1631                                                      granularity is device
1632                                                      specific:
1633
1634                                                      GFX6-GFX8
1635                                                        - max_sgpr 1..112
1636                                                        - roundup((max_sgpg + 1)
1637                                                          / 8) - 1
1638                                                      GFX9
1639                                                        - max_sgpr 1..112
1640                                                        - roundup((max_sgpg + 1)
1641                                                          / 16) - 1
1642
1643                                                      Includes the special SGPRs
1644                                                      for VCC, Flat Scratch (for
1645                                                      GFX7 onwards) and XNACK
1646                                                      (for GFX8 onwards). It does
1647                                                      not include the 16 SGPR
1648                                                      added if a trap handler is
1649                                                      enabled.
1650
1651                                                      Used by CP to set up
1652                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1653      11:10   2 bits  PRIORITY                        Must be 0.
1654
1655                                                      Start executing wavefront
1656                                                      at the specified priority.
1657
1658                                                      CP is responsible for
1659                                                      filling in
1660                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1661      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1662                                                      with specified rounding
1663                                                      mode for single (32
1664                                                      bit) floating point
1665                                                      precision floating point
1666                                                      operations.
1667
1668                                                      Floating point rounding
1669                                                      mode values are defined in
1670                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1671
1672                                                      Used by CP to set up
1673                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1674      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1675                                                      with specified rounding
1676                                                      denorm mode for half/double (16
1677                                                      and 64 bit) floating point
1678                                                      precision floating point
1679                                                      operations.
1680
1681                                                      Floating point rounding
1682                                                      mode values are defined in
1683                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1684
1685                                                      Used by CP to set up
1686                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1687      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1688                                                      with specified denorm mode
1689                                                      for single (32
1690                                                      bit)  floating point
1691                                                      precision floating point
1692                                                      operations.
1693
1694                                                      Floating point denorm mode
1695                                                      values are defined in
1696                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1697
1698                                                      Used by CP to set up
1699                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1700      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1701                                                      with specified denorm mode
1702                                                      for half/double (16
1703                                                      and 64 bit) floating point
1704                                                      precision floating point
1705                                                      operations.
1706
1707                                                      Floating point denorm mode
1708                                                      values are defined in
1709                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1710
1711                                                      Used by CP to set up
1712                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1713      20      1 bit   PRIV                            Must be 0.
1714
1715                                                      Start executing wavefront
1716                                                      in privilege trap handler
1717                                                      mode.
1718
1719                                                      CP is responsible for
1720                                                      filling in
1721                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1722      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1723                                                      with DX10 clamp mode
1724                                                      enabled. Used by the vector
1725                                                      ALU to force DX10 style
1726                                                      treatment of NaN's (when
1727                                                      set, clamp NaN to zero,
1728                                                      otherwise pass NaN
1729                                                      through).
1730
1731                                                      Used by CP to set up
1732                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1733      22      1 bit   DEBUG_MODE                      Must be 0.
1734
1735                                                      Start executing wavefront
1736                                                      in single step mode.
1737
1738                                                      CP is responsible for
1739                                                      filling in
1740                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1741      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1742                                                      with IEEE mode
1743                                                      enabled. Floating point
1744                                                      opcodes that support
1745                                                      exception flag gathering
1746                                                      will quiet and propagate
1747                                                      signaling-NaN inputs per
1748                                                      IEEE 754-2008. Min_dx10 and
1749                                                      max_dx10 become IEEE
1750                                                      754-2008 compliant due to
1751                                                      signaling-NaN propagation
1752                                                      and quieting.
1753
1754                                                      Used by CP to set up
1755                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1756      24      1 bit   BULKY                           Must be 0.
1757
1758                                                      Only one work-group allowed
1759                                                      to execute on a compute
1760                                                      unit.
1761
1762                                                      CP is responsible for
1763                                                      filling in
1764                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1765      25      1 bit   CDBG_USER                       Must be 0.
1766
1767                                                      Flag that can be used to
1768                                                      control debugging code.
1769
1770                                                      CP is responsible for
1771                                                      filling in
1772                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1773      26      1 bit   FP16_OVFL                       GFX6-GFX8
1774                                                        Reserved, must be 0.
1775                                                      GFX9
1776                                                        Wavefront starts execution
1777                                                        with specified fp16 overflow
1778                                                        mode.
1779
1780                                                        - If 0, fp16 overflow generates
1781                                                          +/-INF values.
1782                                                        - If 1, fp16 overflow that is the
1783                                                          result of an +/-INF input value
1784                                                          or divide by 0 produces a +/-INF,
1785                                                          otherwise clamps computed
1786                                                          overflow to +/-MAX_FP16 as
1787                                                          appropriate.
1788
1789                                                        Used by CP to set up
1790                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1791      31:27   5 bits                                  Reserved, must be 0.
1792      32      **Total size 4 bytes**
1793      ======= ===================================================================================================================
1794
1795 ..
1796
1797   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1798      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1799
1800      ======= ======= =============================== ===========================================================================
1801      Bits    Size    Field Name                      Description
1802      ======= ======= =============================== ===========================================================================
1803      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1804                      _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
1805                                                      system register (see
1806                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1807
1808                                                      Used by CP to set up
1809                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1810      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1811                                                      user data registers
1812                                                      requested. This number must
1813                                                      match the number of user
1814                                                      data registers enabled.
1815
1816                                                      Used by CP to set up
1817                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1818      6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1819                                                      TRAP instruction which
1820                                                      requires a trap handler to
1821                                                      be enabled.
1822
1823                                                      CP sets
1824                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1825                                                      if the runtime has
1826                                                      installed a trap handler
1827                                                      regardless of the setting
1828                                                      of this field.
1829      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1830                                                      system SGPR register for
1831                                                      the work-group id in the X
1832                                                      dimension (see
1833                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1834
1835                                                      Used by CP to set up
1836                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1837      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1838                                                      system SGPR register for
1839                                                      the work-group id in the Y
1840                                                      dimension (see
1841                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1842
1843                                                      Used by CP to set up
1844                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1845      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1846                                                      system SGPR register for
1847                                                      the work-group id in the Z
1848                                                      dimension (see
1849                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851                                                      Used by CP to set up
1852                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1853      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1854                                                      system SGPR register for
1855                                                      work-group information (see
1856                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1857
1858                                                      Used by CP to set up
1859                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1860      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1861                                                      VGPR system registers used
1862                                                      for the work-item ID.
1863                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1864                                                      defines the values.
1865
1866                                                      Used by CP to set up
1867                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1868      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1869
1870                                                      Wavefront starts execution
1871                                                      with address watch
1872                                                      exceptions enabled which
1873                                                      are generated when L1 has
1874                                                      witnessed a thread access
1875                                                      an *address of
1876                                                      interest*.
1877
1878                                                      CP is responsible for
1879                                                      filling in the address
1880                                                      watch bit in
1881                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1882                                                      according to what the
1883                                                      runtime requests.
1884      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1885
1886                                                      Wavefront starts execution
1887                                                      with memory violation
1888                                                      exceptions exceptions
1889                                                      enabled which are generated
1890                                                      when a memory violation has
1891                                                      occurred for this wavefront from
1892                                                      L1 or LDS
1893                                                      (write-to-read-only-memory,
1894                                                      mis-aligned atomic, LDS
1895                                                      address out of range,
1896                                                      illegal address, etc.).
1897
1898                                                      CP sets the memory
1899                                                      violation bit in
1900                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1901                                                      according to what the
1902                                                      runtime requests.
1903      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1904
1905                                                      CP uses the rounded value
1906                                                      from the dispatch packet,
1907                                                      not this value, as the
1908                                                      dispatch may contain
1909                                                      dynamically allocated group
1910                                                      segment memory. CP writes
1911                                                      directly to
1912                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1913
1914                                                      Amount of group segment
1915                                                      (LDS) to allocate for each
1916                                                      work-group. Granularity is
1917                                                      device specific:
1918
1919                                                      GFX6:
1920                                                        roundup(lds-size / (64 * 4))
1921                                                      GFX7-GFX9:
1922                                                        roundup(lds-size / (128 * 4))
1923
1924      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1925                      _INVALID_OPERATION              with specified exceptions
1926                                                      enabled.
1927
1928                                                      Used by CP to set up
1929                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1930                                                      (set from bits 0..6).
1931
1932                                                      IEEE 754 FP Invalid
1933                                                      Operation
1934      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
1935                      _SOURCE                         input operands is a
1936                                                      denormal number
1937      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1938                      _DIVISION_BY_ZERO               Zero
1939      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
1940                      _OVERFLOW
1941      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1942                      _UNDERFLOW
1943      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1944                      _INEXACT
1945      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1946                      _ZERO                           (rcp_iflag_f32 instruction
1947                                                      only)
1948      31      1 bit                                   Reserved, must be 0.
1949      32      **Total size 4 bytes.**
1950      ======= ===================================================================================================================
1951
1952 ..
1953
1954   .. table:: Floating Point Rounding Mode Enumeration Values
1955      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1956
1957      ====================================== ===== ==============================
1958      Enumeration Name                       Value Description
1959      ====================================== ===== ==============================
1960      AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
1961      AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
1962      AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
1963      AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
1964      ====================================== ===== ==============================
1965
1966 ..
1967
1968   .. table:: Floating Point Denorm Mode Enumeration Values
1969      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1970
1971      ====================================== ===== ==============================
1972      Enumeration Name                       Value Description
1973      ====================================== ===== ==============================
1974      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
1975                                                   Denorms
1976      AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
1977      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
1978      AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
1979      ====================================== ===== ==============================
1980
1981 ..
1982
1983   .. table:: System VGPR Work-Item ID Enumeration Values
1984      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1985
1986      ======================================== ===== ============================
1987      Enumeration Name                         Value Description
1988      ======================================== ===== ============================
1989      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
1990                                                     ID.
1991      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1992                                                     dimensions ID.
1993      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1994                                                     dimensions ID.
1995      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1996      ======================================== ===== ============================
1997
1998 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1999
2000 Initial Kernel Execution State
2001 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2002
2003 This section defines the register state that will be set up by the packet
2004 processor prior to the start of execution of every wavefront. This is limited by
2005 the constraints of the hardware controllers of CP/ADC/SPI.
2006
2007 The order of the SGPR registers is defined, but the compiler can specify which
2008 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2009 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2010 for enabled registers are dense starting at SGPR0: the first enabled register is
2011 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2012 an SGPR number.
2013
2014 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2015 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2016 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2017 initialized. These are then immediately followed by the System SGPRs that are
2018 set up by ADC/SPI and can have different values for each wavefront of the grid
2019 dispatch.
2020
2021 SGPR register initial state is defined in
2022 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2023
2024   .. table:: SGPR Register Set Up Order
2025      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2026
2027      ========== ========================== ====== ==============================
2028      SGPR Order Name                       Number Description
2029                 (kernel descriptor enable  of
2030                 field)                     SGPRs
2031      ========== ========================== ====== ==============================
2032      First      Private Segment Buffer     4      V# that can be used, together
2033                 (enable_sgpr_private              with Scratch Wavefront Offset
2034                 _segment_buffer)                  as an offset, to access the
2035                                                   private memory space using a
2036                                                   segment address.
2037
2038                                                   CP uses the value provided by
2039                                                   the runtime.
2040      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2041                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2042                                                   actually executing.
2043      then       Queue Ptr                  2      64 bit address of amd_queue_t
2044                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2045                                                   the dispatch packet was
2046                                                   queued.
2047      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2048                 (enable_sgpr_kernarg              segment. This is directly
2049                 _segment_ptr)                     copied from the
2050                                                   kernarg_address in the kernel
2051                                                   dispatch packet.
2052
2053                                                   Having CP load it once avoids
2054                                                   loading it at the beginning of
2055                                                   every wavefront.
2056      then       Dispatch Id                2      64 bit Dispatch ID of the
2057                 (enable_sgpr_dispatch_id)         dispatch packet being
2058                                                   executed.
2059      then       Flat Scratch Init          2      This is 2 SGPRs:
2060                 (enable_sgpr_flat_scratch
2061                 _init)                            GFX6
2062                                                     Not supported.
2063                                                   GFX7-GFX8
2064                                                     The first SGPR is a 32 bit
2065                                                     byte offset from
2066                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2067                                                     to per SPI base of memory
2068                                                     for scratch for the queue
2069                                                     executing the kernel
2070                                                     dispatch. CP obtains this
2071                                                     from the runtime. (The
2072                                                     Scratch Segment Buffer base
2073                                                     address is
2074                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2075                                                     plus this offset.) The value
2076                                                     of Scratch Wavefront Offset must
2077                                                     be added to this offset by
2078                                                     the kernel machine code,
2079                                                     right shifted by 8, and
2080                                                     moved to the FLAT_SCRATCH_HI
2081                                                     SGPR register.
2082                                                     FLAT_SCRATCH_HI corresponds
2083                                                     to SGPRn-4 on GFX7, and
2084                                                     SGPRn-6 on GFX8 (where SGPRn
2085                                                     is the highest numbered SGPR
2086                                                     allocated to the wavefront).
2087                                                     FLAT_SCRATCH_HI is
2088                                                     multiplied by 256 (as it is
2089                                                     in units of 256 bytes) and
2090                                                     added to
2091                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2092                                                     to calculate the per wavefront
2093                                                     FLAT SCRATCH BASE in flat
2094                                                     memory instructions that
2095                                                     access the scratch
2096                                                     apperture.
2097
2098                                                     The second SGPR is 32 bit
2099                                                     byte size of a single
2100                                                     work-item's scratch memory
2101                                                     usage. CP obtains this from
2102                                                     the runtime, and it is
2103                                                     always a multiple of DWORD.
2104                                                     CP checks that the value in
2105                                                     the kernel dispatch packet
2106                                                     Private Segment Byte Size is
2107                                                     not larger, and requests the
2108                                                     runtime to increase the
2109                                                     queue's scratch size if
2110                                                     necessary. The kernel code
2111                                                     must move it to
2112                                                     FLAT_SCRATCH_LO which is
2113                                                     SGPRn-3 on GFX7 and SGPRn-5
2114                                                     on GFX8. FLAT_SCRATCH_LO is
2115                                                     used as the FLAT SCRATCH
2116                                                     SIZE in flat memory
2117                                                     instructions. Having CP load
2118                                                     it once avoids loading it at
2119                                                     the beginning of every
2120                                                     wavefront.
2121                                                   GFX9
2122                                                     This is the
2123                                                     64 bit base address of the
2124                                                     per SPI scratch backing
2125                                                     memory managed by SPI for
2126                                                     the queue executing the
2127                                                     kernel dispatch. CP obtains
2128                                                     this from the runtime (and
2129                                                     divides it if there are
2130                                                     multiple Shader Arrays each
2131                                                     with its own SPI). The value
2132                                                     of Scratch Wavefront Offset must
2133                                                     be added by the kernel
2134                                                     machine code and the result
2135                                                     moved to the FLAT_SCRATCH
2136                                                     SGPR which is SGPRn-6 and
2137                                                     SGPRn-5. It is used as the
2138                                                     FLAT SCRATCH BASE in flat
2139                                                     memory instructions.
2140      then       Private Segment Size       1      The 32 bit byte size of a
2141                                                   (enable_sgpr_private single
2142                                                   work-item's
2143                                                   scratch_segment_size) memory
2144                                                   allocation. This is the
2145                                                   value from the kernel
2146                                                   dispatch packet Private
2147                                                   Segment Byte Size rounded up
2148                                                   by CP to a multiple of
2149                                                   DWORD.
2150
2151                                                   Having CP load it once avoids
2152                                                   loading it at the beginning of
2153                                                   every wavefront.
2154
2155                                                   This is not used for
2156                                                   GFX7-GFX8 since it is the same
2157                                                   value as the second SGPR of
2158                                                   Flat Scratch Init. However, it
2159                                                   may be needed for GFX9 which
2160                                                   changes the meaning of the
2161                                                   Flat Scratch Init value.
2162      then       Grid Work-Group Count X    1      32 bit count of the number of
2163                 (enable_sgpr_grid                 work-groups in the X dimension
2164                 _workgroup_count_X)               for the grid being
2165                                                   executed. Computed from the
2166                                                   fields in the kernel dispatch
2167                                                   packet as ((grid_size.x +
2168                                                   workgroup_size.x - 1) /
2169                                                   workgroup_size.x).
2170      then       Grid Work-Group Count Y    1      32 bit count of the number of
2171                 (enable_sgpr_grid                 work-groups in the Y dimension
2172                 _workgroup_count_Y &&             for the grid being
2173                 less than 16 previous             executed. Computed from the
2174                 SGPRs)                            fields in the kernel dispatch
2175                                                   packet as ((grid_size.y +
2176                                                   workgroup_size.y - 1) /
2177                                                   workgroupSize.y).
2178
2179                                                   Only initialized if <16
2180                                                   previous SGPRs initialized.
2181      then       Grid Work-Group Count Z    1      32 bit count of the number of
2182                 (enable_sgpr_grid                 work-groups in the Z dimension
2183                 _workgroup_count_Z &&             for the grid being
2184                 less than 16 previous             executed. Computed from the
2185                 SGPRs)                            fields in the kernel dispatch
2186                                                   packet as ((grid_size.z +
2187                                                   workgroup_size.z - 1) /
2188                                                   workgroupSize.z).
2189
2190                                                   Only initialized if <16
2191                                                   previous SGPRs initialized.
2192      then       Work-Group Id X            1      32 bit work-group id in X
2193                 (enable_sgpr_workgroup_id         dimension of grid for
2194                 _X)                               wavefront.
2195      then       Work-Group Id Y            1      32 bit work-group id in Y
2196                 (enable_sgpr_workgroup_id         dimension of grid for
2197                 _Y)                               wavefront.
2198      then       Work-Group Id Z            1      32 bit work-group id in Z
2199                 (enable_sgpr_workgroup_id         dimension of grid for
2200                 _Z)                               wavefront.
2201      then       Work-Group Info            1      {first_wavefront, 14'b0000,
2202                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2203                 _info)                            threadgroup_size_in_wavefronts[5:0]}
2204      then       Scratch Wavefront Offset   1      32 bit byte offset from base
2205                 (enable_sgpr_private              of scratch base of queue
2206                 _segment_wavefront_offset)        executing the kernel
2207                                                   dispatch. Must be used as an
2208                                                   offset with Private
2209                                                   segment address when using
2210                                                   Scratch Segment Buffer. It
2211                                                   must be used to set up FLAT
2212                                                   SCRATCH for flat addressing
2213                                                   (see
2214                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2215      ========== ========================== ====== ==============================
2216
2217 The order of the VGPR registers is defined, but the compiler can specify which
2218 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2219 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2220 for enabled registers are dense starting at VGPR0: the first enabled register is
2221 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2222 VGPR number.
2223
2224 VGPR register initial state is defined in
2225 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2226
2227   .. table:: VGPR Register Set Up Order
2228      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2229
2230      ========== ========================== ====== ==============================
2231      VGPR Order Name                       Number Description
2232                 (kernel descriptor enable  of
2233                 field)                     VGPRs
2234      ========== ========================== ====== ==============================
2235      First      Work-Item Id X             1      32 bit work item id in X
2236                 (Always initialized)              dimension of work-group for
2237                                                   wavefront lane.
2238      then       Work-Item Id Y             1      32 bit work item id in Y
2239                 (enable_vgpr_workitem_id          dimension of work-group for
2240                 > 0)                              wavefront lane.
2241      then       Work-Item Id Z             1      32 bit work item id in Z
2242                 (enable_vgpr_workitem_id          dimension of work-group for
2243                 > 1)                              wavefront lane.
2244      ========== ========================== ====== ==============================
2245
2246 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2247
2248 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2249    registers.
2250 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2251    combination including none.
2252 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2253    its value cannot included with the flat scratch init value which is per queue.
2254 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2255    or (X, Y, Z).
2256
2257 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2258 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2259
2260 The global segment can be accessed either using buffer instructions (GFX6 which
2261 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2262 instructions (GFX9).
2263
2264 If buffer operations are used then the compiler can generate a V# with the
2265 following properties:
2266
2267 * base address of 0
2268 * no swizzle
2269 * ATC: 1 if IOMMU present (such as APU)
2270 * ptr64: 1
2271 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2272   APU and NC for dGPU).
2273
2274 .. _amdgpu-amdhsa-kernel-prolog:
2275
2276 Kernel Prolog
2277 ~~~~~~~~~~~~~
2278
2279 .. _amdgpu-amdhsa-m0:
2280
2281 M0
2282 ++
2283
2284 GFX6-GFX8
2285   The M0 register must be initialized with a value at least the total LDS size
2286   if the kernel may access LDS via DS or flat operations. Total LDS size is
2287   available in dispatch packet. For M0, it is also possible to use maximum
2288   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2289   GFX7-GFX8).
2290 GFX9
2291   The M0 register is not used for range checking LDS accesses and so does not
2292   need to be initialized in the prolog.
2293
2294 .. _amdgpu-amdhsa-flat-scratch:
2295
2296 Flat Scratch
2297 ++++++++++++
2298
2299 If the kernel may use flat operations to access scratch memory, the prolog code
2300 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2301 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2302 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2303
2304 GFX6
2305   Flat scratch is not supported.
2306
2307 GFX7-GFX8
2308   1. The low word of Flat Scratch Init is 32 bit byte offset from
2309      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2310      being managed by SPI for the queue executing the kernel dispatch. This is
2311      the same value used in the Scratch Segment Buffer V# base address. The
2312      prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2313      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2314      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2315      by 8 before moving into FLAT_SCRATCH_LO.
2316   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2317      work-items scratch memory usage. This is directly loaded from the kernel
2318      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2319      DWORD. Having CP load it once avoids loading it at the beginning of every
2320      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2321      SIZE.
2322
2323 GFX9
2324   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2325   memory being managed by SPI for the queue executing the kernel dispatch. The
2326   prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
2327   pair for use as the flat scratch base in flat memory instructions.
2328
2329 .. _amdgpu-amdhsa-memory-model:
2330
2331 Memory Model
2332 ~~~~~~~~~~~~
2333
2334 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2335 (see :ref:`memmodel`). *The implementation is WIP.*
2336
2337 .. TODO
2338    Update when implementation complete.
2339
2340 The AMDGPU backend supports the memory synchronization scopes specified in
2341 :ref:`amdgpu-memory-scopes`.
2342
2343 The code sequences used to implement the memory model are defined in table
2344 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2345
2346 The sequences specify the order of instructions that a single thread must
2347 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2348 to other memory instructions executed by the same thread. This allows them to be
2349 moved earlier or later which can allow them to be combined with other instances
2350 of the same instruction, or hoisted/sunk out of loops to improve
2351 performance. Only the instructions related to the memory model are given;
2352 additional ``s_waitcnt`` instructions are required to ensure registers are
2353 defined before being used. These may be able to be combined with the memory
2354 model ``s_waitcnt`` instructions as described above.
2355
2356 The AMDGPU backend supports the following memory models:
2357
2358   HSA Memory Model [HSA]_
2359     The HSA memory model uses a single happens-before relation for all address
2360     spaces (see :ref:`amdgpu-address-spaces`).
2361   OpenCL Memory Model [OpenCL]_
2362     The OpenCL memory model which has separate happens-before relations for the
2363     global and local address spaces. Only a fence specifying both global and
2364     local address space, and seq_cst instructions join the relationships. Since
2365     the LLVM ``memfence`` instruction does not allow an address space to be
2366     specified the OpenCL fence has to convervatively assume both local and
2367     global address space was specified. However, optimizations can often be
2368     done to eliminate the additional ``s_waitcnt`` instructions when there are
2369     no intervening memory instructions which access the corresponding address
2370     space. The code sequences in the table indicate what can be omitted for the
2371     OpenCL memory. The target triple environment is used to determine if the
2372     source language is OpenCL (see :ref:`amdgpu-opencl`).
2373
2374 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2375 operations.
2376
2377 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2378 termed vector memory operations.
2379
2380 For GFX6-GFX9:
2381
2382 * Each agent has multiple compute units (CU).
2383 * Each CU has multiple SIMDs that execute wavefronts.
2384 * The wavefronts for a single work-group are executed in the same CU but may be
2385   executed by different SIMDs.
2386 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2387   executing on it.
2388 * All LDS operations of a CU are performed as wavefront wide operations in a
2389   global order and involve no caching. Completion is reported to a wavefront in
2390   execution order.
2391 * The LDS memory has multiple request queues shared by the SIMDs of a
2392   CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2393   can be reordered relative to each other, which can result in reordering the
2394   visibility of vector memory operations with respect to LDS operations of other
2395   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2396   ensure synchronization between LDS operations and vector memory operations
2397   between wavefronts of a work-group, but not between operations performed by the
2398   same wavefront.
2399 * The vector memory operations are performed as wavefront wide operations and
2400   completion is reported to a wavefront in execution order. The exception is
2401   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2402   vector memory order if they access LDS memory, and out of LDS operation order
2403   if they access global memory.
2404 * The vector memory operations access a single vector L1 cache shared by all
2405   SIMDs a CU. Therefore, no special action is required for coherence between the
2406   lanes of a single wavefront, or for coherence between wavefronts in the same
2407   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2408   executing in different work-groups as they may be executing on different CUs.
2409 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2410   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2411   scalar operations are used in a restricted way so do not impact the memory
2412   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2413 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2414   the same agent.
2415 * The L2 cache has independent channels to service disjoint ranges of virtual
2416   addresses.
2417 * Each CU has a separate request queue per channel. Therefore, the vector and
2418   scalar memory operations performed by wavefronts executing in different work-groups
2419   (which may be executing on different CUs) of an agent can be reordered
2420   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2421   synchronization between vector memory operations of different CUs. It ensures a
2422   previous vector memory operation has completed before executing a subsequent
2423   vector memory or LDS operation and so can be used to meet the requirements of
2424   acquire and release.
2425 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2426   of virtual addresses can be set up to bypass it to ensure system coherence.
2427
2428 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2429 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2430 memory, atomic memory orderings are not meaningful and all accesses are treated
2431 as non-atomic.
2432
2433 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2434 scalar memory instructions). Since the constant address space contents do not
2435 change during the execution of a kernel dispatch it is not legal to perform
2436 stores, and atomic memory orderings are not meaningful and all access are
2437 treated as non-atomic.
2438
2439 A memory synchronization scope wider than work-group is not meaningful for the
2440 group (LDS) address space and is treated as work-group.
2441
2442 The memory model does not support the region address space which is treated as
2443 non-atomic.
2444
2445 Acquire memory ordering is not meaningful on store atomic instructions and is
2446 treated as non-atomic.
2447
2448 Release memory ordering is not meaningful on load atomic instructions and is
2449 treated a non-atomic.
2450
2451 Acquire-release memory ordering is not meaningful on load or store atomic
2452 instructions and is treated as acquire and release respectively.
2453
2454 AMDGPU backend only uses scalar memory operations to access memory that is
2455 proven to not change during the execution of the kernel dispatch. This includes
2456 constant address space and global address space for program scope const
2457 variables. Therefore the kernel machine code does not have to maintain the
2458 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2459 and vector L1 caches are invalidated between kernel dispatches by CP since
2460 constant address space data may change between kernel dispatch executions. See
2461 :ref:`amdgpu-amdhsa-memory-spaces`.
2462
2463 The one execption is if scalar writes are used to spill SGPR registers. In this
2464 case the AMDGPU backend ensures the memory location used to spill is never
2465 accessed by vector memory operations at the same time. If scalar writes are used
2466 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2467 return since the locations may be used for vector memory instructions by a
2468 future wavefront that uses the same scratch area, or a function call that creates a
2469 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2470 as all scalar writes are write-before-read in the same thread.
2471
2472 Scratch backing memory (which is used for the private address space)
2473 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2474 address space is only accessed by a single thread, and is always
2475 write-before-read, there is never a need to invalidate these entries from the L1
2476 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2477 volatile cache lines.
2478
2479 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2480 to invalidate the L2 cache. This also causes it to be treated as
2481 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2482 (cache coherent) and so the L2 cache will coherent with the CPU and other
2483 agents.
2484
2485   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2486      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2487
2488      ============ ============ ============== ========== ===============================
2489      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2490                   Ordering     Sync Scope     Address
2491                                               Space
2492      ============ ============ ============== ========== ===============================
2493      **Non-Atomic**
2494      -----------------------------------------------------------------------------------
2495      load         *none*       *none*         - global   - !volatile & !nontemporal
2496                                               - generic
2497                                               - private    1. buffer/global/flat_load
2498                                               - constant
2499                                                          - volatile & !nontemporal
2500
2501                                                            1. buffer/global/flat_load
2502                                                               glc=1
2503
2504                                                          - nontemporal
2505
2506                                                            1. buffer/global/flat_load
2507                                                               glc=1 slc=1
2508
2509      load         *none*       *none*         - local    1. ds_load
2510      store        *none*       *none*         - global   - !nontemporal
2511                                               - generic
2512                                               - private    1. buffer/global/flat_store
2513                                               - constant
2514                                                          - nontemporal
2515
2516                                                            1. buffer/global/flat_stote
2517                                                               glc=1 slc=1
2518
2519      store        *none*       *none*         - local    1. ds_store
2520      **Unordered Atomic**
2521      -----------------------------------------------------------------------------------
2522      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2523      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2524      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2525                                                          atomic*.
2526      **Monotonic Atomic**
2527      -----------------------------------------------------------------------------------
2528      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2529                                - wavefront    - generic
2530                                - workgroup
2531      load atomic  monotonic    - singlethread - local    1. ds_load
2532                                - wavefront
2533                                - workgroup
2534      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2535                                - system       - generic     glc=1
2536      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2537                                - wavefront    - generic
2538                                - workgroup
2539                                - agent
2540                                - system
2541      store atomic monotonic    - singlethread - local    1. ds_store
2542                                - wavefront
2543                                - workgroup
2544      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2545                                - wavefront    - generic
2546                                - workgroup
2547                                - agent
2548                                - system
2549      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2550                                - wavefront
2551                                - workgroup
2552      **Acquire Atomic**
2553      -----------------------------------------------------------------------------------
2554      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2555                                - wavefront    - local
2556                                               - generic
2557      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2558      load atomic  acquire      - workgroup    - local    1. ds_load
2559                                                          2. s_waitcnt lgkmcnt(0)
2560
2561                                                            - If OpenCL, omit.
2562                                                            - Must happen before
2563                                                              any following
2564                                                              global/generic
2565                                                              load/load
2566                                                              atomic/store/store
2567                                                              atomic/atomicrmw.
2568                                                            - Ensures any
2569                                                              following global
2570                                                              data read is no
2571                                                              older than the load
2572                                                              atomic value being
2573                                                              acquired.
2574      load atomic  acquire      - workgroup    - generic  1. flat_load
2575                                                          2. s_waitcnt lgkmcnt(0)
2576
2577                                                            - If OpenCL, omit.
2578                                                            - Must happen before
2579                                                              any following
2580                                                              global/generic
2581                                                              load/load
2582                                                              atomic/store/store
2583                                                              atomic/atomicrmw.
2584                                                            - Ensures any
2585                                                              following global
2586                                                              data read is no
2587                                                              older than the load
2588                                                              atomic value being
2589                                                              acquired.
2590      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2591                                - system                     glc=1
2592                                                          2. s_waitcnt vmcnt(0)
2593
2594                                                            - Must happen before
2595                                                              following
2596                                                              buffer_wbinvl1_vol.
2597                                                            - Ensures the load
2598                                                              has completed
2599                                                              before invalidating
2600                                                              the cache.
2601
2602                                                          3. buffer_wbinvl1_vol
2603
2604                                                            - Must happen before
2605                                                              any following
2606                                                              global/generic
2607                                                              load/load
2608                                                              atomic/atomicrmw.
2609                                                            - Ensures that
2610                                                              following
2611                                                              loads will not see
2612                                                              stale global data.
2613
2614      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2615                                - system                  2. s_waitcnt vmcnt(0) &
2616                                                             lgkmcnt(0)
2617
2618                                                            - If OpenCL omit
2619                                                              lgkmcnt(0).
2620                                                            - Must happen before
2621                                                              following
2622                                                              buffer_wbinvl1_vol.
2623                                                            - Ensures the flat_load
2624                                                              has completed
2625                                                              before invalidating
2626                                                              the cache.
2627
2628                                                          3. buffer_wbinvl1_vol
2629
2630                                                            - Must happen before
2631                                                              any following
2632                                                              global/generic
2633                                                              load/load
2634                                                              atomic/atomicrmw.
2635                                                            - Ensures that
2636                                                              following loads
2637                                                              will not see stale
2638                                                              global data.
2639
2640      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2641                                - wavefront    - local
2642                                               - generic
2643      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2644      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2645                                                          2. waitcnt lgkmcnt(0)
2646
2647                                                            - If OpenCL, omit.
2648                                                            - Must happen before
2649                                                              any following
2650                                                              global/generic
2651                                                              load/load
2652                                                              atomic/store/store
2653                                                              atomic/atomicrmw.
2654                                                            - Ensures any
2655                                                              following global
2656                                                              data read is no
2657                                                              older than the
2658                                                              atomicrmw value
2659                                                              being acquired.
2660
2661      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2662                                                          2. waitcnt lgkmcnt(0)
2663
2664                                                            - If OpenCL, omit.
2665                                                            - Must happen before
2666                                                              any following
2667                                                              global/generic
2668                                                              load/load
2669                                                              atomic/store/store
2670                                                              atomic/atomicrmw.
2671                                                            - Ensures any
2672                                                              following global
2673                                                              data read is no
2674                                                              older than the
2675                                                              atomicrmw value
2676                                                              being acquired.
2677
2678      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2679                                - system                  2. s_waitcnt vmcnt(0)
2680
2681                                                            - Must happen before
2682                                                              following
2683                                                              buffer_wbinvl1_vol.
2684                                                            - Ensures the
2685                                                              atomicrmw has
2686                                                              completed before
2687                                                              invalidating the
2688                                                              cache.
2689
2690                                                          3. buffer_wbinvl1_vol
2691
2692                                                            - Must happen before
2693                                                              any following
2694                                                              global/generic
2695                                                              load/load
2696                                                              atomic/atomicrmw.
2697                                                            - Ensures that
2698                                                              following loads
2699                                                              will not see stale
2700                                                              global data.
2701
2702      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2703                                - system                  2. s_waitcnt vmcnt(0) &
2704                                                             lgkmcnt(0)
2705
2706                                                            - If OpenCL, omit
2707                                                              lgkmcnt(0).
2708                                                            - Must happen before
2709                                                              following
2710                                                              buffer_wbinvl1_vol.
2711                                                            - Ensures the
2712                                                              atomicrmw has
2713                                                              completed before
2714                                                              invalidating the
2715                                                              cache.
2716
2717                                                          3. buffer_wbinvl1_vol
2718
2719                                                            - Must happen before
2720                                                              any following
2721                                                              global/generic
2722                                                              load/load
2723                                                              atomic/atomicrmw.
2724                                                            - Ensures that
2725                                                              following loads
2726                                                              will not see stale
2727                                                              global data.
2728
2729      fence        acquire      - singlethread *none*     *none*
2730                                - wavefront
2731      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2732
2733                                                            - If OpenCL and
2734                                                              address space is
2735                                                              not generic, omit.
2736                                                            - However, since LLVM
2737                                                              currently has no
2738                                                              address space on
2739                                                              the fence need to
2740                                                              conservatively
2741                                                              always generate. If
2742                                                              fence had an
2743                                                              address space then
2744                                                              set to address
2745                                                              space of OpenCL
2746                                                              fence flag, or to
2747                                                              generic if both
2748                                                              local and global
2749                                                              flags are
2750                                                              specified.
2751                                                            - Must happen after
2752                                                              any preceding
2753                                                              local/generic load
2754                                                              atomic/atomicrmw
2755                                                              with an equal or
2756                                                              wider sync scope
2757                                                              and memory ordering
2758                                                              stronger than
2759                                                              unordered (this is
2760                                                              termed the
2761                                                              fence-paired-atomic).
2762                                                            - Must happen before
2763                                                              any following
2764                                                              global/generic
2765                                                              load/load
2766                                                              atomic/store/store
2767                                                              atomic/atomicrmw.
2768                                                            - Ensures any
2769                                                              following global
2770                                                              data read is no
2771                                                              older than the
2772                                                              value read by the
2773                                                              fence-paired-atomic.
2774
2775      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2776                                - system                     vmcnt(0)
2777
2778                                                            - If OpenCL and
2779                                                              address space is
2780                                                              not generic, omit
2781                                                              lgkmcnt(0).
2782                                                            - However, since LLVM
2783                                                              currently has no
2784                                                              address space on
2785                                                              the fence need to
2786                                                              conservatively
2787                                                              always generate
2788                                                              (see comment for
2789                                                              previous fence).
2790                                                            - Could be split into
2791                                                              separate s_waitcnt
2792                                                              vmcnt(0) and
2793                                                              s_waitcnt
2794                                                              lgkmcnt(0) to allow
2795                                                              them to be
2796                                                              independently moved
2797                                                              according to the
2798                                                              following rules.
2799                                                            - s_waitcnt vmcnt(0)
2800                                                              must happen after
2801                                                              any preceding
2802                                                              global/generic load
2803                                                              atomic/atomicrmw
2804                                                              with an equal or
2805                                                              wider sync scope
2806                                                              and memory ordering
2807                                                              stronger than
2808                                                              unordered (this is
2809                                                              termed the
2810                                                              fence-paired-atomic).
2811                                                            - s_waitcnt lgkmcnt(0)
2812                                                              must happen after
2813                                                              any preceding
2814                                                              local/generic load
2815                                                              atomic/atomicrmw
2816                                                              with an equal or
2817                                                              wider sync scope
2818                                                              and memory ordering
2819                                                              stronger than
2820                                                              unordered (this is
2821                                                              termed the
2822                                                              fence-paired-atomic).
2823                                                            - Must happen before
2824                                                              the following
2825                                                              buffer_wbinvl1_vol.
2826                                                            - Ensures that the
2827                                                              fence-paired atomic
2828                                                              has completed
2829                                                              before invalidating
2830                                                              the
2831                                                              cache. Therefore
2832                                                              any following
2833                                                              locations read must
2834                                                              be no older than
2835                                                              the value read by
2836                                                              the
2837                                                              fence-paired-atomic.
2838
2839                                                          2. buffer_wbinvl1_vol
2840
2841                                                            - Must happen before any
2842                                                              following global/generic
2843                                                              load/load
2844                                                              atomic/store/store
2845                                                              atomic/atomicrmw.
2846                                                            - Ensures that
2847                                                              following loads
2848                                                              will not see stale
2849                                                              global data.
2850
2851      **Release Atomic**
2852      -----------------------------------------------------------------------------------
2853      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2854                                - wavefront    - local
2855                                               - generic
2856      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2857
2858                                                            - If OpenCL, omit.
2859                                                            - Must happen after
2860                                                              any preceding
2861                                                              local/generic
2862                                                              load/store/load
2863                                                              atomic/store
2864                                                              atomic/atomicrmw.
2865                                                            - Must happen before
2866                                                              the following
2867                                                              store.
2868                                                            - Ensures that all
2869                                                              memory operations
2870                                                              to local have
2871                                                              completed before
2872                                                              performing the
2873                                                              store that is being
2874                                                              released.
2875
2876                                                          2. buffer/global/flat_store
2877      store atomic release      - workgroup    - local    1. ds_store
2878      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2879
2880                                                            - If OpenCL, omit.
2881                                                            - Must happen after
2882                                                              any preceding
2883                                                              local/generic
2884                                                              load/store/load
2885                                                              atomic/store
2886                                                              atomic/atomicrmw.
2887                                                            - Must happen before
2888                                                              the following
2889                                                              store.
2890                                                            - Ensures that all
2891                                                              memory operations
2892                                                              to local have
2893                                                              completed before
2894                                                              performing the
2895                                                              store that is being
2896                                                              released.
2897
2898                                                          2. flat_store
2899      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2900                                - system       - generic     vmcnt(0)
2901
2902                                                            - If OpenCL, omit
2903                                                              lgkmcnt(0).
2904                                                            - Could be split into
2905                                                              separate s_waitcnt
2906                                                              vmcnt(0) and
2907                                                              s_waitcnt
2908                                                              lgkmcnt(0) to allow
2909                                                              them to be
2910                                                              independently moved
2911                                                              according to the
2912                                                              following rules.
2913                                                            - s_waitcnt vmcnt(0)
2914                                                              must happen after
2915                                                              any preceding
2916                                                              global/generic
2917                                                              load/store/load
2918                                                              atomic/store
2919                                                              atomic/atomicrmw.
2920                                                            - s_waitcnt lgkmcnt(0)
2921                                                              must happen after
2922                                                              any preceding
2923                                                              local/generic
2924                                                              load/store/load
2925                                                              atomic/store
2926                                                              atomic/atomicrmw.
2927                                                            - Must happen before
2928                                                              the following
2929                                                              store.
2930                                                            - Ensures that all
2931                                                              memory operations
2932                                                              to memory have
2933                                                              completed before
2934                                                              performing the
2935                                                              store that is being
2936                                                              released.
2937
2938                                                          2. buffer/global/ds/flat_store
2939      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2940                                - wavefront    - local
2941                                               - generic
2942      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2943
2944                                                            - If OpenCL, omit.
2945                                                            - Must happen after
2946                                                              any preceding
2947                                                              local/generic
2948                                                              load/store/load
2949                                                              atomic/store
2950                                                              atomic/atomicrmw.
2951                                                            - Must happen before
2952                                                              the following
2953                                                              atomicrmw.
2954                                                            - Ensures that all
2955                                                              memory operations
2956                                                              to local have
2957                                                              completed before
2958                                                              performing the
2959                                                              atomicrmw that is
2960                                                              being released.
2961
2962                                                          2. buffer/global/flat_atomic
2963      atomicrmw    release      - workgroup    - local    1. ds_atomic
2964      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2965
2966                                                            - If OpenCL, omit.
2967                                                            - Must happen after
2968                                                              any preceding
2969                                                              local/generic
2970                                                              load/store/load
2971                                                              atomic/store
2972                                                              atomic/atomicrmw.
2973                                                            - Must happen before
2974                                                              the following
2975                                                              atomicrmw.
2976                                                            - Ensures that all
2977                                                              memory operations
2978                                                              to local have
2979                                                              completed before
2980                                                              performing the
2981                                                              atomicrmw that is
2982                                                              being released.
2983
2984                                                          2. flat_atomic
2985      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2986                                - system       - generic     vmcnt(0)
2987
2988                                                            - If OpenCL, omit
2989                                                              lgkmcnt(0).
2990                                                            - Could be split into
2991                                                              separate s_waitcnt
2992                                                              vmcnt(0) and
2993                                                              s_waitcnt
2994                                                              lgkmcnt(0) to allow
2995                                                              them to be
2996                                                              independently moved
2997                                                              according to the
2998                                                              following rules.
2999                                                            - s_waitcnt vmcnt(0)
3000                                                              must happen after
3001                                                              any preceding
3002                                                              global/generic
3003                                                              load/store/load
3004                                                              atomic/store
3005                                                              atomic/atomicrmw.
3006                                                            - s_waitcnt lgkmcnt(0)
3007                                                              must happen after
3008                                                              any preceding
3009                                                              local/generic
3010                                                              load/store/load
3011                                                              atomic/store
3012                                                              atomic/atomicrmw.
3013                                                            - Must happen before
3014                                                              the following
3015                                                              atomicrmw.
3016                                                            - Ensures that all
3017                                                              memory operations
3018                                                              to global and local
3019                                                              have completed
3020                                                              before performing
3021                                                              the atomicrmw that
3022                                                              is being released.
3023
3024                                                          2. buffer/global/ds/flat_atomic
3025      fence        release      - singlethread *none*     *none*
3026                                - wavefront
3027      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3028
3029                                                            - If OpenCL and
3030                                                              address space is
3031                                                              not generic, omit.
3032                                                            - However, since LLVM
3033                                                              currently has no
3034                                                              address space on
3035                                                              the fence need to
3036                                                              conservatively
3037                                                              always generate. If
3038                                                              fence had an
3039                                                              address space then
3040                                                              set to address
3041                                                              space of OpenCL
3042                                                              fence flag, or to
3043                                                              generic if both
3044                                                              local and global
3045                                                              flags are
3046                                                              specified.
3047                                                            - Must happen after
3048                                                              any preceding
3049                                                              local/generic
3050                                                              load/load
3051                                                              atomic/store/store
3052                                                              atomic/atomicrmw.
3053                                                            - Must happen before
3054                                                              any following store
3055                                                              atomic/atomicrmw
3056                                                              with an equal or
3057                                                              wider sync scope
3058                                                              and memory ordering
3059                                                              stronger than
3060                                                              unordered (this is
3061                                                              termed the
3062                                                              fence-paired-atomic).
3063                                                            - Ensures that all
3064                                                              memory operations
3065                                                              to local have
3066                                                              completed before
3067                                                              performing the
3068                                                              following
3069                                                              fence-paired-atomic.
3070
3071      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3072                                - system                     vmcnt(0)
3073
3074                                                            - If OpenCL and
3075                                                              address space is
3076                                                              not generic, omit
3077                                                              lgkmcnt(0).
3078                                                            - If OpenCL and
3079                                                              address space is
3080                                                              local, omit
3081                                                              vmcnt(0).
3082                                                            - However, since LLVM
3083                                                              currently has no
3084                                                              address space on
3085                                                              the fence need to
3086                                                              conservatively
3087                                                              always generate. If
3088                                                              fence had an
3089                                                              address space then
3090                                                              set to address
3091                                                              space of OpenCL
3092                                                              fence flag, or to
3093                                                              generic if both
3094                                                              local and global
3095                                                              flags are
3096                                                              specified.
3097                                                            - Could be split into
3098                                                              separate s_waitcnt
3099                                                              vmcnt(0) and
3100                                                              s_waitcnt
3101                                                              lgkmcnt(0) to allow
3102                                                              them to be
3103                                                              independently moved
3104                                                              according to the
3105                                                              following rules.
3106                                                            - s_waitcnt vmcnt(0)
3107                                                              must happen after
3108                                                              any preceding
3109                                                              global/generic
3110                                                              load/store/load
3111                                                              atomic/store
3112                                                              atomic/atomicrmw.
3113                                                            - s_waitcnt lgkmcnt(0)
3114                                                              must happen after
3115                                                              any preceding
3116                                                              local/generic
3117                                                              load/store/load
3118                                                              atomic/store
3119                                                              atomic/atomicrmw.
3120                                                            - Must happen before
3121                                                              any following store
3122                                                              atomic/atomicrmw
3123                                                              with an equal or
3124                                                              wider sync scope
3125                                                              and memory ordering
3126                                                              stronger than
3127                                                              unordered (this is
3128                                                              termed the
3129                                                              fence-paired-atomic).
3130                                                            - Ensures that all
3131                                                              memory operations
3132                                                              have
3133                                                              completed before
3134                                                              performing the
3135                                                              following
3136                                                              fence-paired-atomic.
3137
3138      **Acquire-Release Atomic**
3139      -----------------------------------------------------------------------------------
3140      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3141                                - wavefront    - local
3142                                               - generic
3143      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3144
3145                                                            - If OpenCL, omit.
3146                                                            - Must happen after
3147                                                              any preceding
3148                                                              local/generic
3149                                                              load/store/load
3150                                                              atomic/store
3151                                                              atomic/atomicrmw.
3152                                                            - Must happen before
3153                                                              the following
3154                                                              atomicrmw.
3155                                                            - Ensures that all
3156                                                              memory operations
3157                                                              to local have
3158                                                              completed before
3159                                                              performing the
3160                                                              atomicrmw that is
3161                                                              being released.
3162
3163                                                          2. buffer/global/flat_atomic
3164      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3165                                                          2. s_waitcnt lgkmcnt(0)
3166
3167                                                            - If OpenCL, omit.
3168                                                            - Must happen before
3169                                                              any following
3170                                                              global/generic
3171                                                              load/load
3172                                                              atomic/store/store
3173                                                              atomic/atomicrmw.
3174                                                            - Ensures any
3175                                                              following global
3176                                                              data read is no
3177                                                              older than the load
3178                                                              atomic value being
3179                                                              acquired.
3180
3181      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3182
3183                                                            - If OpenCL, omit.
3184                                                            - Must happen after
3185                                                              any preceding
3186                                                              local/generic
3187                                                              load/store/load
3188                                                              atomic/store
3189                                                              atomic/atomicrmw.
3190                                                            - Must happen before
3191                                                              the following
3192                                                              atomicrmw.
3193                                                            - Ensures that all
3194                                                              memory operations
3195                                                              to local have
3196                                                              completed before
3197                                                              performing the
3198                                                              atomicrmw that is
3199                                                              being released.
3200
3201                                                          2. flat_atomic
3202                                                          3. s_waitcnt lgkmcnt(0)
3203
3204                                                            - If OpenCL, omit.
3205                                                            - Must happen before
3206                                                              any following
3207                                                              global/generic
3208                                                              load/load
3209                                                              atomic/store/store
3210                                                              atomic/atomicrmw.
3211                                                            - Ensures any
3212                                                              following global
3213                                                              data read is no
3214                                                              older than the load
3215                                                              atomic value being
3216                                                              acquired.
3217
3218      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3219                                - system                     vmcnt(0)
3220
3221                                                            - If OpenCL, omit
3222                                                              lgkmcnt(0).
3223                                                            - Could be split into
3224                                                              separate s_waitcnt
3225                                                              vmcnt(0) and
3226                                                              s_waitcnt
3227                                                              lgkmcnt(0) to allow
3228                                                              them to be
3229                                                              independently moved
3230                                                              according to the
3231                                                              following rules.
3232                                                            - s_waitcnt vmcnt(0)
3233                                                              must happen after
3234                                                              any preceding
3235                                                              global/generic
3236                                                              load/store/load
3237                                                              atomic/store
3238                                                              atomic/atomicrmw.
3239                                                            - s_waitcnt lgkmcnt(0)
3240                                                              must happen after
3241                                                              any preceding
3242                                                              local/generic
3243                                                              load/store/load
3244                                                              atomic/store
3245                                                              atomic/atomicrmw.
3246                                                            - Must happen before
3247                                                              the following
3248                                                              atomicrmw.
3249                                                            - Ensures that all
3250                                                              memory operations
3251                                                              to global have
3252                                                              completed before
3253                                                              performing the
3254                                                              atomicrmw that is
3255                                                              being released.
3256
3257                                                          2. buffer/global/flat_atomic
3258                                                          3. s_waitcnt vmcnt(0)
3259
3260                                                            - Must happen before
3261                                                              following
3262                                                              buffer_wbinvl1_vol.
3263                                                            - Ensures the
3264                                                              atomicrmw has
3265                                                              completed before
3266                                                              invalidating the
3267                                                              cache.
3268
3269                                                          4. buffer_wbinvl1_vol
3270
3271                                                            - Must happen before
3272                                                              any following
3273                                                              global/generic
3274                                                              load/load
3275                                                              atomic/atomicrmw.
3276                                                            - Ensures that
3277                                                              following loads
3278                                                              will not see stale
3279                                                              global data.
3280
3281      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3282                                - system                     vmcnt(0)
3283
3284                                                            - If OpenCL, omit
3285                                                              lgkmcnt(0).
3286                                                            - Could be split into
3287                                                              separate s_waitcnt
3288                                                              vmcnt(0) and
3289                                                              s_waitcnt
3290                                                              lgkmcnt(0) to allow
3291                                                              them to be
3292                                                              independently moved
3293                                                              according to the
3294                                                              following rules.
3295                                                            - s_waitcnt vmcnt(0)
3296                                                              must happen after
3297                                                              any preceding
3298                                                              global/generic
3299                                                              load/store/load
3300                                                              atomic/store
3301                                                              atomic/atomicrmw.
3302                                                            - s_waitcnt lgkmcnt(0)
3303                                                              must happen after
3304                                                              any preceding
3305                                                              local/generic
3306                                                              load/store/load
3307                                                              atomic/store
3308                                                              atomic/atomicrmw.
3309                                                            - Must happen before
3310                                                              the following
3311                                                              atomicrmw.
3312                                                            - Ensures that all
3313                                                              memory operations
3314                                                              to global have
3315                                                              completed before
3316                                                              performing the
3317                                                              atomicrmw that is
3318                                                              being released.
3319
3320                                                          2. flat_atomic
3321                                                          3. s_waitcnt vmcnt(0) &
3322                                                             lgkmcnt(0)
3323
3324                                                            - If OpenCL, omit
3325                                                              lgkmcnt(0).
3326                                                            - Must happen before
3327                                                              following
3328                                                              buffer_wbinvl1_vol.
3329                                                            - Ensures the
3330                                                              atomicrmw has
3331                                                              completed before
3332                                                              invalidating the
3333                                                              cache.
3334
3335                                                          4. buffer_wbinvl1_vol
3336
3337                                                            - Must happen before
3338                                                              any following
3339                                                              global/generic
3340                                                              load/load
3341                                                              atomic/atomicrmw.
3342                                                            - Ensures that
3343                                                              following loads
3344                                                              will not see stale
3345                                                              global data.
3346
3347      fence        acq_rel      - singlethread *none*     *none*
3348                                - wavefront
3349      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3350
3351                                                            - If OpenCL and
3352                                                              address space is
3353                                                              not generic, omit.
3354                                                            - However,
3355                                                              since LLVM
3356                                                              currently has no
3357                                                              address space on
3358                                                              the fence need to
3359                                                              conservatively
3360                                                              always generate
3361                                                              (see comment for
3362                                                              previous fence).
3363                                                            - Must happen after
3364                                                              any preceding
3365                                                              local/generic
3366                                                              load/load
3367                                                              atomic/store/store
3368                                                              atomic/atomicrmw.
3369                                                            - Must happen before
3370                                                              any following
3371                                                              global/generic
3372                                                              load/load
3373                                                              atomic/store/store
3374                                                              atomic/atomicrmw.
3375                                                            - Ensures that all
3376                                                              memory operations
3377                                                              to local have
3378                                                              completed before
3379                                                              performing any
3380                                                              following global
3381                                                              memory operations.
3382                                                            - Ensures that the
3383                                                              preceding
3384                                                              local/generic load
3385                                                              atomic/atomicrmw
3386                                                              with an equal or
3387                                                              wider sync scope
3388                                                              and memory ordering
3389                                                              stronger than
3390                                                              unordered (this is
3391                                                              termed the
3392                                                              acquire-fence-paired-atomic
3393                                                              ) has completed
3394                                                              before following
3395                                                              global memory
3396                                                              operations. This
3397                                                              satisfies the
3398                                                              requirements of
3399                                                              acquire.
3400                                                            - Ensures that all
3401                                                              previous memory
3402                                                              operations have
3403                                                              completed before a
3404                                                              following
3405                                                              local/generic store
3406                                                              atomic/atomicrmw
3407                                                              with an equal or
3408                                                              wider sync scope
3409                                                              and memory ordering
3410                                                              stronger than
3411                                                              unordered (this is
3412                                                              termed the
3413                                                              release-fence-paired-atomic
3414                                                              ). This satisfies the
3415                                                              requirements of
3416                                                              release.
3417
3418      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3419                                - system                     vmcnt(0)
3420
3421                                                            - If OpenCL and
3422                                                              address space is
3423                                                              not generic, omit
3424                                                              lgkmcnt(0).
3425                                                            - However, since LLVM
3426                                                              currently has no
3427                                                              address space on
3428                                                              the fence need to
3429                                                              conservatively
3430                                                              always generate
3431                                                              (see comment for
3432                                                              previous fence).
3433                                                            - Could be split into
3434                                                              separate s_waitcnt
3435                                                              vmcnt(0) and
3436                                                              s_waitcnt
3437                                                              lgkmcnt(0) to allow
3438                                                              them to be
3439                                                              independently moved
3440                                                              according to the
3441                                                              following rules.
3442                                                            - s_waitcnt vmcnt(0)
3443                                                              must happen after
3444                                                              any preceding
3445                                                              global/generic
3446                                                              load/store/load
3447                                                              atomic/store
3448                                                              atomic/atomicrmw.
3449                                                            - s_waitcnt lgkmcnt(0)
3450                                                              must happen after
3451                                                              any preceding
3452                                                              local/generic
3453                                                              load/store/load
3454                                                              atomic/store
3455                                                              atomic/atomicrmw.
3456                                                            - Must happen before
3457                                                              the following
3458                                                              buffer_wbinvl1_vol.
3459                                                            - Ensures that the
3460                                                              preceding
3461                                                              global/local/generic
3462                                                              load
3463                                                              atomic/atomicrmw
3464                                                              with an equal or
3465                                                              wider sync scope
3466                                                              and memory ordering
3467                                                              stronger than
3468                                                              unordered (this is
3469                                                              termed the
3470                                                              acquire-fence-paired-atomic
3471                                                              ) has completed
3472                                                              before invalidating
3473                                                              the cache. This
3474                                                              satisfies the
3475                                                              requirements of
3476                                                              acquire.
3477                                                            - Ensures that all
3478                                                              previous memory
3479                                                              operations have
3480                                                              completed before a
3481                                                              following
3482                                                              global/local/generic
3483                                                              store
3484                                                              atomic/atomicrmw
3485                                                              with an equal or
3486                                                              wider sync scope
3487                                                              and memory ordering
3488                                                              stronger than
3489                                                              unordered (this is
3490                                                              termed the
3491                                                              release-fence-paired-atomic
3492                                                              ). This satisfies the
3493                                                              requirements of
3494                                                              release.
3495
3496                                                          2. buffer_wbinvl1_vol
3497
3498                                                            - Must happen before
3499                                                              any following
3500                                                              global/generic
3501                                                              load/load
3502                                                              atomic/store/store
3503                                                              atomic/atomicrmw.
3504                                                            - Ensures that
3505                                                              following loads
3506                                                              will not see stale
3507                                                              global data. This
3508                                                              satisfies the
3509                                                              requirements of
3510                                                              acquire.
3511
3512      **Sequential Consistent Atomic**
3513      -----------------------------------------------------------------------------------
3514      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3515                                - wavefront    - local    load atomic acquire,
3516                                               - generic  except must generated
3517                                                          all instructions even
3518                                                          for OpenCL.*
3519      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3520                                               - generic
3521                                                            - Must
3522                                                              happen after
3523                                                              preceding
3524                                                              global/generic load
3525                                                              atomic/store
3526                                                              atomic/atomicrmw
3527                                                              with memory
3528                                                              ordering of seq_cst
3529                                                              and with equal or
3530                                                              wider sync scope.
3531                                                              (Note that seq_cst
3532                                                              fences have their
3533                                                              own s_waitcnt
3534                                                              lgkmcnt(0) and so do
3535                                                              not need to be
3536                                                              considered.)
3537                                                            - Ensures any
3538                                                              preceding
3539                                                              sequential
3540                                                              consistent local
3541                                                              memory instructions
3542                                                              have completed
3543                                                              before executing
3544                                                              this sequentially
3545                                                              consistent
3546                                                              instruction. This
3547                                                              prevents reordering
3548                                                              a seq_cst store
3549                                                              followed by a
3550                                                              seq_cst load. (Note
3551                                                              that seq_cst is
3552                                                              stronger than
3553                                                              acquire/release as
3554                                                              the reordering of
3555                                                              load acquire
3556                                                              followed by a store
3557                                                              release is
3558                                                              prevented by the
3559                                                              waitcnt of
3560                                                              the release, but
3561                                                              there is nothing
3562                                                              preventing a store
3563                                                              release followed by
3564                                                              load acquire from
3565                                                              competing out of
3566                                                              order.)
3567
3568                                                          2. *Following
3569                                                             instructions same as
3570                                                             corresponding load
3571                                                             atomic acquire,
3572                                                             except must generated
3573                                                             all instructions even
3574                                                             for OpenCL.*
3575      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3576                                                          load atomic acquire,
3577                                                          except must generated
3578                                                          all instructions even
3579                                                          for OpenCL.*
3580      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3581                                - system       - generic     vmcnt(0)
3582
3583                                                            - Could be split into
3584                                                              separate s_waitcnt
3585                                                              vmcnt(0)
3586                                                              and s_waitcnt
3587                                                              lgkmcnt(0) to allow
3588                                                              them to be
3589                                                              independently moved
3590                                                              according to the
3591                                                              following rules.
3592                                                            - waitcnt lgkmcnt(0)
3593                                                              must happen after
3594                                                              preceding
3595                                                              global/generic load
3596                                                              atomic/store
3597                                                              atomic/atomicrmw
3598                                                              with memory
3599                                                              ordering of seq_cst
3600                                                              and with equal or
3601                                                              wider sync scope.
3602                                                              (Note that seq_cst
3603                                                              fences have their
3604                                                              own s_waitcnt
3605                                                              lgkmcnt(0) and so do
3606                                                              not need to be
3607                                                              considered.)
3608                                                            - waitcnt vmcnt(0)
3609                                                              must happen after
3610                                                              preceding
3611                                                              global/generic load
3612                                                              atomic/store
3613                                                              atomic/atomicrmw
3614                                                              with memory
3615                                                              ordering of seq_cst
3616                                                              and with equal or
3617                                                              wider sync scope.
3618                                                              (Note that seq_cst
3619                                                              fences have their
3620                                                              own s_waitcnt
3621                                                              vmcnt(0) and so do
3622                                                              not need to be
3623                                                              considered.)
3624                                                            - Ensures any
3625                                                              preceding
3626                                                              sequential
3627                                                              consistent global
3628                                                              memory instructions
3629                                                              have completed
3630                                                              before executing
3631                                                              this sequentially
3632                                                              consistent
3633                                                              instruction. This
3634                                                              prevents reordering
3635                                                              a seq_cst store
3636                                                              followed by a
3637                                                              seq_cst load. (Note
3638                                                              that seq_cst is
3639                                                              stronger than
3640                                                              acquire/release as
3641                                                              the reordering of
3642                                                              load acquire
3643                                                              followed by a store
3644                                                              release is
3645                                                              prevented by the
3646                                                              waitcnt of
3647                                                              the release, but
3648                                                              there is nothing
3649                                                              preventing a store
3650                                                              release followed by
3651                                                              load acquire from
3652                                                              competing out of
3653                                                              order.)
3654
3655                                                          2. *Following
3656                                                             instructions same as
3657                                                             corresponding load
3658                                                             atomic acquire,
3659                                                             except must generated
3660                                                             all instructions even
3661                                                             for OpenCL.*
3662      store atomic seq_cst      - singlethread - global   *Same as corresponding
3663                                - wavefront    - local    store atomic release,
3664                                - workgroup    - generic  except must generated
3665                                                          all instructions even
3666                                                          for OpenCL.*
3667      store atomic seq_cst      - agent        - global   *Same as corresponding
3668                                - system       - generic  store atomic release,
3669                                                          except must generated
3670                                                          all instructions even
3671                                                          for OpenCL.*
3672      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3673                                - wavefront    - local    atomicrmw acq_rel,
3674                                - workgroup    - generic  except must generated
3675                                                          all instructions even
3676                                                          for OpenCL.*
3677      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3678                                - system       - generic  atomicrmw acq_rel,
3679                                                          except must generated
3680                                                          all instructions even
3681                                                          for OpenCL.*
3682      fence        seq_cst      - singlethread *none*     *Same as corresponding
3683                                - wavefront               fence acq_rel,
3684                                - workgroup               except must generated
3685                                - agent                   all instructions even
3686                                - system                  for OpenCL.*
3687      ============ ============ ============== ========== ===============================
3688
3689 The memory order also adds the single thread optimization constrains defined in
3690 table
3691 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3692
3693   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3694      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3695
3696      ============ ==============================================================
3697      LLVM Memory  Optimization Constraints
3698      Ordering
3699      ============ ==============================================================
3700      unordered    *none*
3701      monotonic    *none*
3702      acquire      - If a load atomic/atomicrmw then no following load/load
3703                     atomic/store/ store atomic/atomicrmw/fence instruction can
3704                     be moved before the acquire.
3705                   - If a fence then same as load atomic, plus no preceding
3706                     associated fence-paired-atomic can be moved after the fence.
3707      release      - If a store atomic/atomicrmw then no preceding load/load
3708                     atomic/store/ store atomic/atomicrmw/fence instruction can
3709                     be moved after the release.
3710                   - If a fence then same as store atomic, plus no following
3711                     associated fence-paired-atomic can be moved before the
3712                     fence.
3713      acq_rel      Same constraints as both acquire and release.
3714      seq_cst      - If a load atomic then same constraints as acquire, plus no
3715                     preceding sequentially consistent load atomic/store
3716                     atomic/atomicrmw/fence instruction can be moved after the
3717                     seq_cst.
3718                   - If a store atomic then the same constraints as release, plus
3719                     no following sequentially consistent load atomic/store
3720                     atomic/atomicrmw/fence instruction can be moved before the
3721                     seq_cst.
3722                   - If an atomicrmw/fence then same constraints as acq_rel.
3723      ============ ==============================================================
3724
3725 Trap Handler ABI
3726 ~~~~~~~~~~~~~~~~
3727
3728 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3729 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3730 the ``s_trap`` instruction with the following usage:
3731
3732   .. table:: AMDGPU Trap Handler for AMDHSA OS
3733      :name: amdgpu-trap-handler-for-amdhsa-os-table
3734
3735      =================== =============== =============== =======================
3736      Usage               Code Sequence   Trap Handler    Description
3737                                          Inputs
3738      =================== =============== =============== =======================
3739      reserved            ``s_trap 0x00``                 Reserved by hardware.
3740      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3741                                            ``queue_ptr`` ``debugtrap``
3742                                          ``VGPR0``:      intrinsic (not
3743                                            ``arg``       implemented).
3744      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3745                                            ``queue_ptr`` terminated and its
3746                                                          associated queue put
3747                                                          into the error state.
3748      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3749                                            ``queue_ptr`` installed handled
3750                                                          same as ``llvm.trap``.
3751      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3752                                                          breakpoints.
3753      debugger            ``s_trap 0x08``                 Reserved for debugger.
3754      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3755      debugger            ``s_trap 0xff``                 Reserved for debugger.
3756      =================== =============== =============== =======================
3757
3758 Unspecified OS
3759 --------------
3760
3761 This section provides code conventions used when the target triple OS is
3762 empty (see :ref:`amdgpu-target-triples`).
3763
3764 Trap Handler ABI
3765 ~~~~~~~~~~~~~~~~
3766
3767 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3768 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3769 instructions are handled as follows:
3770
3771   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3772      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3773
3774      =============== =============== ===========================================
3775      Usage           Code Sequence   Description
3776      =============== =============== ===========================================
3777      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3778      llvm.debugtrap  *none*          Compiler warning given that there is no
3779                                      trap handler installed.
3780      =============== =============== ===========================================
3781
3782 Source Languages
3783 ================
3784
3785 .. _amdgpu-opencl:
3786
3787 OpenCL
3788 ------
3789
3790 When generating code for the OpenCL language the target triple environment
3791 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3792
3793 When the language is OpenCL the following differences occur:
3794
3795 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3796 2. The AMDGPU backend adds additional arguments to the kernel.
3797 3. Additional metadata is generated
3798    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3799
3800 .. TODO
3801    Specify what affect this has. Hidden arguments added. Additional metadata
3802    generated.
3803
3804 .. _amdgpu-hcc:
3805
3806 HCC
3807 ---
3808
3809 When generating code for the OpenCL language the target triple environment
3810 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3811
3812 When the language is OpenCL the following differences occur:
3813
3814 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3815
3816 .. TODO
3817    Specify what affect this has.
3818
3819 Assembler
3820 ---------
3821
3822 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3823 It supports AMDGCN GFX6-GFX9.
3824
3825 This section describes general syntax for instructions and operands.
3826
3827 Instructions
3828 ~~~~~~~~~~~~
3829
3830 .. toctree::
3831    :hidden:
3832
3833    AMDGPUAsmGFX7
3834    AMDGPUAsmGFX8
3835    AMDGPUAsmGFX9
3836    AMDGPUOperandSyntax
3837
3838 An instruction has the following syntax:
3839
3840     *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
3841
3842 Note that operands are normally comma-separated while modifiers are space-separated.
3843
3844 The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
3845
3846 See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
3847 :doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
3848
3849 Note that features under development are not included in this description.
3850
3851 For more information about instructions, their semantics and supported combinations of
3852 operands, refer to one of instruction set architecture manuals
3853 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3854
3855 Operands
3856 ~~~~~~~~
3857
3858 The following syntax for register operands is supported:
3859
3860 * SGPR registers: s0, ... or s[0], ...
3861 * VGPR registers: v0, ... or v[0], ...
3862 * TTMP registers: ttmp0, ... or ttmp[0], ...
3863 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3864 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3865 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3866 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3867 * Register index expressions: v[2*2], s[1-1:2-1]
3868 * 'off' indicates that an operand is not enabled
3869
3870 Modifiers
3871 ~~~~~~~~~
3872
3873 Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
3874
3875 Instruction Examples
3876 ~~~~~~~~~~~~~~~~~~~~
3877
3878 DS
3879 ++
3880
3881 .. code-block:: nasm
3882
3883   ds_add_u32 v2, v4 offset:16
3884   ds_write_src2_b64 v2 offset0:4 offset1:8
3885   ds_cmpst_f32 v2, v4, v6
3886   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3887
3888
3889 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3890
3891 FLAT
3892 ++++
3893
3894 .. code-block:: nasm
3895
3896   flat_load_dword v1, v[3:4]
3897   flat_store_dwordx3 v[3:4], v[5:7]
3898   flat_atomic_swap v1, v[3:4], v5 glc
3899   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3900   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3901
3902 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3903
3904 MUBUF
3905 +++++
3906
3907 .. code-block:: nasm
3908
3909   buffer_load_dword v1, off, s[4:7], s1
3910   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3911   buffer_store_format_xy v[1:2], off, s[4:7], s1
3912   buffer_wbinvl1
3913   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3914
3915 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3916
3917 SMRD/SMEM
3918 +++++++++
3919
3920 .. code-block:: nasm
3921
3922   s_load_dword s1, s[2:3], 0xfc
3923   s_load_dwordx8 s[8:15], s[2:3], s4
3924   s_load_dwordx16 s[88:103], s[2:3], s4
3925   s_dcache_inv_vol
3926   s_memtime s[4:5]
3927
3928 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3929
3930 SOP1
3931 ++++
3932
3933 .. code-block:: nasm
3934
3935   s_mov_b32 s1, s2
3936   s_mov_b64 s[0:1], 0x80000000
3937   s_cmov_b32 s1, 200
3938   s_wqm_b64 s[2:3], s[4:5]
3939   s_bcnt0_i32_b64 s1, s[2:3]
3940   s_swappc_b64 s[2:3], s[4:5]
3941   s_cbranch_join s[4:5]
3942
3943 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3944
3945 SOP2
3946 ++++
3947
3948 .. code-block:: nasm
3949
3950   s_add_u32 s1, s2, s3
3951   s_and_b64 s[2:3], s[4:5], s[6:7]
3952   s_cselect_b32 s1, s2, s3
3953   s_andn2_b32 s2, s4, s6
3954   s_lshr_b64 s[2:3], s[4:5], s6
3955   s_ashr_i32 s2, s4, s6
3956   s_bfm_b64 s[2:3], s4, s6
3957   s_bfe_i64 s[2:3], s[4:5], s6
3958   s_cbranch_g_fork s[4:5], s[6:7]
3959
3960 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3961
3962 SOPC
3963 ++++
3964
3965 .. code-block:: nasm
3966
3967   s_cmp_eq_i32 s1, s2
3968   s_bitcmp1_b32 s1, s2
3969   s_bitcmp0_b64 s[2:3], s4
3970   s_setvskip s3, s5
3971
3972 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3973
3974 SOPP
3975 ++++
3976
3977 .. code-block:: nasm
3978
3979   s_barrier
3980   s_nop 2
3981   s_endpgm
3982   s_waitcnt 0 ; Wait for all counters to be 0
3983   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3984   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3985   s_sethalt 9
3986   s_sleep 10
3987   s_sendmsg 0x1
3988   s_sendmsg sendmsg(MSG_INTERRUPT)
3989   s_trap 1
3990
3991 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3992
3993 Unless otherwise mentioned, little verification is performed on the operands
3994 of SOPP Instructions, so it is up to the programmer to be familiar with the
3995 range or acceptable values.
3996
3997 VALU
3998 ++++
3999
4000 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4001 the assembler will automatically use optimal encoding based on its operands.
4002 To force specific encoding, one can add a suffix to the opcode of the instruction:
4003
4004 * _e32 for 32-bit VOP1/VOP2/VOPC
4005 * _e64 for 64-bit VOP3
4006 * _dpp for VOP_DPP
4007 * _sdwa for VOP_SDWA
4008
4009 VOP1/VOP2/VOP3/VOPC examples:
4010
4011 .. code-block:: nasm
4012
4013   v_mov_b32 v1, v2
4014   v_mov_b32_e32 v1, v2
4015   v_nop
4016   v_cvt_f64_i32_e32 v[1:2], v2
4017   v_floor_f32_e32 v1, v2
4018   v_bfrev_b32_e32 v1, v2
4019   v_add_f32_e32 v1, v2, v3
4020   v_mul_i32_i24_e64 v1, v2, 3
4021   v_mul_i32_i24_e32 v1, -3, v3
4022   v_mul_i32_i24_e32 v1, -100, v3
4023   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4024   v_max_f16_e32 v1, v2, v3
4025
4026 VOP_DPP examples:
4027
4028 .. code-block:: nasm
4029
4030   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4031   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4032   v_mov_b32 v0, v0 wave_shl:1
4033   v_mov_b32 v0, v0 row_mirror
4034   v_mov_b32 v0, v0 row_bcast:31
4035   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4036   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4037   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4038
4039 VOP_SDWA examples:
4040
4041 .. code-block:: nasm
4042
4043   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4044   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4045   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4046   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4047   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4048
4049 For full list of supported instructions, refer to "Vector ALU instructions".
4050
4051 HSA Code Object Directives
4052 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4053
4054 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4055 one can specify them with assembler directives.
4056
4057 .hsa_code_object_version major, minor
4058 +++++++++++++++++++++++++++++++++++++
4059
4060 *major* and *minor* are integers that specify the version of the HSA code
4061 object that will be generated by the assembler.
4062
4063 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4064 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4065
4066
4067 *major*, *minor*, and *stepping* are all integers that describe the instruction
4068 set architecture (ISA) version of the assembly program.
4069
4070 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4071 "AMD" and *arch* should always be equal to "AMDGPU".
4072
4073 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4074 from the value of the -mcpu option that is passed to the assembler.
4075
4076 .amdgpu_hsa_kernel (name)
4077 +++++++++++++++++++++++++
4078
4079 This directives specifies that the symbol with given name is a kernel entry point
4080 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4081
4082 .amd_kernel_code_t
4083 ++++++++++++++++++
4084
4085 This directive marks the beginning of a list of key / value pairs that are used
4086 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4087 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
4088 any amd_kernel_code_t values that are unspecified a default value will be
4089 used.  The default value for all keys is 0, with the following exceptions:
4090
4091 - *kernel_code_version_major* defaults to 1.
4092 - *machine_kind* defaults to 1.
4093 - *machine_version_major*, *machine_version_minor*, and
4094   *machine_version_stepping* are derived from the value of the -mcpu option
4095   that is passed to the assembler.
4096 - *kernel_code_entry_byte_offset* defaults to 256.
4097 - *wavefront_size* defaults to 6.
4098 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4099   *private_segment_alignment* default to 4. Note that alignments are specified
4100   as a power of two, so a value of **n** means an alignment of 2^ **n**.
4101
4102 The *.amd_kernel_code_t* directive must be placed immediately after the
4103 function label and before any instructions.
4104
4105 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4106 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4107
4108 Here is an example of a minimal amd_kernel_code_t specification:
4109
4110 .. code-block:: none
4111
4112    .hsa_code_object_version 1,0
4113    .hsa_code_object_isa
4114
4115    .hsatext
4116    .globl  hello_world
4117    .p2align 8
4118    .amdgpu_hsa_kernel hello_world
4119
4120    hello_world:
4121
4122       .amd_kernel_code_t
4123          enable_sgpr_kernarg_segment_ptr = 1
4124          is_ptr64 = 1
4125          compute_pgm_rsrc1_vgprs = 0
4126          compute_pgm_rsrc1_sgprs = 0
4127          compute_pgm_rsrc2_user_sgpr = 2
4128          kernarg_segment_byte_size = 8
4129          wavefront_sgpr_count = 2
4130          workitem_vgpr_count = 3
4131      .end_amd_kernel_code_t
4132
4133      s_load_dwordx2 s[0:1], s[0:1] 0x0
4134      v_mov_b32 v0, 3.14159
4135      s_waitcnt lgkmcnt(0)
4136      v_mov_b32 v1, s0
4137      v_mov_b32 v2, s1
4138      flat_store_dword v[1:2], v0
4139      s_endpgm
4140    .Lfunc_end0:
4141         .size   hello_world, .Lfunc_end0-hello_world
4142
4143 Additional Documentation
4144 ========================
4145
4146 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4147 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4148 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4149 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4150 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4151 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4152 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4153 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4154 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4155 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4156 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4157 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4158 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4159 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4160 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4161 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__