docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Architectures
  27      :name: amdgpu-architecture-table
  28
  29      ============ ==============================================================
  30      Architecture Description
  31      ============ ==============================================================
  32      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  33      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  34      ============ ==============================================================
  35
  36   .. table:: AMDGPU Vendors
  37      :name: amdgpu-vendor-table
  38
  39      ============ ==============================================================
  40      Vendor       Description
  41      ============ ==============================================================
  42      ``amd``      Can be used for all AMD GPU usage.
  43      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  44      ============ ==============================================================
  45
  46   .. table:: AMDGPU Operating Systems
  47      :name: amdgpu-os-table
  48
  49      ============== ============================================================
  50      OS             Description
  51      ============== ============================================================
  52      *<empty>*      Defaults to the *unknown* OS.
  53      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  54                     such as AMD's ROCm [AMD-ROCm]_.
  55      ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
  56                     runtime.
  57      ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
  58                     runtime.
  59      ============== ============================================================
  60
  61   .. table:: AMDGPU Environments
  62      :name: amdgpu-environment-table
  63
  64      ============ ==============================================================
  65      Environment  Description
  66      ============ ==============================================================
  67      *<empty>*    Defaults to ``opencl``.
  68      ``opencl``   OpenCL compute kernel (see :ref:`amdgpu-opencl`).
  69      ``hcc``      AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
  70      ============ ==============================================================
  71
  72 .. _amdgpu-processors:
  73
  74 Processors
  75 ----------
  76
  77 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  78 names from both the *Processor* and *Alternative Processor* can be used.
  79
  80   .. table:: AMDGPU Processors
  81      :name: amdgpu-processor-table
  82
  83      =========== =============== ============ ===== ========= ======= ==================
  84      Processor   Alternative     Target       dGPU/ Target    ROCm    Example
  85                  Processor       Triple       APU   Features  Support Products
  86                                  Architecture       Supported
  87                                                     [Default]
  88      =========== =============== ============ ===== ========= ======= ==================
  89      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
  90      -----------------------------------------------------------------------------------
  91      ``r600``                    ``r600``     dGPU
  92      ``r630``                    ``r600``     dGPU
  93      ``rs880``                   ``r600``     dGPU
  94      ``rv670``                   ``r600``     dGPU
  95      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
  96      -----------------------------------------------------------------------------------
  97      ``rv710``                   ``r600``     dGPU
  98      ``rv730``                   ``r600``     dGPU
  99      ``rv770``                   ``r600``     dGPU
 100      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
 101      -----------------------------------------------------------------------------------
 102      ``cedar``                   ``r600``     dGPU
 103      ``cypress``                 ``r600``     dGPU
 104      ``juniper``                 ``r600``     dGPU
 105      ``redwood``                 ``r600``     dGPU
 106      ``sumo``                    ``r600``     dGPU
 107      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 108      -----------------------------------------------------------------------------------
 109      ``barts``                   ``r600``     dGPU
 110      ``caicos``                  ``r600``     dGPU
 111      ``cayman``                  ``r600``     dGPU
 112      ``turks``                   ``r600``     dGPU
 113      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 114      -----------------------------------------------------------------------------------
 115      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
 116      ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
 117                  - ``oland``
 118                  - ``pitcairn``
 119                  - ``verde``
 120      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 121      -----------------------------------------------------------------------------------
 122      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
 123                                                                       - A6 Pro-7050B
 124                                                                       - A8-7100
 125                                                                       - A8 Pro-7150B
 126                                                                       - A10-7300
 127                                                                       - A10 Pro-7350B
 128                                                                       - FX-7500
 129                                                                       - A8-7200P
 130                                                                       - A10-7400P
 131                                                                       - FX-7600P
 132      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
 133                                                                       - FirePro W9100
 134                                                                       - FirePro S9150
 135                                                                       - FirePro S9170
 136      ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
 137                                                                       - Radeon R9 290x
 138                                                                       - Radeon R390
 139                                                                       - Radeon R390x
 140      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
 141                  - ``mullins``                                        - E1-2200
 142                                                                       - E1-2500
 143                                                                       - E2-3000
 144                                                                       - E2-3800
 145                                                                       - A4-5000
 146                                                                       - A4-5100
 147                                                                       - A6-5200
 148                                                                       - A4 Pro-3340B
 149      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
 150                                                                       - Radeon HD 8770
 151                                                                       - R7 260
 152                                                                       - R7 260X
 153      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 154      -----------------------------------------------------------------------------------
 155      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
 156                                                       [on]            - Pro A6-8500B
 157                                                                       - A8-8600P
 158                                                                       - Pro A8-8600B
 159                                                                       - FX-8800P
 160                                                                       - Pro A12-8800B
 161      \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
 162                                                       [on]            - Pro A10-8700B
 163                                                                       - A10-8780P
 164      \                           ``amdgcn``   APU   - xnack           - A10-9600P
 165                                                       [on]            - A10-9630P
 166                                                                       - A12-9700P
 167                                                                       - A12-9730P
 168                                                                       - FX-9800P
 169                                                                       - FX-9830P
 170      \                           ``amdgcn``   APU   - xnack           - E2-9010
 171                                                       [on]            - A6-9210
 172                                                                       - A9-9410
 173      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
 174                  - ``tonga``                          [off]           - FirePro S7100
 175                                                                       - FirePro W7100
 176                                                                       - Radeon R285
 177                                                                       - Radeon R9 380
 178                                                                       - Radeon R9 385
 179                                                                       - Mobile FirePro
 180                                                                         M7170
 181      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
 182                                                       [off]           - Radeon R9 Fury
 183                                                                       - Radeon R9 FuryX
 184                                                                       - Radeon Pro Duo
 185                                                                       - FirePro S9300x2
 186                                                                       - Radeon Instinct MI8
 187      \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
 188                                                       [off]           - Radeon RX 480
 189                                                                       - Radeon Instinct MI6
 190      \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
 191                                                       [off]
 192      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
 193                                                       [on]
 194      **GCN GFX9** [AMD-GCN-GFX9]_
 195      -----------------------------------------------------------------------------------
 196      ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
 197                                                       [off]             Frontier Edition
 198                                                                       - Radeon RX Vega 56
 199                                                                       - Radeon RX Vega 64
 200                                                                       - Radeon RX Vega 64
 201                                                                         Liquid
 202                                                                       - Radeon Instinct MI25
 203      ``gfx902``                  ``amdgcn``   APU   - xnack           *TBA*
 204                                                       [on]
 205                                                                       .. TODO
 206                                                                          Add product
 207                                                                          names.
 208      =========== =============== ============ ===== ========= ======= ==================
 209
 210 .. _amdgpu-target-features:
 211
 212 Target Features
 213 ---------------
 214
 215 Target features control how code is generated to support certain
 216 processor specific features. Not all target features are supported by
 217 all processors. The runtime must ensure that the features supported by
 218 the device used to execute the code match the features enabled when
 219 generating the code. A mismatch of features may result in incorrect
 220 execution, or a reduction in performance.
 221
 222 The target features supported by each processor, and the default value
 223 used if not specified explicitly, is listed in
 224 :ref:`amdgpu-processor-table`.
 225
 226 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
 227 target features.
 228
 229 For example:
 230
 231 ``-mxnack``
 232   Enable the ``xnack`` feature.
 233 ``-mno-xnack``
 234   Disable the ``xnack`` feature.
 235
 236   .. table:: AMDGPU Target Features
 237      :name: amdgpu-target-feature-table
 238
 239      ============== ==================================================
 240      Target Feature Description
 241      ============== ==================================================
 242      -m[no-]xnack   Enable/disable generating code that has
 243                     memory clauses that are compatible with
 244                     having XNACK replay enabled.
 245
 246                     This is used for demand paging and page
 247                     migration. If XNACK replay is enabled in
 248                     the device, then if a page fault occurs
 249                     the code may execute incorrectly if the
 250                     ``xnack`` feature is not enabled. Executing
 251                     code that has the feature enabled on a
 252                     device that does not have XNACK replay
 253                     enabled will execute correctly, but may
 254                     be less performant than code with the
 255                     feature disabled.
 256      ============== ==================================================
 257
 258 .. _amdgpu-address-spaces:
 259
 260 Address Spaces
 261 --------------
 262
 263 The AMDGPU backend uses the following address space mappings.
 264
 265 The memory space names used in the table, aside from the region memory space, is
 266 from the OpenCL standard.
 267
 268 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 269
 270   .. table:: Address Space Mapping
 271      :name: amdgpu-address-space-mapping-table
 272
 273      ================== =================
 274      LLVM Address Space Memory Space
 275      ================== =================
 276      0                  Generic (Flat)
 277      1                  Global
 278      2                  Region (GDS)
 279      3                  Local (group/LDS)
 280      4                  Constant
 281      5                  Private (Scratch)
 282      6                  Constant 32-bit
 283      ================== =================
 284
 285 .. _amdgpu-memory-scopes:
 286
 287 Memory Scopes
 288 -------------
 289
 290 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 291 backend memory model when the target triple OS is ``amdhsa`` (see
 292 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 293
 294 The memory model supported is based on the HSA memory model [HSA]_ which is
 295 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 296 relation is transitive over the synchonizes-with relation independent of scope,
 297 and synchonizes-with allows the memory scope instances to be inclusive (see
 298 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 299
 300 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 301 inclusion and requires the memory scopes to exactly match. However, this
 302 is conservatively correct for OpenCL.
 303
 304   .. table:: AMDHSA LLVM Sync Scopes
 305      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 306
 307      ================ ==========================================================
 308      LLVM Sync Scope  Description
 309      ================ ==========================================================
 310      *none*           The default: ``system``.
 311
 312                       Synchronizes with, and participates in modification and
 313                       seq_cst total orderings with, other operations (except
 314                       image operations) for all address spaces (except private,
 315                       or generic that accesses private) provided the other
 316                       operation's sync scope is:
 317
 318                       - ``system``.
 319                       - ``agent`` and executed by a thread on the same agent.
 320                       - ``workgroup`` and executed by a thread in the same
 321                         workgroup.
 322                       - ``wavefront`` and executed by a thread in the same
 323                         wavefront.
 324
 325      ``agent``        Synchronizes with, and participates in modification and
 326                       seq_cst total orderings with, other operations (except
 327                       image operations) for all address spaces (except private,
 328                       or generic that accesses private) provided the other
 329                       operation's sync scope is:
 330
 331                       - ``system`` or ``agent`` and executed by a thread on the
 332                         same agent.
 333                       - ``workgroup`` and executed by a thread in the same
 334                         workgroup.
 335                       - ``wavefront`` and executed by a thread in the same
 336                         wavefront.
 337
 338      ``workgroup``    Synchronizes with, and participates in modification and
 339                       seq_cst total orderings with, other operations (except
 340                       image operations) for all address spaces (except private,
 341                       or generic that accesses private) provided the other
 342                       operation's sync scope is:
 343
 344                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 345                         thread in the same workgroup.
 346                       - ``wavefront`` and executed by a thread in the same
 347                         wavefront.
 348
 349      ``wavefront``    Synchronizes with, and participates in modification and
 350                       seq_cst total orderings with, other operations (except
 351                       image operations) for all address spaces (except private,
 352                       or generic that accesses private) provided the other
 353                       operation's sync scope is:
 354
 355                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 356                         and executed by a thread in the same wavefront.
 357
 358      ``singlethread`` Only synchronizes with, and participates in modification
 359                       and seq_cst total orderings with, other operations (except
 360                       image operations) running in the same thread for all
 361                       address spaces (for example, in signal handlers).
 362      ================ ==========================================================
 363
 364 AMDGPU Intrinsics
 365 -----------------
 366
 367 The AMDGPU backend implements the following intrinsics.
 368
 369 *This section is WIP.*
 370
 371 .. TODO
 372    List AMDGPU intrinsics
 373
 374 Code Object
 375 ===========
 376
 377 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 378 can be linked by ``lld`` to produce a standard ELF shared code object which can
 379 be loaded and executed on an AMDGPU target.
 380
 381 Header
 382 ------
 383
 384 The AMDGPU backend uses the following ELF header:
 385
 386   .. table:: AMDGPU ELF Header
 387      :name: amdgpu-elf-header-table
 388
 389      ========================== ===============================
 390      Field                      Value
 391      ========================== ===============================
 392      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 393      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 394      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 395                                 - ``ELFOSABI_AMDGPU_HSA``
 396                                 - ``ELFOSABI_AMDGPU_PAL``
 397                                 - ``ELFOSABI_AMDGPU_MESA3D``
 398      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
 399                                 - ``ELFABIVERSION_AMDGPU_PAL``
 400                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
 401      ``e_type``                 - ``ET_REL``
 402                                 - ``ET_DYN``
 403      ``e_machine``              ``EM_AMDGPU``
 404      ``e_entry``                0
 405      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
 406      ========================== ===============================
 407
 408 ..
 409
 410   .. table:: AMDGPU ELF Header Enumeration Values
 411      :name: amdgpu-elf-header-enumeration-values-table
 412
 413      =============================== =====
 414      Name                            Value
 415      =============================== =====
 416      ``EM_AMDGPU``                   224
 417      ``ELFOSABI_NONE``               0
 418      ``ELFOSABI_AMDGPU_HSA``         64
 419      ``ELFOSABI_AMDGPU_PAL``         65
 420      ``ELFOSABI_AMDGPU_MESA3D``      66
 421      ``ELFABIVERSION_AMDGPU_HSA``    1
 422      ``ELFABIVERSION_AMDGPU_PAL``    0
 423      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 424      =============================== =====
 425
 426 ``e_ident[EI_CLASS]``
 427   The ELF class is:
 428
 429   * ``ELFCLASS32`` for ``r600`` architecture.
 430
 431   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
 432     bit applications.
 433
 434 ``e_ident[EI_DATA]``
 435   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 436
 437 ``e_ident[EI_OSABI]``
 438   One of the following AMD GPU architecture specific OS ABIs
 439   (see :ref:`amdgpu-os-table`):
 440
 441   * ``ELFOSABI_NONE`` for *unknown* OS.
 442
 443   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
 444
 445   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
 446
 447   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
 448
 449 ``e_ident[EI_ABIVERSION]``
 450   The ABI version of the AMD GPU architecture specific OS ABI to which the code
 451   object conforms:
 452
 453   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 454     runtime ABI.
 455
 456   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 457     runtime ABI.
 458
 459   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
 460     3D runtime ABI.
 461
 462 ``e_type``
 463   Can be one of the following values:
 464
 465
 466   ``ET_REL``
 467     The type produced by the AMD GPU backend compiler as it is relocatable code
 468     object.
 469
 470   ``ET_DYN``
 471     The type produced by the linker as it is a shared code object.
 472
 473   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 474
 475 ``e_machine``
 476   The value ``EM_AMDGPU`` is used for the machine for all processors supported
 477   by the ``r600`` and ``amdgcn`` architectures (see
 478   :ref:`amdgpu-processor-table`). The specific processor is specified in the
 479   ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
 480   :ref:`amdgpu-elf-header-e_flags-table`).
 481
 482 ``e_entry``
 483   The entry point is 0 as the entry points for individual kernels must be
 484   selected in order to invoke them through AQL packets.
 485
 486 ``e_flags``
 487   The AMDGPU backend uses the following ELF header flags:
 488
 489   .. table:: AMDGPU ELF Header ``e_flags``
 490      :name: amdgpu-elf-header-e_flags-table
 491
 492      ================================= ========== =============================
 493      Name                              Value      Description
 494      ================================= ========== =============================
 495      **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
 496      -------------------------------------------- -----------------------------
 497      ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
 498                                                   mask for
 499                                                   ``EF_AMDGPU_MACH_xxx`` values
 500                                                   defined in
 501                                                   :ref:`amdgpu-ef-amdgpu-mach-table`.
 502      ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
 503                                                   target feature is
 504                                                   enabled for all code
 505                                                   contained in the code object.
 506                                                   See
 507                                                   :ref:`amdgpu-target-features`.
 508      ================================= ========== =============================
 509
 510   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
 511      :name: amdgpu-ef-amdgpu-mach-table
 512
 513      ================================= ========== =============================
 514      Name                              Value      Description (see
 515                                                   :ref:`amdgpu-processor-table`)
 516      ================================= ========== =============================
 517      ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
 518      ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
 519      ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
 520      ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
 521      ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
 522      ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
 523      ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
 524      ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
 525      ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
 526      ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
 527      ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
 528      ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
 529      ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
 530      ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
 531      ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
 532      ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
 533      ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
 534      *reserved*                        0x011 -    Reserved for ``r600``
 535                                        0x01f      architecture processors.
 536      ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
 537      ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
 538      ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
 539      ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
 540      ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
 541      ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
 542      ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
 543      *reserved*                        0x027      Reserved.
 544      ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
 545      ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
 546      ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
 547      ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
 548      ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
 549      ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
 550      *reserved*                        0x02e      Reserved.
 551      *reserved*                        0x02f      Reserved.
 552      *reserved*                        0x030      Reserved.
 553      ================================= ========== =============================
 554
 555 Sections
 556 --------
 557
 558 An AMDGPU target ELF code object has the standard ELF sections which include:
 559
 560   .. table:: AMDGPU ELF Sections
 561      :name: amdgpu-elf-sections-table
 562
 563      ================== ================ =================================
 564      Name               Type             Attributes
 565      ================== ================ =================================
 566      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 567      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 568      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 569      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 570      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 571      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 572      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 573      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 574      ``.note``          ``SHT_NOTE``     *none*
 575      ``.rela``\ *name*  ``SHT_RELA``     *none*
 576      ``.rela.dyn``      ``SHT_RELA``     *none*
 577      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 578      ``.shstrtab``      ``SHT_STRTAB``   *none*
 579      ``.strtab``        ``SHT_STRTAB``   *none*
 580      ``.symtab``        ``SHT_SYMTAB``   *none*
 581      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 582      ================== ================ =================================
 583
 584 These sections have their standard meanings (see [ELF]_) and are only generated
 585 if needed.
 586
 587 ``.debug``\ *\**
 588   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 589   DWARF produced by the AMDGPU backend.
 590
 591 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 592   The standard sections used by a dynamic loader.
 593
 594 ``.note``
 595   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 596   backend.
 597
 598 ``.rela``\ *name*, ``.rela.dyn``
 599   For relocatable code objects, *name* is the name of the section that the
 600   relocation records apply. For example, ``.rela.text`` is the section name for
 601   relocation records associated with the ``.text`` section.
 602
 603   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 604   records from each of the relocatable code object's ``.rela``\ *name* sections.
 605
 606   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 607   the AMDGPU backend.
 608
 609 ``.text``
 610   The executable machine code for the kernels and functions they call. Generated
 611   as position independent code. See :ref:`amdgpu-code-conventions` for
 612   information on conventions used in the isa generation.
 613
 614 .. _amdgpu-note-records:
 615
 616 Note Records
 617 ------------
 618
 619 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
 620 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
 621 aligned. In addition, minimal zero byte padding must be generated to ensure the
 622 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
 623 ``.note`` section must be at least 4 to indicate at least 8 byte alignment.
 624
 625 The AMDGPU backend code object uses the following ELF note records in the
 626 ``.note`` section. The *Description* column specifies the layout of the note
 627 record's ``desc`` field. All fields are consecutive bytes. Note records with
 628 variable size strings have a corresponding ``*_size`` field that specifies the
 629 number of bytes, including the terminating null character, in the string. The
 630 string(s) come immediately after the preceding fields.
 631
 632 Additional note records can be present.
 633
 634   .. table:: AMDGPU ELF Note Records
 635      :name: amdgpu-elf-note-records-table
 636
 637      ===== ============================== ======================================
 638      Name  Type                           Description
 639      ===== ============================== ======================================
 640      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 641      ===== ============================== ======================================
 642
 643 ..
 644
 645   .. table:: AMDGPU ELF Note Record Enumeration Values
 646      :name: amdgpu-elf-note-record-enumeration-values-table
 647
 648      ============================== =====
 649      Name                           Value
 650      ============================== =====
 651      *reserved*                       0-9
 652      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 653      *reserved*                        11
 654      ============================== =====
 655
 656 ``NT_AMD_AMDGPU_HSA_METADATA``
 657   Specifies extensible metadata associated with the code objects executed on HSA
 658   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 659   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 660   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
 661   object metadata string.
 662
 663 .. _amdgpu-symbols:
 664
 665 Symbols
 666 -------
 667
 668 Symbols include the following:
 669
 670   .. table:: AMDGPU ELF Symbols
 671      :name: amdgpu-elf-symbols-table
 672
 673      ===================== ============== ============= ==================
 674      Name                  Type           Section       Description
 675      ===================== ============== ============= ==================
 676      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
 677                                           - ``.rodata``
 678                                           - ``.bss``
 679      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
 680      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
 681      ===================== ============== ============= ==================
 682
 683 Global variable
 684   Global variables both used and defined by the compilation unit.
 685
 686   If the symbol is defined in the compilation unit then it is allocated in the
 687   appropriate section according to if it has initialized data or is readonly.
 688
 689   If the symbol is external then its section is ``STN_UNDEF`` and the loader
 690   will resolve relocations using the definition provided by another code object
 691   or explicitly defined by the runtime.
 692
 693   All global symbols, whether defined in the compilation unit or external, are
 694   accessed by the machine code indirectly through a GOT table entry. This
 695   allows them to be preemptable. The GOT table is only supported when the target
 696   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 697
 698   .. TODO
 699      Add description of linked shared object symbols. Seems undefined symbols
 700      are marked as STT_NOTYPE.
 701
 702 Kernel descriptor
 703   Every HSA kernel has an associated kernel descriptor. It is the address of the
 704   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 705   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 706   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 707
 708 Kernel entry point
 709   Every HSA kernel also has a symbol for its machine code entry point.
 710
 711 .. _amdgpu-relocation-records:
 712
 713 Relocation Records
 714 ------------------
 715
 716 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 717 relocatable fields are:
 718
 719 ``word32``
 720   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 721   alignment. These values use the same byte order as other word values in the
 722   AMD GPU architecture.
 723
 724 ``word64``
 725   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 726   alignment. These values use the same byte order as other word values in the
 727   AMD GPU architecture.
 728
 729 Following notations are used for specifying relocation calculations:
 730
 731 **A**
 732   Represents the addend used to compute the value of the relocatable field.
 733
 734 **G**
 735   Represents the offset into the global offset table at which the relocation
 736   entry's symbol will reside during execution.
 737
 738 **GOT**
 739   Represents the address of the global offset table.
 740
 741 **P**
 742   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
 743   of the storage unit being relocated (computed using ``r_offset``).
 744
 745 **S**
 746   Represents the value of the symbol whose index resides in the relocation
 747   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
 748
 749 **B**
 750   Represents the base address of a loaded executable or shared object which is
 751   the difference between the ELF address and the actual load address. Relocations
 752   using this are only valid in executable or shared objects.
 753
 754 The following relocation types are supported:
 755
 756   .. table:: AMDGPU ELF Relocation Records
 757      :name: amdgpu-elf-relocation-records-table
 758
 759      ========================== ======= =====  ==========  ==============================
 760      Relocation Type            Kind    Value  Field       Calculation
 761      ========================== ======= =====  ==========  ==============================
 762      ``R_AMDGPU_NONE``                  0      *none*      *none*
 763      ``R_AMDGPU_ABS32_LO``      Dynamic 1      ``word32``  (S + A) & 0xFFFFFFFF
 764      ``R_AMDGPU_ABS32_HI``      Dynamic 2      ``word32``  (S + A) >> 32
 765      ``R_AMDGPU_ABS64``         Dynamic 3      ``word64``  S + A
 766      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
 767      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
 768      ``R_AMDGPU_ABS32``         Static  6      ``word32``  S + A
 769      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
 770      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 771      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
 772      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
 773      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
 774      *reserved*                         12
 775      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
 776      ========================== ======= =====  ==========  ==============================
 777
 778 .. _amdgpu-dwarf:
 779
 780 DWARF
 781 -----
 782
 783 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
 784 information that maps the code object executable code and data to the source
 785 language constructs. It can be used by tools such as debuggers and profilers.
 786
 787 Address Space Mapping
 788 ~~~~~~~~~~~~~~~~~~~~~
 789
 790 The following address space mapping is used:
 791
 792   .. table:: AMDGPU DWARF Address Space Mapping
 793      :name: amdgpu-dwarf-address-space-mapping-table
 794
 795      =================== =================
 796      DWARF Address Space Memory Space
 797      =================== =================
 798      1                   Private (Scratch)
 799      2                   Local (group/LDS)
 800      *omitted*           Global
 801      *omitted*           Constant
 802      *omitted*           Generic (Flat)
 803      *not supported*     Region (GDS)
 804      =================== =================
 805
 806 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
 807 used in the table.
 808
 809 An ``address_class`` attribute is generated on pointer type DIEs to specify the
 810 DWARF address space of the value of the pointer when it is in the *private* or
 811 *local* address space. Otherwise the attribute is omitted.
 812
 813 An ``XDEREF`` operation is generated in location list expressions for variables
 814 that are allocated in the *private* and *local* address space. Otherwise no
 815 ``XDREF`` is omitted.
 816
 817 Register Mapping
 818 ~~~~~~~~~~~~~~~~
 819
 820 *This section is WIP.*
 821
 822 .. TODO
 823    Define DWARF register enumeration.
 824
 825    If want to present a wavefront state then should expose vector registers as
 826    64 wide (rather than per work-item view that LLVM uses). Either as separate
 827    registers, or a 64x4 byte single register. In either case use a new LANE op
 828    (akin to XDREF) to select the current lane usage in a location
 829    expression. This would also allow scalar register spilling to vector register
 830    lanes to be expressed (currently no debug information is being generated for
 831    spilling). If choose a wide single register approach then use LANE in
 832    conjunction with PIECE operation to select the dword part of the register for
 833    the current lane. If the separate register approach then use LANE to select
 834    the register.
 835
 836 Source Text
 837 ~~~~~~~~~~~
 838
 839 Source text for online-compiled programs (e.g. those compiled by the OpenCL
 840 runtime) may be embedded into the DWARF v5 line table using the ``clang
 841 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
 842
 843 For example:
 844
 845 ``-gembed-source``
 846   Enable the embedded source DWARF v5 extension.
 847 ``-gno-embed-source``
 848   Disable the embedded source DWARF v5 extension.
 849
 850   .. table:: AMDGPU Debug Options
 851      :name: amdgpu-debug-options
 852
 853      ==================== ==================================================
 854      Debug Flag           Description
 855      ==================== ==================================================
 856      -g[no-]embed-source  Enable/disable embedding source text in DWARF
 857                           debug sections. Useful for environments where
 858                           source cannot be written to disk, such as
 859                           when performing online compilation.
 860      ==================== ==================================================
 861
 862 This option enables one extended content types in the DWARF v5 Line Number
 863 Program Header, which is used to encode embedded source.
 864
 865   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
 866      :name: amdgpu-dwarf-extended-content-types
 867
 868      ============================  ======================
 869      Content Type                  Form
 870      ============================  ======================
 871      ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
 872      ============================  ======================
 873
 874 The source field will contain the UTF-8 encoded, null-terminated source text
 875 with ``'\n'`` line endings. When the source field is present, consumers can use
 876 the embedded source instead of attempting to discover the source on disk. When
 877 the source field is absent, consumers can access the file to get the source
 878 text.
 879
 880 The above content type appears in the ``file_name_entry_format`` field of the
 881 line table prologue, and its corresponding value appear in the ``file_names``
 882 field. The current encoding of the content type is documented in table
 883 :ref:`amdgpu-dwarf-extended-content-types-encoding`
 884
 885   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
 886      :name: amdgpu-dwarf-extended-content-types-encoding
 887
 888      ============================  ====================
 889      Content Type                  Value
 890      ============================  ====================
 891      ``DW_LNCT_LLVM_source``       0x2001
 892      ============================  ====================
 893
 894 .. _amdgpu-code-conventions:
 895
 896 Code Conventions
 897 ================
 898
 899 This section provides code conventions used for each supported target triple OS
 900 (see :ref:`amdgpu-target-triples`).
 901
 902 AMDHSA
 903 ------
 904
 905 This section provides code conventions used when the target triple OS is
 906 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 907
 908 .. _amdgpu-amdhsa-hsa-code-object-metadata:
 909
 910 Code Object Metadata
 911 ~~~~~~~~~~~~~~~~~~~~
 912
 913 The code object metadata specifies extensible metadata associated with the code
 914 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
 915 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
 916 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
 917 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
 918 information necessary to support the ROCM kernel queries. For example, the
 919 segment sizes needed in a dispatch packet. In addition, a high level language
 920 runtime may require other information to be included. For example, the AMD
 921 OpenCL runtime records kernel argument information.
 922
 923 The metadata is specified as a YAML formatted string (see [YAML]_ and
 924 :doc:`YamlIO`).
 925
 926 .. TODO
 927    Is the string null terminated? It probably should not if YAML allows it to
 928    contain null characters, otherwise it should be.
 929
 930 The metadata is represented as a single YAML document comprised of the mapping
 931 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 932 referenced tables.
 933
 934 For boolean values, the string values of ``false`` and ``true`` are used for
 935 false and true respectively.
 936
 937 Additional information can be added to the mappings. To avoid conflicts, any
 938 non-AMD key names should be prefixed by "*vendor-name*.".
 939
 940   .. table:: AMDHSA Code Object Metadata Mapping
 941      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 942
 943      ========== ============== ========= =======================================
 944      String Key Value Type     Required? Description
 945      ========== ============== ========= =======================================
 946      "Version"  sequence of    Required  - The first integer is the major
 947                 2 integers                 version. Currently 1.
 948                                          - The second integer is the minor
 949                                            version. Currently 0.
 950      "Printf"   sequence of              Each string is encoded information
 951                 strings                  about a printf function call. The
 952                                          encoded information is organized as
 953                                          fields separated by colon (':'):
 954
 955                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
 956
 957                                          where:
 958
 959                                          ``ID``
 960                                            A 32 bit integer as a unique id for
 961                                            each printf function call
 962
 963                                          ``N``
 964                                            A 32 bit integer equal to the number
 965                                            of arguments of printf function call
 966                                            minus 1
 967
 968                                          ``S[i]`` (where i = 0, 1, ... , N-1)
 969                                            32 bit integers for the size in bytes
 970                                            of the i-th FormatString argument of
 971                                            the printf function call
 972
 973                                          FormatString
 974                                            The format string passed to the
 975                                            printf function call.
 976      "Kernels"  sequence of    Required  Sequence of the mappings for each
 977                 mapping                  kernel in the code object. See
 978                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
 979                                          for the definition of the mapping.
 980      ========== ============== ========= =======================================
 981
 982 ..
 983
 984   .. table:: AMDHSA Code Object Kernel Metadata Mapping
 985      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
 986
 987      ================= ============== ========= ================================
 988      String Key        Value Type     Required? Description
 989      ================= ============== ========= ================================
 990      "Name"            string         Required  Source name of the kernel.
 991      "SymbolName"      string         Required  Name of the kernel
 992                                                 descriptor ELF symbol.
 993      "Language"        string                   Source language of the kernel.
 994                                                 Values include:
 995
 996                                                 - "OpenCL C"
 997                                                 - "OpenCL C++"
 998                                                 - "HCC"
 999                                                 - "OpenMP"
1000
1001      "LanguageVersion" sequence of              - The first integer is the major
1002                        2 integers                 version.
1003                                                 - The second integer is the
1004                                                   minor version.
1005      "Attrs"           mapping                  Mapping of kernel attributes.
1006                                                 See
1007                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1008                                                 for the mapping definition.
1009      "Args"            sequence of              Sequence of mappings of the
1010                        mapping                  kernel arguments. See
1011                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1012                                                 for the definition of the mapping.
1013      "CodeProps"       mapping                  Mapping of properties related to
1014                                                 the kernel code. See
1015                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1016                                                 for the mapping definition.
1017      ================= ============== ========= ================================
1018
1019 ..
1020
1021   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1022      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1023
1024      =================== ============== ========= ==============================
1025      String Key          Value Type     Required? Description
1026      =================== ============== ========= ==============================
1027      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1028                          3 integers               must be >=1 and the dispatch
1029                                                   work-group size X, Y, Z must
1030                                                   correspond to the specified
1031                                                   values. Defaults to 0, 0, 0.
1032
1033                                                   Corresponds to the OpenCL
1034                                                   ``reqd_work_group_size``
1035                                                   attribute.
1036      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1037                          3 integers               X, Y, Z is likely to be the
1038                                                   specified values.
1039
1040                                                   Corresponds to the OpenCL
1041                                                   ``work_group_size_hint``
1042                                                   attribute.
1043      "VecTypeHint"       string                   The name of a scalar or vector
1044                                                   type.
1045
1046                                                   Corresponds to the OpenCL
1047                                                   ``vec_type_hint`` attribute.
1048
1049      "RuntimeHandle"     string                   The external symbol name
1050                                                   associated with a kernel.
1051                                                   OpenCL runtime allocates a
1052                                                   global buffer for the symbol
1053                                                   and saves the kernel's address
1054                                                   to it, which is used for
1055                                                   device side enqueueing. Only
1056                                                   available for device side
1057                                                   enqueued kernels.
1058      =================== ============== ========= ==============================
1059
1060 ..
1061
1062   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1063      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1064
1065      ================= ============== ========= ================================
1066      String Key        Value Type     Required? Description
1067      ================= ============== ========= ================================
1068      "Name"            string                   Kernel argument name.
1069      "TypeName"        string                   Kernel argument type name.
1070      "Size"            integer        Required  Kernel argument size in bytes.
1071      "Align"           integer        Required  Kernel argument alignment in
1072                                                 bytes. Must be a power of two.
1073      "ValueKind"       string         Required  Kernel argument kind that
1074                                                 specifies how to set up the
1075                                                 corresponding argument.
1076                                                 Values include:
1077
1078                                                 "ByValue"
1079                                                   The argument is copied
1080                                                   directly into the kernarg.
1081
1082                                                 "GlobalBuffer"
1083                                                   A global address space pointer
1084                                                   to the buffer data is passed
1085                                                   in the kernarg.
1086
1087                                                 "DynamicSharedPointer"
1088                                                   A group address space pointer
1089                                                   to dynamically allocated LDS
1090                                                   is passed in the kernarg.
1091
1092                                                 "Sampler"
1093                                                   A global address space
1094                                                   pointer to a S# is passed in
1095                                                   the kernarg.
1096
1097                                                 "Image"
1098                                                   A global address space
1099                                                   pointer to a T# is passed in
1100                                                   the kernarg.
1101
1102                                                 "Pipe"
1103                                                   A global address space pointer
1104                                                   to an OpenCL pipe is passed in
1105                                                   the kernarg.
1106
1107                                                 "Queue"
1108                                                   A global address space pointer
1109                                                   to an OpenCL device enqueue
1110                                                   queue is passed in the
1111                                                   kernarg.
1112
1113                                                 "HiddenGlobalOffsetX"
1114                                                   The OpenCL grid dispatch
1115                                                   global offset for the X
1116                                                   dimension is passed in the
1117                                                   kernarg.
1118
1119                                                 "HiddenGlobalOffsetY"
1120                                                   The OpenCL grid dispatch
1121                                                   global offset for the Y
1122                                                   dimension is passed in the
1123                                                   kernarg.
1124
1125                                                 "HiddenGlobalOffsetZ"
1126                                                   The OpenCL grid dispatch
1127                                                   global offset for the Z
1128                                                   dimension is passed in the
1129                                                   kernarg.
1130
1131                                                 "HiddenNone"
1132                                                   An argument that is not used
1133                                                   by the kernel. Space needs to
1134                                                   be left for it, but it does
1135                                                   not need to be set up.
1136
1137                                                 "HiddenPrintfBuffer"
1138                                                   A global address space pointer
1139                                                   to the runtime printf buffer
1140                                                   is passed in kernarg.
1141
1142                                                 "HiddenDefaultQueue"
1143                                                   A global address space pointer
1144                                                   to the OpenCL device enqueue
1145                                                   queue that should be used by
1146                                                   the kernel by default is
1147                                                   passed in the kernarg.
1148
1149                                                 "HiddenCompletionAction"
1150                                                   A global address space pointer
1151                                                   to help link enqueued kernels into
1152                                                   the ancestor tree for determining
1153                                                   when the parent kernel has finished.
1154
1155      "ValueType"       string         Required  Kernel argument value type. Only
1156                                                 present if "ValueKind" is
1157                                                 "ByValue". For vector data
1158                                                 types, the value is for the
1159                                                 element type. Values include:
1160
1161                                                 - "Struct"
1162                                                 - "I8"
1163                                                 - "U8"
1164                                                 - "I16"
1165                                                 - "U16"
1166                                                 - "F16"
1167                                                 - "I32"
1168                                                 - "U32"
1169                                                 - "F32"
1170                                                 - "I64"
1171                                                 - "U64"
1172                                                 - "F64"
1173
1174                                                 .. TODO
1175                                                    How can it be determined if a
1176                                                    vector type, and what size
1177                                                    vector?
1178      "PointeeAlign"    integer                  Alignment in bytes of pointee
1179                                                 type for pointer type kernel
1180                                                 argument. Must be a power
1181                                                 of 2. Only present if
1182                                                 "ValueKind" is
1183                                                 "DynamicSharedPointer".
1184      "AddrSpaceQual"   string                   Kernel argument address space
1185                                                 qualifier. Only present if
1186                                                 "ValueKind" is "GlobalBuffer" or
1187                                                 "DynamicSharedPointer". Values
1188                                                 are:
1189
1190                                                 - "Private"
1191                                                 - "Global"
1192                                                 - "Constant"
1193                                                 - "Local"
1194                                                 - "Generic"
1195                                                 - "Region"
1196
1197                                                 .. TODO
1198                                                    Is GlobalBuffer only Global
1199                                                    or Constant? Is
1200                                                    DynamicSharedPointer always
1201                                                    Local? Can HCC allow Generic?
1202                                                    How can Private or Region
1203                                                    ever happen?
1204      "AccQual"         string                   Kernel argument access
1205                                                 qualifier. Only present if
1206                                                 "ValueKind" is "Image" or
1207                                                 "Pipe". Values
1208                                                 are:
1209
1210                                                 - "ReadOnly"
1211                                                 - "WriteOnly"
1212                                                 - "ReadWrite"
1213
1214                                                 .. TODO
1215                                                    Does this apply to
1216                                                    GlobalBuffer?
1217      "ActualAccQual"   string                   The actual memory accesses
1218                                                 performed by the kernel on the
1219                                                 kernel argument. Only present if
1220                                                 "ValueKind" is "GlobalBuffer",
1221                                                 "Image", or "Pipe". This may be
1222                                                 more restrictive than indicated
1223                                                 by "AccQual" to reflect what the
1224                                                 kernel actual does. If not
1225                                                 present then the runtime must
1226                                                 assume what is implied by
1227                                                 "AccQual" and "IsConst". Values
1228                                                 are:
1229
1230                                                 - "ReadOnly"
1231                                                 - "WriteOnly"
1232                                                 - "ReadWrite"
1233
1234      "IsConst"         boolean                  Indicates if the kernel argument
1235                                                 is const qualified. Only present
1236                                                 if "ValueKind" is
1237                                                 "GlobalBuffer".
1238
1239      "IsRestrict"      boolean                  Indicates if the kernel argument
1240                                                 is restrict qualified. Only
1241                                                 present if "ValueKind" is
1242                                                 "GlobalBuffer".
1243
1244      "IsVolatile"      boolean                  Indicates if the kernel argument
1245                                                 is volatile qualified. Only
1246                                                 present if "ValueKind" is
1247                                                 "GlobalBuffer".
1248
1249      "IsPipe"          boolean                  Indicates if the kernel argument
1250                                                 is pipe qualified. Only present
1251                                                 if "ValueKind" is "Pipe".
1252
1253                                                 .. TODO
1254                                                    Can GlobalBuffer be pipe
1255                                                    qualified?
1256      ================= ============== ========= ================================
1257
1258 ..
1259
1260   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1261      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1262
1263      ============================ ============== ========= =====================
1264      String Key                   Value Type     Required? Description
1265      ============================ ============== ========= =====================
1266      "KernargSegmentSize"         integer        Required  The size in bytes of
1267                                                            the kernarg segment
1268                                                            that holds the values
1269                                                            of the arguments to
1270                                                            the kernel.
1271      "GroupSegmentFixedSize"      integer        Required  The amount of group
1272                                                            segment memory
1273                                                            required by a
1274                                                            work-group in
1275                                                            bytes. This does not
1276                                                            include any
1277                                                            dynamically allocated
1278                                                            group segment memory
1279                                                            that may be added
1280                                                            when the kernel is
1281                                                            dispatched.
1282      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1283                                                            private address space
1284                                                            memory required for a
1285                                                            work-item in
1286                                                            bytes. If the kernel
1287                                                            uses a dynamic call
1288                                                            stack then additional
1289                                                            space must be added
1290                                                            to this value for the
1291                                                            call stack.
1292      "KernargSegmentAlign"        integer        Required  The maximum byte
1293                                                            alignment of
1294                                                            arguments in the
1295                                                            kernarg segment. Must
1296                                                            be a power of 2.
1297      "WavefrontSize"              integer        Required  Wavefront size. Must
1298                                                            be a power of 2.
1299      "NumSGPRs"                   integer        Required  Number of scalar
1300                                                            registers used by a
1301                                                            wavefront for
1302                                                            GFX6-GFX9. This
1303                                                            includes the special
1304                                                            SGPRs for VCC, Flat
1305                                                            Scratch (GFX7-GFX9)
1306                                                            and XNACK (for
1307                                                            GFX8-GFX9). It does
1308                                                            not include the 16
1309                                                            SGPR added if a trap
1310                                                            handler is
1311                                                            enabled. It is not
1312                                                            rounded up to the
1313                                                            allocation
1314                                                            granularity.
1315      "NumVGPRs"                   integer        Required  Number of vector
1316                                                            registers used by
1317                                                            each work-item for
1318                                                            GFX6-GFX9
1319      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1320                                                            work-group size
1321                                                            supported by the
1322                                                            kernel in work-items.
1323                                                            Must be >=1 and
1324                                                            consistent with
1325                                                            ReqdWorkGroupSize if
1326                                                            not 0, 0, 0.
1327      "NumSpilledSGPRs"            integer                  Number of stores from
1328                                                            a scalar register to
1329                                                            a register allocator
1330                                                            created spill
1331                                                            location.
1332      "NumSpilledVGPRs"            integer                  Number of stores from
1333                                                            a vector register to
1334                                                            a register allocator
1335                                                            created spill
1336                                                            location.
1337      ============================ ============== ========= =====================
1338
1339 ..
1340
1341 Kernel Dispatch
1342 ~~~~~~~~~~~~~~~
1343
1344 The HSA architected queuing language (AQL) defines a user space memory interface
1345 that can be used to control the dispatch of kernels, in an agent independent
1346 way. An agent can have zero or more AQL queues created for it using the ROCm
1347 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1348 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1349 mechanics and packet layouts.
1350
1351 The packet processor of a kernel agent is responsible for detecting and
1352 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1353 packet processor is implemented by the hardware command processor (CP),
1354 asynchronous dispatch controller (ADC) and shader processor input controller
1355 (SPI).
1356
1357 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1358 mode driver to initialize and register the AQL queue with CP.
1359
1360 To dispatch a kernel the following actions are performed. This can occur in the
1361 CPU host program, or from an HSA kernel executing on a GPU.
1362
1363 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1364    executed is obtained.
1365 2. A pointer to the kernel descriptor (see
1366    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1367    obtained. It must be for a kernel that is contained in a code object that that
1368    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1369    associated.
1370 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1371    for a memory region with the kernarg property for the kernel agent that will
1372    execute the kernel. It must be at least 16 byte aligned.
1373 4. Kernel argument values are assigned to the kernel argument memory
1374    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1375    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1376    memory in the same way constant memory is accessed. (Note that the HSA
1377    specification allows an implementation to copy the kernel argument contents to
1378    another location that is accessed by the kernel.)
1379 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1380    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1381    packet. The packet must be set up, and the final write must use an atomic
1382    store release to set the packet kind to ensure the packet contents are
1383    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1384    notify the kernel agent that the AQL queue has been updated. These rules, and
1385    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1386    System Architecture Specification* [HSA]_.
1387 6. A kernel dispatch packet includes information about the actual dispatch,
1388    such as grid and work-group size, together with information from the code
1389    object about the kernel, such as segment sizes. The ROCm runtime queries on
1390    the kernel symbol can be used to obtain the code object values which are
1391    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1392 7. CP executes micro-code and is responsible for detecting and setting up the
1393    GPU to execute the wavefronts of a kernel dispatch.
1394 8. CP ensures that when the a wavefront starts executing the kernel machine
1395    code, the scalar general purpose registers (SGPR) and vector general purpose
1396    registers (VGPR) are set up as required by the machine code. The required
1397    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1398    register state is defined in
1399    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1400 9. The prolog of the kernel machine code (see
1401    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1402    before continuing executing the machine code that corresponds to the kernel.
1403 10. When the kernel dispatch has completed execution, CP signals the completion
1404     signal specified in the kernel dispatch packet if not 0.
1405
1406 .. _amdgpu-amdhsa-memory-spaces:
1407
1408 Memory Spaces
1409 ~~~~~~~~~~~~~
1410
1411 The memory space properties are:
1412
1413   .. table:: AMDHSA Memory Spaces
1414      :name: amdgpu-amdhsa-memory-spaces-table
1415
1416      ================= =========== ======== ======= ==================
1417      Memory Space Name HSA Segment Hardware Address NULL Value
1418                        Name        Name     Size
1419      ================= =========== ======== ======= ==================
1420      Private           private     scratch  32      0x00000000
1421      Local             group       LDS      32      0xFFFFFFFF
1422      Global            global      global   64      0x0000000000000000
1423      Constant          constant    *same as 64      0x0000000000000000
1424                                    global*
1425      Generic           flat        flat     64      0x0000000000000000
1426      Region            N/A         GDS      32      *not implemented
1427                                                     for AMDHSA*
1428      ================= =========== ======== ======= ==================
1429
1430 The global and constant memory spaces both use global virtual addresses, which
1431 are the same virtual address space used by the CPU. However, some virtual
1432 addresses may only be accessible to the CPU, some only accessible by the GPU,
1433 and some by both.
1434
1435 Using the constant memory space indicates that the data will not change during
1436 the execution of the kernel. This allows scalar read instructions to be
1437 used. The vector and scalar L1 caches are invalidated of volatile data before
1438 each kernel dispatch execution to allow constant memory to change values between
1439 kernel dispatches.
1440
1441 The local memory space uses the hardware Local Data Store (LDS) which is
1442 automatically allocated when the hardware creates work-groups of wavefronts, and
1443 freed when all the wavefronts of a work-group have terminated. The data store
1444 (DS) instructions can be used to access it.
1445
1446 The private memory space uses the hardware scratch memory support. If the kernel
1447 uses scratch, then the hardware allocates memory that is accessed using
1448 wavefront lane dword (4 byte) interleaving. The mapping used from private
1449 address to physical address is:
1450
1451   ``wavefront-scratch-base +
1452   (private-address * wavefront-size * 4) +
1453   (wavefront-lane-id * 4)``
1454
1455 There are different ways that the wavefront scratch base address is determined
1456 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1457 memory can be accessed in an interleaved manner using buffer instruction with
1458 the scratch buffer descriptor and per wave scratch offset, by the scratch
1459 instructions, or by flat instructions. If each lane of a wavefront accesses the
1460 same private address, the interleaving results in adjacent dwords being accessed
1461 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1462 supported except by flat and scratch instructions in GFX9.
1463
1464 The generic address space uses the hardware flat address support available in
1465 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1466 local appertures), that are outside the range of addressible global memory, to
1467 map from a flat address to a private or local address.
1468
1469 FLAT instructions can take a flat address and access global, private (scratch)
1470 and group (LDS) memory depending in if the address is within one of the
1471 apperture ranges. Flat access to scratch requires hardware aperture setup and
1472 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1473 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1474 (see :ref:`amdgpu-amdhsa-m0`).
1475
1476 To convert between a segment address and a flat address the base address of the
1477 appertures address can be used. For GFX7-GFX8 these are available in the
1478 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1479 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1480 GFX9 the appature base addresses are directly available as inline constant
1481 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1482 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1483 which makes it easier to convert from flat to segment or segment to flat.
1484
1485 Image and Samplers
1486 ~~~~~~~~~~~~~~~~~~
1487
1488 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1489 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1490 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1491 enumeration values for the queries that are not trivially deducible from the S#
1492 representation.
1493
1494 HSA Signals
1495 ~~~~~~~~~~~
1496
1497 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1498 structure allocated in memory accessible from both the CPU and GPU. The
1499 structure is defined by the ROCm runtime and subject to change between releases
1500 (see [AMD-ROCm-github]_).
1501
1502 .. _amdgpu-amdhsa-hsa-aql-queue:
1503
1504 HSA AQL Queue
1505 ~~~~~~~~~~~~~
1506
1507 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1508 between releases (see [AMD-ROCm-github]_). For some processors it contains
1509 fields needed to implement certain language features such as the flat address
1510 aperture bases. It also contains fields used by CP such as managing the
1511 allocation of scratch memory.
1512
1513 .. _amdgpu-amdhsa-kernel-descriptor:
1514
1515 Kernel Descriptor
1516 ~~~~~~~~~~~~~~~~~
1517
1518 A kernel descriptor consists of the information needed by CP to initiate the
1519 execution of a kernel, including the entry point address of the machine code
1520 that implements the kernel.
1521
1522 Kernel Descriptor for GFX6-GFX9
1523 +++++++++++++++++++++++++++++++
1524
1525 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1526
1527   .. table:: Kernel Descriptor for GFX6-GFX9
1528      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1529
1530      ======= ======= =============================== ============================
1531      Bits    Size    Field Name                      Description
1532      ======= ======= =============================== ============================
1533      31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1534                                                      address space memory
1535                                                      required for a work-group
1536                                                      in bytes. This does not
1537                                                      include any dynamically
1538                                                      allocated local address
1539                                                      space memory that may be
1540                                                      added when the kernel is
1541                                                      dispatched.
1542      63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1543                                                      private address space
1544                                                      memory required for a
1545                                                      work-item in bytes. If
1546                                                      is_dynamic_callstack is 1
1547                                                      then additional space must
1548                                                      be added to this value for
1549                                                      the call stack.
1550      127:64  8 bytes                                 Reserved, must be 0.
1551      191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1552                                                      negative) from base
1553                                                      address of kernel
1554                                                      descriptor to kernel's
1555                                                      entry point instruction
1556                                                      which must be 256 byte
1557                                                      aligned.
1558      383:192 24                                      Reserved, must be 0.
1559              bytes
1560      415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1561                                                      program settings used by
1562                                                      CP to set up
1563                                                      ``COMPUTE_PGM_RSRC1``
1564                                                      configuration
1565                                                      register. See
1566                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1567      447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1568                                                      program settings used by
1569                                                      CP to set up
1570                                                      ``COMPUTE_PGM_RSRC2``
1571                                                      configuration
1572                                                      register. See
1573                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1574      448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1575                                                      SGPR user data registers
1576                                                      (see
1577                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1578
1579                                                      The total number of SGPR
1580                                                      user data registers
1581                                                      requested must not exceed
1582                                                      16 and match value in
1583                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1584                                                      Any requests beyond 16
1585                                                      will be ignored.
1586      449     1 bit   EnableSGPRDispatchPtr           *see above*
1587      450     1 bit   EnableSGPRQueuePtr              *see above*
1588      451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1589      452     1 bit   EnableSGPRDispatchID            *see above*
1590      453     1 bit   EnableSGPRFlatScratchInit       *see above*
1591      454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1592      455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1593                                                      should always be 0.
1594      456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1595                                                      should always be 0.
1596      457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1597                                                      should always be 0.
1598      463:458 6 bits                                  Reserved, must be 0.
1599      511:464 6                                       Reserved, must be 0.
1600              bytes
1601      512     **Total size 64 bytes.**
1602      ======= ====================================================================
1603
1604 ..
1605
1606   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1607      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1608
1609      ======= ======= =============================== ===========================================================================
1610      Bits    Size    Field Name                      Description
1611      ======= ======= =============================== ===========================================================================
1612      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1613                                                      used by each work-item,
1614                                                      granularity is device
1615                                                      specific:
1616
1617                                                      GFX6-GFX9
1618                                                        - max_vgpr 1..256
1619                                                        - roundup((max_vgpg + 1)
1620                                                          / 4) - 1
1621
1622                                                      Used by CP to set up
1623                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1624      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1625                                                      used by a wavefront,
1626                                                      granularity is device
1627                                                      specific:
1628
1629                                                      GFX6-GFX8
1630                                                        - max_sgpr 1..112
1631                                                        - roundup((max_sgpg + 1)
1632                                                          / 8) - 1
1633                                                      GFX9
1634                                                        - max_sgpr 1..112
1635                                                        - roundup((max_sgpg + 1)
1636                                                          / 16) - 1
1637
1638                                                      Includes the special SGPRs
1639                                                      for VCC, Flat Scratch (for
1640                                                      GFX7 onwards) and XNACK
1641                                                      (for GFX8 onwards). It does
1642                                                      not include the 16 SGPR
1643                                                      added if a trap handler is
1644                                                      enabled.
1645
1646                                                      Used by CP to set up
1647                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1648      11:10   2 bits  PRIORITY                        Must be 0.
1649
1650                                                      Start executing wavefront
1651                                                      at the specified priority.
1652
1653                                                      CP is responsible for
1654                                                      filling in
1655                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1656      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1657                                                      with specified rounding
1658                                                      mode for single (32
1659                                                      bit) floating point
1660                                                      precision floating point
1661                                                      operations.
1662
1663                                                      Floating point rounding
1664                                                      mode values are defined in
1665                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1666
1667                                                      Used by CP to set up
1668                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1669      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1670                                                      with specified rounding
1671                                                      denorm mode for half/double (16
1672                                                      and 64 bit) floating point
1673                                                      precision floating point
1674                                                      operations.
1675
1676                                                      Floating point rounding
1677                                                      mode values are defined in
1678                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1679
1680                                                      Used by CP to set up
1681                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1682      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1683                                                      with specified denorm mode
1684                                                      for single (32
1685                                                      bit)  floating point
1686                                                      precision floating point
1687                                                      operations.
1688
1689                                                      Floating point denorm mode
1690                                                      values are defined in
1691                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1692
1693                                                      Used by CP to set up
1694                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1695      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1696                                                      with specified denorm mode
1697                                                      for half/double (16
1698                                                      and 64 bit) floating point
1699                                                      precision floating point
1700                                                      operations.
1701
1702                                                      Floating point denorm mode
1703                                                      values are defined in
1704                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1705
1706                                                      Used by CP to set up
1707                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1708      20      1 bit   PRIV                            Must be 0.
1709
1710                                                      Start executing wavefront
1711                                                      in privilege trap handler
1712                                                      mode.
1713
1714                                                      CP is responsible for
1715                                                      filling in
1716                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1717      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1718                                                      with DX10 clamp mode
1719                                                      enabled. Used by the vector
1720                                                      ALU to force DX10 style
1721                                                      treatment of NaN's (when
1722                                                      set, clamp NaN to zero,
1723                                                      otherwise pass NaN
1724                                                      through).
1725
1726                                                      Used by CP to set up
1727                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1728      22      1 bit   DEBUG_MODE                      Must be 0.
1729
1730                                                      Start executing wavefront
1731                                                      in single step mode.
1732
1733                                                      CP is responsible for
1734                                                      filling in
1735                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1736      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1737                                                      with IEEE mode
1738                                                      enabled. Floating point
1739                                                      opcodes that support
1740                                                      exception flag gathering
1741                                                      will quiet and propagate
1742                                                      signaling-NaN inputs per
1743                                                      IEEE 754-2008. Min_dx10 and
1744                                                      max_dx10 become IEEE
1745                                                      754-2008 compliant due to
1746                                                      signaling-NaN propagation
1747                                                      and quieting.
1748
1749                                                      Used by CP to set up
1750                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1751      24      1 bit   BULKY                           Must be 0.
1752
1753                                                      Only one work-group allowed
1754                                                      to execute on a compute
1755                                                      unit.
1756
1757                                                      CP is responsible for
1758                                                      filling in
1759                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1760      25      1 bit   CDBG_USER                       Must be 0.
1761
1762                                                      Flag that can be used to
1763                                                      control debugging code.
1764
1765                                                      CP is responsible for
1766                                                      filling in
1767                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1768      26      1 bit   FP16_OVFL                       GFX6-GFX8
1769                                                        Reserved, must be 0.
1770                                                      GFX9
1771                                                        Wavefront starts execution
1772                                                        with specified fp16 overflow
1773                                                        mode.
1774
1775                                                        - If 0, fp16 overflow generates
1776                                                          +/-INF values.
1777                                                        - If 1, fp16 overflow that is the
1778                                                          result of an +/-INF input value
1779                                                          or divide by 0 produces a +/-INF,
1780                                                          otherwise clamps computed
1781                                                          overflow to +/-MAX_FP16 as
1782                                                          appropriate.
1783
1784                                                        Used by CP to set up
1785                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1786      31:27   5 bits                                  Reserved, must be 0.
1787      32      **Total size 4 bytes**
1788      ======= ===================================================================================================================
1789
1790 ..
1791
1792   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1793      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1794
1795      ======= ======= =============================== ===========================================================================
1796      Bits    Size    Field Name                      Description
1797      ======= ======= =============================== ===========================================================================
1798      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1799                      _WAVE_OFFSET                    SGPR wave scratch offset
1800                                                      system register (see
1801                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1802
1803                                                      Used by CP to set up
1804                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1805      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1806                                                      user data registers
1807                                                      requested. This number must
1808                                                      match the number of user
1809                                                      data registers enabled.
1810
1811                                                      Used by CP to set up
1812                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1813      6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1814                                                      TRAP instruction which
1815                                                      requires a trap handler to
1816                                                      be enabled.
1817
1818                                                      CP sets
1819                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1820                                                      if the runtime has
1821                                                      installed a trap handler
1822                                                      regardless of the setting
1823                                                      of this field.
1824      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1825                                                      system SGPR register for
1826                                                      the work-group id in the X
1827                                                      dimension (see
1828                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1829
1830                                                      Used by CP to set up
1831                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1832      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1833                                                      system SGPR register for
1834                                                      the work-group id in the Y
1835                                                      dimension (see
1836                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1837
1838                                                      Used by CP to set up
1839                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1840      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1841                                                      system SGPR register for
1842                                                      the work-group id in the Z
1843                                                      dimension (see
1844                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1845
1846                                                      Used by CP to set up
1847                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1848      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1849                                                      system SGPR register for
1850                                                      work-group information (see
1851                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1852
1853                                                      Used by CP to set up
1854                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1855      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1856                                                      VGPR system registers used
1857                                                      for the work-item ID.
1858                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1859                                                      defines the values.
1860
1861                                                      Used by CP to set up
1862                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1863      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1864
1865                                                      Wavefront starts execution
1866                                                      with address watch
1867                                                      exceptions enabled which
1868                                                      are generated when L1 has
1869                                                      witnessed a thread access
1870                                                      an *address of
1871                                                      interest*.
1872
1873                                                      CP is responsible for
1874                                                      filling in the address
1875                                                      watch bit in
1876                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1877                                                      according to what the
1878                                                      runtime requests.
1879      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1880
1881                                                      Wavefront starts execution
1882                                                      with memory violation
1883                                                      exceptions exceptions
1884                                                      enabled which are generated
1885                                                      when a memory violation has
1886                                                      occurred for this wave from
1887                                                      L1 or LDS
1888                                                      (write-to-read-only-memory,
1889                                                      mis-aligned atomic, LDS
1890                                                      address out of range,
1891                                                      illegal address, etc.).
1892
1893                                                      CP sets the memory
1894                                                      violation bit in
1895                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1896                                                      according to what the
1897                                                      runtime requests.
1898      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1899
1900                                                      CP uses the rounded value
1901                                                      from the dispatch packet,
1902                                                      not this value, as the
1903                                                      dispatch may contain
1904                                                      dynamically allocated group
1905                                                      segment memory. CP writes
1906                                                      directly to
1907                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1908
1909                                                      Amount of group segment
1910                                                      (LDS) to allocate for each
1911                                                      work-group. Granularity is
1912                                                      device specific:
1913
1914                                                      GFX6:
1915                                                        roundup(lds-size / (64 * 4))
1916                                                      GFX7-GFX9:
1917                                                        roundup(lds-size / (128 * 4))
1918
1919      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1920                      _INVALID_OPERATION              with specified exceptions
1921                                                      enabled.
1922
1923                                                      Used by CP to set up
1924                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1925                                                      (set from bits 0..6).
1926
1927                                                      IEEE 754 FP Invalid
1928                                                      Operation
1929      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
1930                      _SOURCE                         input operands is a
1931                                                      denormal number
1932      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1933                      _DIVISION_BY_ZERO               Zero
1934      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
1935                      _OVERFLOW
1936      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1937                      _UNDERFLOW
1938      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1939                      _INEXACT
1940      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1941                      _ZERO                           (rcp_iflag_f32 instruction
1942                                                      only)
1943      31      1 bit                                   Reserved, must be 0.
1944      32      **Total size 4 bytes.**
1945      ======= ===================================================================================================================
1946
1947 ..
1948
1949   .. table:: Floating Point Rounding Mode Enumeration Values
1950      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1951
1952      ====================================== ===== ==============================
1953      Enumeration Name                       Value Description
1954      ====================================== ===== ==============================
1955      AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
1956      AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
1957      AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
1958      AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
1959      ====================================== ===== ==============================
1960
1961 ..
1962
1963   .. table:: Floating Point Denorm Mode Enumeration Values
1964      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1965
1966      ====================================== ===== ==============================
1967      Enumeration Name                       Value Description
1968      ====================================== ===== ==============================
1969      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
1970                                                   Denorms
1971      AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
1972      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
1973      AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
1974      ====================================== ===== ==============================
1975
1976 ..
1977
1978   .. table:: System VGPR Work-Item ID Enumeration Values
1979      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1980
1981      ======================================== ===== ============================
1982      Enumeration Name                         Value Description
1983      ======================================== ===== ============================
1984      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
1985                                                     ID.
1986      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1987                                                     dimensions ID.
1988      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1989                                                     dimensions ID.
1990      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1991      ======================================== ===== ============================
1992
1993 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1994
1995 Initial Kernel Execution State
1996 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1997
1998 This section defines the register state that will be set up by the packet
1999 processor prior to the start of execution of every wavefront. This is limited by
2000 the constraints of the hardware controllers of CP/ADC/SPI.
2001
2002 The order of the SGPR registers is defined, but the compiler can specify which
2003 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2004 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2005 for enabled registers are dense starting at SGPR0: the first enabled register is
2006 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2007 an SGPR number.
2008
2009 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2010 all waves of the grid. It is possible to specify more than 16 User SGPRs using
2011 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2012 initialized. These are then immediately followed by the System SGPRs that are
2013 set up by ADC/SPI and can have different values for each wave of the grid
2014 dispatch.
2015
2016 SGPR register initial state is defined in
2017 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2018
2019   .. table:: SGPR Register Set Up Order
2020      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2021
2022      ========== ========================== ====== ==============================
2023      SGPR Order Name                       Number Description
2024                 (kernel descriptor enable  of
2025                 field)                     SGPRs
2026      ========== ========================== ====== ==============================
2027      First      Private Segment Buffer     4      V# that can be used, together
2028                 (enable_sgpr_private              with Scratch Wave Offset as an
2029                 _segment_buffer)                  offset, to access the private
2030                                                   memory space using a segment
2031                                                   address.
2032
2033                                                   CP uses the value provided by
2034                                                   the runtime.
2035      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2036                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2037                                                   actually executing.
2038      then       Queue Ptr                  2      64 bit address of amd_queue_t
2039                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2040                                                   the dispatch packet was
2041                                                   queued.
2042      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2043                 (enable_sgpr_kernarg              segment. This is directly
2044                 _segment_ptr)                     copied from the
2045                                                   kernarg_address in the kernel
2046                                                   dispatch packet.
2047
2048                                                   Having CP load it once avoids
2049                                                   loading it at the beginning of
2050                                                   every wavefront.
2051      then       Dispatch Id                2      64 bit Dispatch ID of the
2052                 (enable_sgpr_dispatch_id)         dispatch packet being
2053                                                   executed.
2054      then       Flat Scratch Init          2      This is 2 SGPRs:
2055                 (enable_sgpr_flat_scratch
2056                 _init)                            GFX6
2057                                                     Not supported.
2058                                                   GFX7-GFX8
2059                                                     The first SGPR is a 32 bit
2060                                                     byte offset from
2061                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2062                                                     to per SPI base of memory
2063                                                     for scratch for the queue
2064                                                     executing the kernel
2065                                                     dispatch. CP obtains this
2066                                                     from the runtime. (The
2067                                                     Scratch Segment Buffer base
2068                                                     address is
2069                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2070                                                     plus this offset.) The value
2071                                                     of Scratch Wave Offset must
2072                                                     be added to this offset by
2073                                                     the kernel machine code,
2074                                                     right shifted by 8, and
2075                                                     moved to the FLAT_SCRATCH_HI
2076                                                     SGPR register.
2077                                                     FLAT_SCRATCH_HI corresponds
2078                                                     to SGPRn-4 on GFX7, and
2079                                                     SGPRn-6 on GFX8 (where SGPRn
2080                                                     is the highest numbered SGPR
2081                                                     allocated to the wave).
2082                                                     FLAT_SCRATCH_HI is
2083                                                     multiplied by 256 (as it is
2084                                                     in units of 256 bytes) and
2085                                                     added to
2086                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2087                                                     to calculate the per wave
2088                                                     FLAT SCRATCH BASE in flat
2089                                                     memory instructions that
2090                                                     access the scratch
2091                                                     apperture.
2092
2093                                                     The second SGPR is 32 bit
2094                                                     byte size of a single
2095                                                     work-item's scratch memory
2096                                                     usage. CP obtains this from
2097                                                     the runtime, and it is
2098                                                     always a multiple of DWORD.
2099                                                     CP checks that the value in
2100                                                     the kernel dispatch packet
2101                                                     Private Segment Byte Size is
2102                                                     not larger, and requests the
2103                                                     runtime to increase the
2104                                                     queue's scratch size if
2105                                                     necessary. The kernel code
2106                                                     must move it to
2107                                                     FLAT_SCRATCH_LO which is
2108                                                     SGPRn-3 on GFX7 and SGPRn-5
2109                                                     on GFX8. FLAT_SCRATCH_LO is
2110                                                     used as the FLAT SCRATCH
2111                                                     SIZE in flat memory
2112                                                     instructions. Having CP load
2113                                                     it once avoids loading it at
2114                                                     the beginning of every
2115                                                     wavefront.
2116                                                   GFX9
2117                                                     This is the
2118                                                     64 bit base address of the
2119                                                     per SPI scratch backing
2120                                                     memory managed by SPI for
2121                                                     the queue executing the
2122                                                     kernel dispatch. CP obtains
2123                                                     this from the runtime (and
2124                                                     divides it if there are
2125                                                     multiple Shader Arrays each
2126                                                     with its own SPI). The value
2127                                                     of Scratch Wave Offset must
2128                                                     be added by the kernel
2129                                                     machine code and the result
2130                                                     moved to the FLAT_SCRATCH
2131                                                     SGPR which is SGPRn-6 and
2132                                                     SGPRn-5. It is used as the
2133                                                     FLAT SCRATCH BASE in flat
2134                                                     memory instructions.
2135      then       Private Segment Size       1      The 32 bit byte size of a
2136                                                   (enable_sgpr_private single
2137                                                   work-item's
2138                                                   scratch_segment_size) memory
2139                                                   allocation. This is the
2140                                                   value from the kernel
2141                                                   dispatch packet Private
2142                                                   Segment Byte Size rounded up
2143                                                   by CP to a multiple of
2144                                                   DWORD.
2145
2146                                                   Having CP load it once avoids
2147                                                   loading it at the beginning of
2148                                                   every wavefront.
2149
2150                                                   This is not used for
2151                                                   GFX7-GFX8 since it is the same
2152                                                   value as the second SGPR of
2153                                                   Flat Scratch Init. However, it
2154                                                   may be needed for GFX9 which
2155                                                   changes the meaning of the
2156                                                   Flat Scratch Init value.
2157      then       Grid Work-Group Count X    1      32 bit count of the number of
2158                 (enable_sgpr_grid                 work-groups in the X dimension
2159                 _workgroup_count_X)               for the grid being
2160                                                   executed. Computed from the
2161                                                   fields in the kernel dispatch
2162                                                   packet as ((grid_size.x +
2163                                                   workgroup_size.x - 1) /
2164                                                   workgroup_size.x).
2165      then       Grid Work-Group Count Y    1      32 bit count of the number of
2166                 (enable_sgpr_grid                 work-groups in the Y dimension
2167                 _workgroup_count_Y &&             for the grid being
2168                 less than 16 previous             executed. Computed from the
2169                 SGPRs)                            fields in the kernel dispatch
2170                                                   packet as ((grid_size.y +
2171                                                   workgroup_size.y - 1) /
2172                                                   workgroupSize.y).
2173
2174                                                   Only initialized if <16
2175                                                   previous SGPRs initialized.
2176      then       Grid Work-Group Count Z    1      32 bit count of the number of
2177                 (enable_sgpr_grid                 work-groups in the Z dimension
2178                 _workgroup_count_Z &&             for the grid being
2179                 less than 16 previous             executed. Computed from the
2180                 SGPRs)                            fields in the kernel dispatch
2181                                                   packet as ((grid_size.z +
2182                                                   workgroup_size.z - 1) /
2183                                                   workgroupSize.z).
2184
2185                                                   Only initialized if <16
2186                                                   previous SGPRs initialized.
2187      then       Work-Group Id X            1      32 bit work-group id in X
2188                 (enable_sgpr_workgroup_id         dimension of grid for
2189                 _X)                               wavefront.
2190      then       Work-Group Id Y            1      32 bit work-group id in Y
2191                 (enable_sgpr_workgroup_id         dimension of grid for
2192                 _Y)                               wavefront.
2193      then       Work-Group Id Z            1      32 bit work-group id in Z
2194                 (enable_sgpr_workgroup_id         dimension of grid for
2195                 _Z)                               wavefront.
2196      then       Work-Group Info            1      {first_wave, 14'b0000,
2197                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2198                 _info)                            threadgroup_size_in_waves[5:0]}
2199      then       Scratch Wave Offset        1      32 bit byte offset from base
2200                 (enable_sgpr_private              of scratch base of queue
2201                 _segment_wave_offset)             executing the kernel
2202                                                   dispatch. Must be used as an
2203                                                   offset with Private
2204                                                   segment address when using
2205                                                   Scratch Segment Buffer. It
2206                                                   must be used to set up FLAT
2207                                                   SCRATCH for flat addressing
2208                                                   (see
2209                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2210      ========== ========================== ====== ==============================
2211
2212 The order of the VGPR registers is defined, but the compiler can specify which
2213 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2214 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2215 for enabled registers are dense starting at VGPR0: the first enabled register is
2216 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2217 VGPR number.
2218
2219 VGPR register initial state is defined in
2220 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2221
2222   .. table:: VGPR Register Set Up Order
2223      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2224
2225      ========== ========================== ====== ==============================
2226      VGPR Order Name                       Number Description
2227                 (kernel descriptor enable  of
2228                 field)                     VGPRs
2229      ========== ========================== ====== ==============================
2230      First      Work-Item Id X             1      32 bit work item id in X
2231                 (Always initialized)              dimension of work-group for
2232                                                   wavefront lane.
2233      then       Work-Item Id Y             1      32 bit work item id in Y
2234                 (enable_vgpr_workitem_id          dimension of work-group for
2235                 > 0)                              wavefront lane.
2236      then       Work-Item Id Z             1      32 bit work item id in Z
2237                 (enable_vgpr_workitem_id          dimension of work-group for
2238                 > 1)                              wavefront lane.
2239      ========== ========================== ====== ==============================
2240
2241 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2242
2243 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2244    registers.
2245 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2246    combination including none.
2247 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2248    cannot included with the flat scratch init value which is per queue.
2249 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2250    or (X, Y, Z).
2251
2252 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2253 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2254
2255 The global segment can be accessed either using buffer instructions (GFX6 which
2256 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2257 instructions (GFX9).
2258
2259 If buffer operations are used then the compiler can generate a V# with the
2260 following properties:
2261
2262 * base address of 0
2263 * no swizzle
2264 * ATC: 1 if IOMMU present (such as APU)
2265 * ptr64: 1
2266 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2267   APU and NC for dGPU).
2268
2269 .. _amdgpu-amdhsa-kernel-prolog:
2270
2271 Kernel Prolog
2272 ~~~~~~~~~~~~~
2273
2274 .. _amdgpu-amdhsa-m0:
2275
2276 M0
2277 ++
2278
2279 GFX6-GFX8
2280   The M0 register must be initialized with a value at least the total LDS size
2281   if the kernel may access LDS via DS or flat operations. Total LDS size is
2282   available in dispatch packet. For M0, it is also possible to use maximum
2283   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2284   GFX7-GFX8).
2285 GFX9
2286   The M0 register is not used for range checking LDS accesses and so does not
2287   need to be initialized in the prolog.
2288
2289 .. _amdgpu-amdhsa-flat-scratch:
2290
2291 Flat Scratch
2292 ++++++++++++
2293
2294 If the kernel may use flat operations to access scratch memory, the prolog code
2295 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2296 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2297 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2298
2299 GFX6
2300   Flat scratch is not supported.
2301
2302 GFX7-GFX8
2303   1. The low word of Flat Scratch Init is 32 bit byte offset from
2304      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2305      being managed by SPI for the queue executing the kernel dispatch. This is
2306      the same value used in the Scratch Segment Buffer V# base address. The
2307      prolog must add the value of Scratch Wave Offset to get the wave's byte
2308      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2309      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2310      by 8 before moving into FLAT_SCRATCH_LO.
2311   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2312      work-items scratch memory usage. This is directly loaded from the kernel
2313      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2314      DWORD. Having CP load it once avoids loading it at the beginning of every
2315      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2316      SIZE.
2317
2318 GFX9
2319   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2320   memory being managed by SPI for the queue executing the kernel dispatch. The
2321   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2322   pair for use as the flat scratch base in flat memory instructions.
2323
2324 .. _amdgpu-amdhsa-memory-model:
2325
2326 Memory Model
2327 ~~~~~~~~~~~~
2328
2329 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2330 (see :ref:`memmodel`). *The implementation is WIP.*
2331
2332 .. TODO
2333    Update when implementation complete.
2334
2335 The AMDGPU backend supports the memory synchronization scopes specified in
2336 :ref:`amdgpu-memory-scopes`.
2337
2338 The code sequences used to implement the memory model are defined in table
2339 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2340
2341 The sequences specify the order of instructions that a single thread must
2342 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2343 to other memory instructions executed by the same thread. This allows them to be
2344 moved earlier or later which can allow them to be combined with other instances
2345 of the same instruction, or hoisted/sunk out of loops to improve
2346 performance. Only the instructions related to the memory model are given;
2347 additional ``s_waitcnt`` instructions are required to ensure registers are
2348 defined before being used. These may be able to be combined with the memory
2349 model ``s_waitcnt`` instructions as described above.
2350
2351 The AMDGPU backend supports the following memory models:
2352
2353   HSA Memory Model [HSA]_
2354     The HSA memory model uses a single happens-before relation for all address
2355     spaces (see :ref:`amdgpu-address-spaces`).
2356   OpenCL Memory Model [OpenCL]_
2357     The OpenCL memory model which has separate happens-before relations for the
2358     global and local address spaces. Only a fence specifying both global and
2359     local address space, and seq_cst instructions join the relationships. Since
2360     the LLVM ``memfence`` instruction does not allow an address space to be
2361     specified the OpenCL fence has to convervatively assume both local and
2362     global address space was specified. However, optimizations can often be
2363     done to eliminate the additional ``s_waitcnt`` instructions when there are
2364     no intervening memory instructions which access the corresponding address
2365     space. The code sequences in the table indicate what can be omitted for the
2366     OpenCL memory. The target triple environment is used to determine if the
2367     source language is OpenCL (see :ref:`amdgpu-opencl`).
2368
2369 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2370 operations.
2371
2372 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2373 termed vector memory operations.
2374
2375 For GFX6-GFX9:
2376
2377 * Each agent has multiple compute units (CU).
2378 * Each CU has multiple SIMDs that execute wavefronts.
2379 * The wavefronts for a single work-group are executed in the same CU but may be
2380   executed by different SIMDs.
2381 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2382   executing on it.
2383 * All LDS operations of a CU are performed as wavefront wide operations in a
2384   global order and involve no caching. Completion is reported to a wavefront in
2385   execution order.
2386 * The LDS memory has multiple request queues shared by the SIMDs of a
2387   CU. Therefore, the LDS operations performed by different waves of a work-group
2388   can be reordered relative to each other, which can result in reordering the
2389   visibility of vector memory operations with respect to LDS operations of other
2390   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2391   ensure synchronization between LDS operations and vector memory operations
2392   between waves of a work-group, but not between operations performed by the
2393   same wavefront.
2394 * The vector memory operations are performed as wavefront wide operations and
2395   completion is reported to a wavefront in execution order. The exception is
2396   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2397   vector memory order if they access LDS memory, and out of LDS operation order
2398   if they access global memory.
2399 * The vector memory operations access a single vector L1 cache shared by all
2400   SIMDs a CU. Therefore, no special action is required for coherence between the
2401   lanes of a single wavefront, or for coherence between wavefronts in the same
2402   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2403   executing in different work-groups as they may be executing on different CUs.
2404 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2405   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2406   scalar operations are used in a restricted way so do not impact the memory
2407   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2408 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2409   the same agent.
2410 * The L2 cache has independent channels to service disjoint ranges of virtual
2411   addresses.
2412 * Each CU has a separate request queue per channel. Therefore, the vector and
2413   scalar memory operations performed by waves executing in different work-groups
2414   (which may be executing on different CUs) of an agent can be reordered
2415   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2416   synchronization between vector memory operations of different CUs. It ensures a
2417   previous vector memory operation has completed before executing a subsequent
2418   vector memory or LDS operation and so can be used to meet the requirements of
2419   acquire and release.
2420 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2421   of virtual addresses can be set up to bypass it to ensure system coherence.
2422
2423 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2424 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2425 memory, atomic memory orderings are not meaningful and all accesses are treated
2426 as non-atomic.
2427
2428 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2429 scalar memory instructions). Since the constant address space contents do not
2430 change during the execution of a kernel dispatch it is not legal to perform
2431 stores, and atomic memory orderings are not meaningful and all access are
2432 treated as non-atomic.
2433
2434 A memory synchronization scope wider than work-group is not meaningful for the
2435 group (LDS) address space and is treated as work-group.
2436
2437 The memory model does not support the region address space which is treated as
2438 non-atomic.
2439
2440 Acquire memory ordering is not meaningful on store atomic instructions and is
2441 treated as non-atomic.
2442
2443 Release memory ordering is not meaningful on load atomic instructions and is
2444 treated a non-atomic.
2445
2446 Acquire-release memory ordering is not meaningful on load or store atomic
2447 instructions and is treated as acquire and release respectively.
2448
2449 AMDGPU backend only uses scalar memory operations to access memory that is
2450 proven to not change during the execution of the kernel dispatch. This includes
2451 constant address space and global address space for program scope const
2452 variables. Therefore the kernel machine code does not have to maintain the
2453 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2454 and vector L1 caches are invalidated between kernel dispatches by CP since
2455 constant address space data may change between kernel dispatch executions. See
2456 :ref:`amdgpu-amdhsa-memory-spaces`.
2457
2458 The one execption is if scalar writes are used to spill SGPR registers. In this
2459 case the AMDGPU backend ensures the memory location used to spill is never
2460 accessed by vector memory operations at the same time. If scalar writes are used
2461 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2462 return since the locations may be used for vector memory instructions by a
2463 future wave that uses the same scratch area, or a function call that creates a
2464 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2465 as all scalar writes are write-before-read in the same thread.
2466
2467 Scratch backing memory (which is used for the private address space)
2468 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2469 address space is only accessed by a single thread, and is always
2470 write-before-read, there is never a need to invalidate these entries from the L1
2471 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2472 volatile cache lines.
2473
2474 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2475 to invalidate the L2 cache. This also causes it to be treated as
2476 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2477 (cache coherent) and so the L2 cache will coherent with the CPU and other
2478 agents.
2479
2480   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2481      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2482
2483      ============ ============ ============== ========== ===============================
2484      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2485                   Ordering     Sync Scope     Address
2486                                               Space
2487      ============ ============ ============== ========== ===============================
2488      **Non-Atomic**
2489      -----------------------------------------------------------------------------------
2490      load         *none*       *none*         - global   - !volatile & !nontemporal
2491                                               - generic
2492                                               - private    1. buffer/global/flat_load
2493                                               - constant
2494                                                          - volatile & !nontemporal
2495
2496                                                            1. buffer/global/flat_load
2497                                                               glc=1
2498
2499                                                          - nontemporal
2500
2501                                                            1. buffer/global/flat_load
2502                                                               glc=1 slc=1
2503
2504      load         *none*       *none*         - local    1. ds_load
2505      store        *none*       *none*         - global   - !nontemporal
2506                                               - generic
2507                                               - private    1. buffer/global/flat_store
2508                                               - constant
2509                                                          - nontemporal
2510
2511                                                            1. buffer/global/flat_stote
2512                                                               glc=1 slc=1
2513
2514      store        *none*       *none*         - local    1. ds_store
2515      **Unordered Atomic**
2516      -----------------------------------------------------------------------------------
2517      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2518      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2519      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2520                                                          atomic*.
2521      **Monotonic Atomic**
2522      -----------------------------------------------------------------------------------
2523      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2524                                - wavefront    - generic
2525                                - workgroup
2526      load atomic  monotonic    - singlethread - local    1. ds_load
2527                                - wavefront
2528                                - workgroup
2529      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2530                                - system       - generic     glc=1
2531      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2532                                - wavefront    - generic
2533                                - workgroup
2534                                - agent
2535                                - system
2536      store atomic monotonic    - singlethread - local    1. ds_store
2537                                - wavefront
2538                                - workgroup
2539      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2540                                - wavefront    - generic
2541                                - workgroup
2542                                - agent
2543                                - system
2544      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2545                                - wavefront
2546                                - workgroup
2547      **Acquire Atomic**
2548      -----------------------------------------------------------------------------------
2549      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2550                                - wavefront    - local
2551                                               - generic
2552      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2553      load atomic  acquire      - workgroup    - local    1. ds_load
2554                                                          2. s_waitcnt lgkmcnt(0)
2555
2556                                                            - If OpenCL, omit.
2557                                                            - Must happen before
2558                                                              any following
2559                                                              global/generic
2560                                                              load/load
2561                                                              atomic/store/store
2562                                                              atomic/atomicrmw.
2563                                                            - Ensures any
2564                                                              following global
2565                                                              data read is no
2566                                                              older than the load
2567                                                              atomic value being
2568                                                              acquired.
2569      load atomic  acquire      - workgroup    - generic  1. flat_load
2570                                                          2. s_waitcnt lgkmcnt(0)
2571
2572                                                            - If OpenCL, omit.
2573                                                            - Must happen before
2574                                                              any following
2575                                                              global/generic
2576                                                              load/load
2577                                                              atomic/store/store
2578                                                              atomic/atomicrmw.
2579                                                            - Ensures any
2580                                                              following global
2581                                                              data read is no
2582                                                              older than the load
2583                                                              atomic value being
2584                                                              acquired.
2585      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2586                                - system                     glc=1
2587                                                          2. s_waitcnt vmcnt(0)
2588
2589                                                            - Must happen before
2590                                                              following
2591                                                              buffer_wbinvl1_vol.
2592                                                            - Ensures the load
2593                                                              has completed
2594                                                              before invalidating
2595                                                              the cache.
2596
2597                                                          3. buffer_wbinvl1_vol
2598
2599                                                            - Must happen before
2600                                                              any following
2601                                                              global/generic
2602                                                              load/load
2603                                                              atomic/atomicrmw.
2604                                                            - Ensures that
2605                                                              following
2606                                                              loads will not see
2607                                                              stale global data.
2608
2609      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2610                                - system                  2. s_waitcnt vmcnt(0) &
2611                                                             lgkmcnt(0)
2612
2613                                                            - If OpenCL omit
2614                                                              lgkmcnt(0).
2615                                                            - Must happen before
2616                                                              following
2617                                                              buffer_wbinvl1_vol.
2618                                                            - Ensures the flat_load
2619                                                              has completed
2620                                                              before invalidating
2621                                                              the cache.
2622
2623                                                          3. buffer_wbinvl1_vol
2624
2625                                                            - Must happen before
2626                                                              any following
2627                                                              global/generic
2628                                                              load/load
2629                                                              atomic/atomicrmw.
2630                                                            - Ensures that
2631                                                              following loads
2632                                                              will not see stale
2633                                                              global data.
2634
2635      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2636                                - wavefront    - local
2637                                               - generic
2638      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2639      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2640                                                          2. waitcnt lgkmcnt(0)
2641
2642                                                            - If OpenCL, omit.
2643                                                            - Must happen before
2644                                                              any following
2645                                                              global/generic
2646                                                              load/load
2647                                                              atomic/store/store
2648                                                              atomic/atomicrmw.
2649                                                            - Ensures any
2650                                                              following global
2651                                                              data read is no
2652                                                              older than the
2653                                                              atomicrmw value
2654                                                              being acquired.
2655
2656      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2657                                                          2. waitcnt lgkmcnt(0)
2658
2659                                                            - If OpenCL, omit.
2660                                                            - Must happen before
2661                                                              any following
2662                                                              global/generic
2663                                                              load/load
2664                                                              atomic/store/store
2665                                                              atomic/atomicrmw.
2666                                                            - Ensures any
2667                                                              following global
2668                                                              data read is no
2669                                                              older than the
2670                                                              atomicrmw value
2671                                                              being acquired.
2672
2673      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2674                                - system                  2. s_waitcnt vmcnt(0)
2675
2676                                                            - Must happen before
2677                                                              following
2678                                                              buffer_wbinvl1_vol.
2679                                                            - Ensures the
2680                                                              atomicrmw has
2681                                                              completed before
2682                                                              invalidating the
2683                                                              cache.
2684
2685                                                          3. buffer_wbinvl1_vol
2686
2687                                                            - Must happen before
2688                                                              any following
2689                                                              global/generic
2690                                                              load/load
2691                                                              atomic/atomicrmw.
2692                                                            - Ensures that
2693                                                              following loads
2694                                                              will not see stale
2695                                                              global data.
2696
2697      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2698                                - system                  2. s_waitcnt vmcnt(0) &
2699                                                             lgkmcnt(0)
2700
2701                                                            - If OpenCL, omit
2702                                                              lgkmcnt(0).
2703                                                            - Must happen before
2704                                                              following
2705                                                              buffer_wbinvl1_vol.
2706                                                            - Ensures the
2707                                                              atomicrmw has
2708                                                              completed before
2709                                                              invalidating the
2710                                                              cache.
2711
2712                                                          3. buffer_wbinvl1_vol
2713
2714                                                            - Must happen before
2715                                                              any following
2716                                                              global/generic
2717                                                              load/load
2718                                                              atomic/atomicrmw.
2719                                                            - Ensures that
2720                                                              following loads
2721                                                              will not see stale
2722                                                              global data.
2723
2724      fence        acquire      - singlethread *none*     *none*
2725                                - wavefront
2726      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2727
2728                                                            - If OpenCL and
2729                                                              address space is
2730                                                              not generic, omit.
2731                                                            - However, since LLVM
2732                                                              currently has no
2733                                                              address space on
2734                                                              the fence need to
2735                                                              conservatively
2736                                                              always generate. If
2737                                                              fence had an
2738                                                              address space then
2739                                                              set to address
2740                                                              space of OpenCL
2741                                                              fence flag, or to
2742                                                              generic if both
2743                                                              local and global
2744                                                              flags are
2745                                                              specified.
2746                                                            - Must happen after
2747                                                              any preceding
2748                                                              local/generic load
2749                                                              atomic/atomicrmw
2750                                                              with an equal or
2751                                                              wider sync scope
2752                                                              and memory ordering
2753                                                              stronger than
2754                                                              unordered (this is
2755                                                              termed the
2756                                                              fence-paired-atomic).
2757                                                            - Must happen before
2758                                                              any following
2759                                                              global/generic
2760                                                              load/load
2761                                                              atomic/store/store
2762                                                              atomic/atomicrmw.
2763                                                            - Ensures any
2764                                                              following global
2765                                                              data read is no
2766                                                              older than the
2767                                                              value read by the
2768                                                              fence-paired-atomic.
2769
2770      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2771                                - system                     vmcnt(0)
2772
2773                                                            - If OpenCL and
2774                                                              address space is
2775                                                              not generic, omit
2776                                                              lgkmcnt(0).
2777                                                            - However, since LLVM
2778                                                              currently has no
2779                                                              address space on
2780                                                              the fence need to
2781                                                              conservatively
2782                                                              always generate
2783                                                              (see comment for
2784                                                              previous fence).
2785                                                            - Could be split into
2786                                                              separate s_waitcnt
2787                                                              vmcnt(0) and
2788                                                              s_waitcnt
2789                                                              lgkmcnt(0) to allow
2790                                                              them to be
2791                                                              independently moved
2792                                                              according to the
2793                                                              following rules.
2794                                                            - s_waitcnt vmcnt(0)
2795                                                              must happen after
2796                                                              any preceding
2797                                                              global/generic load
2798                                                              atomic/atomicrmw
2799                                                              with an equal or
2800                                                              wider sync scope
2801                                                              and memory ordering
2802                                                              stronger than
2803                                                              unordered (this is
2804                                                              termed the
2805                                                              fence-paired-atomic).
2806                                                            - s_waitcnt lgkmcnt(0)
2807                                                              must happen after
2808                                                              any preceding
2809                                                              local/generic load
2810                                                              atomic/atomicrmw
2811                                                              with an equal or
2812                                                              wider sync scope
2813                                                              and memory ordering
2814                                                              stronger than
2815                                                              unordered (this is
2816                                                              termed the
2817                                                              fence-paired-atomic).
2818                                                            - Must happen before
2819                                                              the following
2820                                                              buffer_wbinvl1_vol.
2821                                                            - Ensures that the
2822                                                              fence-paired atomic
2823                                                              has completed
2824                                                              before invalidating
2825                                                              the
2826                                                              cache. Therefore
2827                                                              any following
2828                                                              locations read must
2829                                                              be no older than
2830                                                              the value read by
2831                                                              the
2832                                                              fence-paired-atomic.
2833
2834                                                          2. buffer_wbinvl1_vol
2835
2836                                                            - Must happen before any
2837                                                              following global/generic
2838                                                              load/load
2839                                                              atomic/store/store
2840                                                              atomic/atomicrmw.
2841                                                            - Ensures that
2842                                                              following loads
2843                                                              will not see stale
2844                                                              global data.
2845
2846      **Release Atomic**
2847      -----------------------------------------------------------------------------------
2848      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2849                                - wavefront    - local
2850                                               - generic
2851      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2852
2853                                                            - If OpenCL, omit.
2854                                                            - Must happen after
2855                                                              any preceding
2856                                                              local/generic
2857                                                              load/store/load
2858                                                              atomic/store
2859                                                              atomic/atomicrmw.
2860                                                            - Must happen before
2861                                                              the following
2862                                                              store.
2863                                                            - Ensures that all
2864                                                              memory operations
2865                                                              to local have
2866                                                              completed before
2867                                                              performing the
2868                                                              store that is being
2869                                                              released.
2870
2871                                                          2. buffer/global/flat_store
2872      store atomic release      - workgroup    - local    1. ds_store
2873      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2874
2875                                                            - If OpenCL, omit.
2876                                                            - Must happen after
2877                                                              any preceding
2878                                                              local/generic
2879                                                              load/store/load
2880                                                              atomic/store
2881                                                              atomic/atomicrmw.
2882                                                            - Must happen before
2883                                                              the following
2884                                                              store.
2885                                                            - Ensures that all
2886                                                              memory operations
2887                                                              to local have
2888                                                              completed before
2889                                                              performing the
2890                                                              store that is being
2891                                                              released.
2892
2893                                                          2. flat_store
2894      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2895                                - system       - generic     vmcnt(0)
2896
2897                                                            - If OpenCL, omit
2898                                                              lgkmcnt(0).
2899                                                            - Could be split into
2900                                                              separate s_waitcnt
2901                                                              vmcnt(0) and
2902                                                              s_waitcnt
2903                                                              lgkmcnt(0) to allow
2904                                                              them to be
2905                                                              independently moved
2906                                                              according to the
2907                                                              following rules.
2908                                                            - s_waitcnt vmcnt(0)
2909                                                              must happen after
2910                                                              any preceding
2911                                                              global/generic
2912                                                              load/store/load
2913                                                              atomic/store
2914                                                              atomic/atomicrmw.
2915                                                            - s_waitcnt lgkmcnt(0)
2916                                                              must happen after
2917                                                              any preceding
2918                                                              local/generic
2919                                                              load/store/load
2920                                                              atomic/store
2921                                                              atomic/atomicrmw.
2922                                                            - Must happen before
2923                                                              the following
2924                                                              store.
2925                                                            - Ensures that all
2926                                                              memory operations
2927                                                              to memory have
2928                                                              completed before
2929                                                              performing the
2930                                                              store that is being
2931                                                              released.
2932
2933                                                          2. buffer/global/ds/flat_store
2934      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2935                                - wavefront    - local
2936                                               - generic
2937      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2938
2939                                                            - If OpenCL, omit.
2940                                                            - Must happen after
2941                                                              any preceding
2942                                                              local/generic
2943                                                              load/store/load
2944                                                              atomic/store
2945                                                              atomic/atomicrmw.
2946                                                            - Must happen before
2947                                                              the following
2948                                                              atomicrmw.
2949                                                            - Ensures that all
2950                                                              memory operations
2951                                                              to local have
2952                                                              completed before
2953                                                              performing the
2954                                                              atomicrmw that is
2955                                                              being released.
2956
2957                                                          2. buffer/global/flat_atomic
2958      atomicrmw    release      - workgroup    - local    1. ds_atomic
2959      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2960
2961                                                            - If OpenCL, omit.
2962                                                            - Must happen after
2963                                                              any preceding
2964                                                              local/generic
2965                                                              load/store/load
2966                                                              atomic/store
2967                                                              atomic/atomicrmw.
2968                                                            - Must happen before
2969                                                              the following
2970                                                              atomicrmw.
2971                                                            - Ensures that all
2972                                                              memory operations
2973                                                              to local have
2974                                                              completed before
2975                                                              performing the
2976                                                              atomicrmw that is
2977                                                              being released.
2978
2979                                                          2. flat_atomic
2980      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2981                                - system       - generic     vmcnt(0)
2982
2983                                                            - If OpenCL, omit
2984                                                              lgkmcnt(0).
2985                                                            - Could be split into
2986                                                              separate s_waitcnt
2987                                                              vmcnt(0) and
2988                                                              s_waitcnt
2989                                                              lgkmcnt(0) to allow
2990                                                              them to be
2991                                                              independently moved
2992                                                              according to the
2993                                                              following rules.
2994                                                            - s_waitcnt vmcnt(0)
2995                                                              must happen after
2996                                                              any preceding
2997                                                              global/generic
2998                                                              load/store/load
2999                                                              atomic/store
3000                                                              atomic/atomicrmw.
3001                                                            - s_waitcnt lgkmcnt(0)
3002                                                              must happen after
3003                                                              any preceding
3004                                                              local/generic
3005                                                              load/store/load
3006                                                              atomic/store
3007                                                              atomic/atomicrmw.
3008                                                            - Must happen before
3009                                                              the following
3010                                                              atomicrmw.
3011                                                            - Ensures that all
3012                                                              memory operations
3013                                                              to global and local
3014                                                              have completed
3015                                                              before performing
3016                                                              the atomicrmw that
3017                                                              is being released.
3018
3019                                                          2. buffer/global/ds/flat_atomic
3020      fence        release      - singlethread *none*     *none*
3021                                - wavefront
3022      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3023
3024                                                            - If OpenCL and
3025                                                              address space is
3026                                                              not generic, omit.
3027                                                            - However, since LLVM
3028                                                              currently has no
3029                                                              address space on
3030                                                              the fence need to
3031                                                              conservatively
3032                                                              always generate. If
3033                                                              fence had an
3034                                                              address space then
3035                                                              set to address
3036                                                              space of OpenCL
3037                                                              fence flag, or to
3038                                                              generic if both
3039                                                              local and global
3040                                                              flags are
3041                                                              specified.
3042                                                            - Must happen after
3043                                                              any preceding
3044                                                              local/generic
3045                                                              load/load
3046                                                              atomic/store/store
3047                                                              atomic/atomicrmw.
3048                                                            - Must happen before
3049                                                              any following store
3050                                                              atomic/atomicrmw
3051                                                              with an equal or
3052                                                              wider sync scope
3053                                                              and memory ordering
3054                                                              stronger than
3055                                                              unordered (this is
3056                                                              termed the
3057                                                              fence-paired-atomic).
3058                                                            - Ensures that all
3059                                                              memory operations
3060                                                              to local have
3061                                                              completed before
3062                                                              performing the
3063                                                              following
3064                                                              fence-paired-atomic.
3065
3066      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3067                                - system                     vmcnt(0)
3068
3069                                                            - If OpenCL and
3070                                                              address space is
3071                                                              not generic, omit
3072                                                              lgkmcnt(0).
3073                                                            - If OpenCL and
3074                                                              address space is
3075                                                              local, omit
3076                                                              vmcnt(0).
3077                                                            - However, since LLVM
3078                                                              currently has no
3079                                                              address space on
3080                                                              the fence need to
3081                                                              conservatively
3082                                                              always generate. If
3083                                                              fence had an
3084                                                              address space then
3085                                                              set to address
3086                                                              space of OpenCL
3087                                                              fence flag, or to
3088                                                              generic if both
3089                                                              local and global
3090                                                              flags are
3091                                                              specified.
3092                                                            - Could be split into
3093                                                              separate s_waitcnt
3094                                                              vmcnt(0) and
3095                                                              s_waitcnt
3096                                                              lgkmcnt(0) to allow
3097                                                              them to be
3098                                                              independently moved
3099                                                              according to the
3100                                                              following rules.
3101                                                            - s_waitcnt vmcnt(0)
3102                                                              must happen after
3103                                                              any preceding
3104                                                              global/generic
3105                                                              load/store/load
3106                                                              atomic/store
3107                                                              atomic/atomicrmw.
3108                                                            - s_waitcnt lgkmcnt(0)
3109                                                              must happen after
3110                                                              any preceding
3111                                                              local/generic
3112                                                              load/store/load
3113                                                              atomic/store
3114                                                              atomic/atomicrmw.
3115                                                            - Must happen before
3116                                                              any following store
3117                                                              atomic/atomicrmw
3118                                                              with an equal or
3119                                                              wider sync scope
3120                                                              and memory ordering
3121                                                              stronger than
3122                                                              unordered (this is
3123                                                              termed the
3124                                                              fence-paired-atomic).
3125                                                            - Ensures that all
3126                                                              memory operations
3127                                                              have
3128                                                              completed before
3129                                                              performing the
3130                                                              following
3131                                                              fence-paired-atomic.
3132
3133      **Acquire-Release Atomic**
3134      -----------------------------------------------------------------------------------
3135      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3136                                - wavefront    - local
3137                                               - generic
3138      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3139
3140                                                            - If OpenCL, omit.
3141                                                            - Must happen after
3142                                                              any preceding
3143                                                              local/generic
3144                                                              load/store/load
3145                                                              atomic/store
3146                                                              atomic/atomicrmw.
3147                                                            - Must happen before
3148                                                              the following
3149                                                              atomicrmw.
3150                                                            - Ensures that all
3151                                                              memory operations
3152                                                              to local have
3153                                                              completed before
3154                                                              performing the
3155                                                              atomicrmw that is
3156                                                              being released.
3157
3158                                                          2. buffer/global/flat_atomic
3159      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3160                                                          2. s_waitcnt lgkmcnt(0)
3161
3162                                                            - If OpenCL, omit.
3163                                                            - Must happen before
3164                                                              any following
3165                                                              global/generic
3166                                                              load/load
3167                                                              atomic/store/store
3168                                                              atomic/atomicrmw.
3169                                                            - Ensures any
3170                                                              following global
3171                                                              data read is no
3172                                                              older than the load
3173                                                              atomic value being
3174                                                              acquired.
3175
3176      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3177
3178                                                            - If OpenCL, omit.
3179                                                            - Must happen after
3180                                                              any preceding
3181                                                              local/generic
3182                                                              load/store/load
3183                                                              atomic/store
3184                                                              atomic/atomicrmw.
3185                                                            - Must happen before
3186                                                              the following
3187                                                              atomicrmw.
3188                                                            - Ensures that all
3189                                                              memory operations
3190                                                              to local have
3191                                                              completed before
3192                                                              performing the
3193                                                              atomicrmw that is
3194                                                              being released.
3195
3196                                                          2. flat_atomic
3197                                                          3. s_waitcnt lgkmcnt(0)
3198
3199                                                            - If OpenCL, omit.
3200                                                            - Must happen before
3201                                                              any following
3202                                                              global/generic
3203                                                              load/load
3204                                                              atomic/store/store
3205                                                              atomic/atomicrmw.
3206                                                            - Ensures any
3207                                                              following global
3208                                                              data read is no
3209                                                              older than the load
3210                                                              atomic value being
3211                                                              acquired.
3212
3213      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3214                                - system                     vmcnt(0)
3215
3216                                                            - If OpenCL, omit
3217                                                              lgkmcnt(0).
3218                                                            - Could be split into
3219                                                              separate s_waitcnt
3220                                                              vmcnt(0) and
3221                                                              s_waitcnt
3222                                                              lgkmcnt(0) to allow
3223                                                              them to be
3224                                                              independently moved
3225                                                              according to the
3226                                                              following rules.
3227                                                            - s_waitcnt vmcnt(0)
3228                                                              must happen after
3229                                                              any preceding
3230                                                              global/generic
3231                                                              load/store/load
3232                                                              atomic/store
3233                                                              atomic/atomicrmw.
3234                                                            - s_waitcnt lgkmcnt(0)
3235                                                              must happen after
3236                                                              any preceding
3237                                                              local/generic
3238                                                              load/store/load
3239                                                              atomic/store
3240                                                              atomic/atomicrmw.
3241                                                            - Must happen before
3242                                                              the following
3243                                                              atomicrmw.
3244                                                            - Ensures that all
3245                                                              memory operations
3246                                                              to global have
3247                                                              completed before
3248                                                              performing the
3249                                                              atomicrmw that is
3250                                                              being released.
3251
3252                                                          2. buffer/global/flat_atomic
3253                                                          3. s_waitcnt vmcnt(0)
3254
3255                                                            - Must happen before
3256                                                              following
3257                                                              buffer_wbinvl1_vol.
3258                                                            - Ensures the
3259                                                              atomicrmw has
3260                                                              completed before
3261                                                              invalidating the
3262                                                              cache.
3263
3264                                                          4. buffer_wbinvl1_vol
3265
3266                                                            - Must happen before
3267                                                              any following
3268                                                              global/generic
3269                                                              load/load
3270                                                              atomic/atomicrmw.
3271                                                            - Ensures that
3272                                                              following loads
3273                                                              will not see stale
3274                                                              global data.
3275
3276      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3277                                - system                     vmcnt(0)
3278
3279                                                            - If OpenCL, omit
3280                                                              lgkmcnt(0).
3281                                                            - Could be split into
3282                                                              separate s_waitcnt
3283                                                              vmcnt(0) and
3284                                                              s_waitcnt
3285                                                              lgkmcnt(0) to allow
3286                                                              them to be
3287                                                              independently moved
3288                                                              according to the
3289                                                              following rules.
3290                                                            - s_waitcnt vmcnt(0)
3291                                                              must happen after
3292                                                              any preceding
3293                                                              global/generic
3294                                                              load/store/load
3295                                                              atomic/store
3296                                                              atomic/atomicrmw.
3297                                                            - s_waitcnt lgkmcnt(0)
3298                                                              must happen after
3299                                                              any preceding
3300                                                              local/generic
3301                                                              load/store/load
3302                                                              atomic/store
3303                                                              atomic/atomicrmw.
3304                                                            - Must happen before
3305                                                              the following
3306                                                              atomicrmw.
3307                                                            - Ensures that all
3308                                                              memory operations
3309                                                              to global have
3310                                                              completed before
3311                                                              performing the
3312                                                              atomicrmw that is
3313                                                              being released.
3314
3315                                                          2. flat_atomic
3316                                                          3. s_waitcnt vmcnt(0) &
3317                                                             lgkmcnt(0)
3318
3319                                                            - If OpenCL, omit
3320                                                              lgkmcnt(0).
3321                                                            - Must happen before
3322                                                              following
3323                                                              buffer_wbinvl1_vol.
3324                                                            - Ensures the
3325                                                              atomicrmw has
3326                                                              completed before
3327                                                              invalidating the
3328                                                              cache.
3329
3330                                                          4. buffer_wbinvl1_vol
3331
3332                                                            - Must happen before
3333                                                              any following
3334                                                              global/generic
3335                                                              load/load
3336                                                              atomic/atomicrmw.
3337                                                            - Ensures that
3338                                                              following loads
3339                                                              will not see stale
3340                                                              global data.
3341
3342      fence        acq_rel      - singlethread *none*     *none*
3343                                - wavefront
3344      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3345
3346                                                            - If OpenCL and
3347                                                              address space is
3348                                                              not generic, omit.
3349                                                            - However,
3350                                                              since LLVM
3351                                                              currently has no
3352                                                              address space on
3353                                                              the fence need to
3354                                                              conservatively
3355                                                              always generate
3356                                                              (see comment for
3357                                                              previous fence).
3358                                                            - Must happen after
3359                                                              any preceding
3360                                                              local/generic
3361                                                              load/load
3362                                                              atomic/store/store
3363                                                              atomic/atomicrmw.
3364                                                            - Must happen before
3365                                                              any following
3366                                                              global/generic
3367                                                              load/load
3368                                                              atomic/store/store
3369                                                              atomic/atomicrmw.
3370                                                            - Ensures that all
3371                                                              memory operations
3372                                                              to local have
3373                                                              completed before
3374                                                              performing any
3375                                                              following global
3376                                                              memory operations.
3377                                                            - Ensures that the
3378                                                              preceding
3379                                                              local/generic load
3380                                                              atomic/atomicrmw
3381                                                              with an equal or
3382                                                              wider sync scope
3383                                                              and memory ordering
3384                                                              stronger than
3385                                                              unordered (this is
3386                                                              termed the
3387                                                              acquire-fence-paired-atomic
3388                                                              ) has completed
3389                                                              before following
3390                                                              global memory
3391                                                              operations. This
3392                                                              satisfies the
3393                                                              requirements of
3394                                                              acquire.
3395                                                            - Ensures that all
3396                                                              previous memory
3397                                                              operations have
3398                                                              completed before a
3399                                                              following
3400                                                              local/generic store
3401                                                              atomic/atomicrmw
3402                                                              with an equal or
3403                                                              wider sync scope
3404                                                              and memory ordering
3405                                                              stronger than
3406                                                              unordered (this is
3407                                                              termed the
3408                                                              release-fence-paired-atomic
3409                                                              ). This satisfies the
3410                                                              requirements of
3411                                                              release.
3412
3413      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3414                                - system                     vmcnt(0)
3415
3416                                                            - If OpenCL and
3417                                                              address space is
3418                                                              not generic, omit
3419                                                              lgkmcnt(0).
3420                                                            - However, since LLVM
3421                                                              currently has no
3422                                                              address space on
3423                                                              the fence need to
3424                                                              conservatively
3425                                                              always generate
3426                                                              (see comment for
3427                                                              previous fence).
3428                                                            - Could be split into
3429                                                              separate s_waitcnt
3430                                                              vmcnt(0) and
3431                                                              s_waitcnt
3432                                                              lgkmcnt(0) to allow
3433                                                              them to be
3434                                                              independently moved
3435                                                              according to the
3436                                                              following rules.
3437                                                            - s_waitcnt vmcnt(0)
3438                                                              must happen after
3439                                                              any preceding
3440                                                              global/generic
3441                                                              load/store/load
3442                                                              atomic/store
3443                                                              atomic/atomicrmw.
3444                                                            - s_waitcnt lgkmcnt(0)
3445                                                              must happen after
3446                                                              any preceding
3447                                                              local/generic
3448                                                              load/store/load
3449                                                              atomic/store
3450                                                              atomic/atomicrmw.
3451                                                            - Must happen before
3452                                                              the following
3453                                                              buffer_wbinvl1_vol.
3454                                                            - Ensures that the
3455                                                              preceding
3456                                                              global/local/generic
3457                                                              load
3458                                                              atomic/atomicrmw
3459                                                              with an equal or
3460                                                              wider sync scope
3461                                                              and memory ordering
3462                                                              stronger than
3463                                                              unordered (this is
3464                                                              termed the
3465                                                              acquire-fence-paired-atomic
3466                                                              ) has completed
3467                                                              before invalidating
3468                                                              the cache. This
3469                                                              satisfies the
3470                                                              requirements of
3471                                                              acquire.
3472                                                            - Ensures that all
3473                                                              previous memory
3474                                                              operations have
3475                                                              completed before a
3476                                                              following
3477                                                              global/local/generic
3478                                                              store
3479                                                              atomic/atomicrmw
3480                                                              with an equal or
3481                                                              wider sync scope
3482                                                              and memory ordering
3483                                                              stronger than
3484                                                              unordered (this is
3485                                                              termed the
3486                                                              release-fence-paired-atomic
3487                                                              ). This satisfies the
3488                                                              requirements of
3489                                                              release.
3490
3491                                                          2. buffer_wbinvl1_vol
3492
3493                                                            - Must happen before
3494                                                              any following
3495                                                              global/generic
3496                                                              load/load
3497                                                              atomic/store/store
3498                                                              atomic/atomicrmw.
3499                                                            - Ensures that
3500                                                              following loads
3501                                                              will not see stale
3502                                                              global data. This
3503                                                              satisfies the
3504                                                              requirements of
3505                                                              acquire.
3506
3507      **Sequential Consistent Atomic**
3508      -----------------------------------------------------------------------------------
3509      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3510                                - wavefront    - local    load atomic acquire,
3511                                               - generic  except must generated
3512                                                          all instructions even
3513                                                          for OpenCL.*
3514      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3515                                               - generic
3516                                                            - Must
3517                                                              happen after
3518                                                              preceding
3519                                                              global/generic load
3520                                                              atomic/store
3521                                                              atomic/atomicrmw
3522                                                              with memory
3523                                                              ordering of seq_cst
3524                                                              and with equal or
3525                                                              wider sync scope.
3526                                                              (Note that seq_cst
3527                                                              fences have their
3528                                                              own s_waitcnt
3529                                                              lgkmcnt(0) and so do
3530                                                              not need to be
3531                                                              considered.)
3532                                                            - Ensures any
3533                                                              preceding
3534                                                              sequential
3535                                                              consistent local
3536                                                              memory instructions
3537                                                              have completed
3538                                                              before executing
3539                                                              this sequentially
3540                                                              consistent
3541                                                              instruction. This
3542                                                              prevents reordering
3543                                                              a seq_cst store
3544                                                              followed by a
3545                                                              seq_cst load. (Note
3546                                                              that seq_cst is
3547                                                              stronger than
3548                                                              acquire/release as
3549                                                              the reordering of
3550                                                              load acquire
3551                                                              followed by a store
3552                                                              release is
3553                                                              prevented by the
3554                                                              waitcnt of
3555                                                              the release, but
3556                                                              there is nothing
3557                                                              preventing a store
3558                                                              release followed by
3559                                                              load acquire from
3560                                                              competing out of
3561                                                              order.)
3562
3563                                                          2. *Following
3564                                                             instructions same as
3565                                                             corresponding load
3566                                                             atomic acquire,
3567                                                             except must generated
3568                                                             all instructions even
3569                                                             for OpenCL.*
3570      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3571                                                          load atomic acquire,
3572                                                          except must generated
3573                                                          all instructions even
3574                                                          for OpenCL.*
3575      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3576                                - system       - generic     vmcnt(0)
3577
3578                                                            - Could be split into
3579                                                              separate s_waitcnt
3580                                                              vmcnt(0)
3581                                                              and s_waitcnt
3582                                                              lgkmcnt(0) to allow
3583                                                              them to be
3584                                                              independently moved
3585                                                              according to the
3586                                                              following rules.
3587                                                            - waitcnt lgkmcnt(0)
3588                                                              must happen after
3589                                                              preceding
3590                                                              global/generic load
3591                                                              atomic/store
3592                                                              atomic/atomicrmw
3593                                                              with memory
3594                                                              ordering of seq_cst
3595                                                              and with equal or
3596                                                              wider sync scope.
3597                                                              (Note that seq_cst
3598                                                              fences have their
3599                                                              own s_waitcnt
3600                                                              lgkmcnt(0) and so do
3601                                                              not need to be
3602                                                              considered.)
3603                                                            - waitcnt vmcnt(0)
3604                                                              must happen after
3605                                                              preceding
3606                                                              global/generic load
3607                                                              atomic/store
3608                                                              atomic/atomicrmw
3609                                                              with memory
3610                                                              ordering of seq_cst
3611                                                              and with equal or
3612                                                              wider sync scope.
3613                                                              (Note that seq_cst
3614                                                              fences have their
3615                                                              own s_waitcnt
3616                                                              vmcnt(0) and so do
3617                                                              not need to be
3618                                                              considered.)
3619                                                            - Ensures any
3620                                                              preceding
3621                                                              sequential
3622                                                              consistent global
3623                                                              memory instructions
3624                                                              have completed
3625                                                              before executing
3626                                                              this sequentially
3627                                                              consistent
3628                                                              instruction. This
3629                                                              prevents reordering
3630                                                              a seq_cst store
3631                                                              followed by a
3632                                                              seq_cst load. (Note
3633                                                              that seq_cst is
3634                                                              stronger than
3635                                                              acquire/release as
3636                                                              the reordering of
3637                                                              load acquire
3638                                                              followed by a store
3639                                                              release is
3640                                                              prevented by the
3641                                                              waitcnt of
3642                                                              the release, but
3643                                                              there is nothing
3644                                                              preventing a store
3645                                                              release followed by
3646                                                              load acquire from
3647                                                              competing out of
3648                                                              order.)
3649
3650                                                          2. *Following
3651                                                             instructions same as
3652                                                             corresponding load
3653                                                             atomic acquire,
3654                                                             except must generated
3655                                                             all instructions even
3656                                                             for OpenCL.*
3657      store atomic seq_cst      - singlethread - global   *Same as corresponding
3658                                - wavefront    - local    store atomic release,
3659                                - workgroup    - generic  except must generated
3660                                                          all instructions even
3661                                                          for OpenCL.*
3662      store atomic seq_cst      - agent        - global   *Same as corresponding
3663                                - system       - generic  store atomic release,
3664                                                          except must generated
3665                                                          all instructions even
3666                                                          for OpenCL.*
3667      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3668                                - wavefront    - local    atomicrmw acq_rel,
3669                                - workgroup    - generic  except must generated
3670                                                          all instructions even
3671                                                          for OpenCL.*
3672      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3673                                - system       - generic  atomicrmw acq_rel,
3674                                                          except must generated
3675                                                          all instructions even
3676                                                          for OpenCL.*
3677      fence        seq_cst      - singlethread *none*     *Same as corresponding
3678                                - wavefront               fence acq_rel,
3679                                - workgroup               except must generated
3680                                - agent                   all instructions even
3681                                - system                  for OpenCL.*
3682      ============ ============ ============== ========== ===============================
3683
3684 The memory order also adds the single thread optimization constrains defined in
3685 table
3686 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3687
3688   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3689      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3690
3691      ============ ==============================================================
3692      LLVM Memory  Optimization Constraints
3693      Ordering
3694      ============ ==============================================================
3695      unordered    *none*
3696      monotonic    *none*
3697      acquire      - If a load atomic/atomicrmw then no following load/load
3698                     atomic/store/ store atomic/atomicrmw/fence instruction can
3699                     be moved before the acquire.
3700                   - If a fence then same as load atomic, plus no preceding
3701                     associated fence-paired-atomic can be moved after the fence.
3702      release      - If a store atomic/atomicrmw then no preceding load/load
3703                     atomic/store/ store atomic/atomicrmw/fence instruction can
3704                     be moved after the release.
3705                   - If a fence then same as store atomic, plus no following
3706                     associated fence-paired-atomic can be moved before the
3707                     fence.
3708      acq_rel      Same constraints as both acquire and release.
3709      seq_cst      - If a load atomic then same constraints as acquire, plus no
3710                     preceding sequentially consistent load atomic/store
3711                     atomic/atomicrmw/fence instruction can be moved after the
3712                     seq_cst.
3713                   - If a store atomic then the same constraints as release, plus
3714                     no following sequentially consistent load atomic/store
3715                     atomic/atomicrmw/fence instruction can be moved before the
3716                     seq_cst.
3717                   - If an atomicrmw/fence then same constraints as acq_rel.
3718      ============ ==============================================================
3719
3720 Trap Handler ABI
3721 ~~~~~~~~~~~~~~~~
3722
3723 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3724 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3725 the ``s_trap`` instruction with the following usage:
3726
3727   .. table:: AMDGPU Trap Handler for AMDHSA OS
3728      :name: amdgpu-trap-handler-for-amdhsa-os-table
3729
3730      =================== =============== =============== =======================
3731      Usage               Code Sequence   Trap Handler    Description
3732                                          Inputs
3733      =================== =============== =============== =======================
3734      reserved            ``s_trap 0x00``                 Reserved by hardware.
3735      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3736                                            ``queue_ptr`` ``debugtrap``
3737                                          ``VGPR0``:      intrinsic (not
3738                                            ``arg``       implemented).
3739      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3740                                            ``queue_ptr`` terminated and its
3741                                                          associated queue put
3742                                                          into the error state.
3743      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3744                                            ``queue_ptr`` installed handled
3745                                                          same as ``llvm.trap``.
3746      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3747                                                          breakpoints.
3748      debugger            ``s_trap 0x08``                 Reserved for debugger.
3749      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3750      debugger            ``s_trap 0xff``                 Reserved for debugger.
3751      =================== =============== =============== =======================
3752
3753 Unspecified OS
3754 --------------
3755
3756 This section provides code conventions used when the target triple OS is
3757 empty (see :ref:`amdgpu-target-triples`).
3758
3759 Trap Handler ABI
3760 ~~~~~~~~~~~~~~~~
3761
3762 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3763 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3764 instructions are handled as follows:
3765
3766   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3767      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3768
3769      =============== =============== ===========================================
3770      Usage           Code Sequence   Description
3771      =============== =============== ===========================================
3772      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3773      llvm.debugtrap  *none*          Compiler warning given that there is no
3774                                      trap handler installed.
3775      =============== =============== ===========================================
3776
3777 Source Languages
3778 ================
3779
3780 .. _amdgpu-opencl:
3781
3782 OpenCL
3783 ------
3784
3785 When generating code for the OpenCL language the target triple environment
3786 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3787
3788 When the language is OpenCL the following differences occur:
3789
3790 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3791 2. The AMDGPU backend adds additional arguments to the kernel.
3792 3. Additional metadata is generated
3793    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3794
3795 .. TODO
3796    Specify what affect this has. Hidden arguments added. Additional metadata
3797    generated.
3798
3799 .. _amdgpu-hcc:
3800
3801 HCC
3802 ---
3803
3804 When generating code for the OpenCL language the target triple environment
3805 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3806
3807 When the language is OpenCL the following differences occur:
3808
3809 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3810
3811 .. TODO
3812    Specify what affect this has.
3813
3814 Assembler
3815 ---------
3816
3817 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3818 It supports AMDGCN GFX6-GFX9.
3819
3820 This section describes general syntax for instructions and operands. For more
3821 information about instructions, their semantics and supported combinations of
3822 operands, refer to one of instruction set architecture manuals
3823 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3824
3825 An instruction has the following syntax (register operands are normally
3826 comma-separated while extra operands are space-separated):
3827
3828 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3829
3830 Operands
3831 ~~~~~~~~
3832
3833 The following syntax for register operands is supported:
3834
3835 * SGPR registers: s0, ... or s[0], ...
3836 * VGPR registers: v0, ... or v[0], ...
3837 * TTMP registers: ttmp0, ... or ttmp[0], ...
3838 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3839 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3840 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3841 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3842 * Register index expressions: v[2*2], s[1-1:2-1]
3843 * 'off' indicates that an operand is not enabled
3844
3845 The following extra operands are supported:
3846
3847 * offset, offset0, offset1
3848 * idxen, offen bits
3849 * glc, slc, tfe bits
3850 * waitcnt: integer or combination of counter values
3851 * VOP3 modifiers:
3852
3853   - abs (\| \|), neg (\-)
3854
3855 * DPP modifiers:
3856
3857   - row_shl, row_shr, row_ror, row_rol
3858   - row_mirror, row_half_mirror, row_bcast
3859   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3860   - row_mask, bank_mask, bound_ctrl
3861
3862 * SDWA modifiers:
3863
3864   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3865   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3866   - abs, neg, sext
3867
3868 Instruction Examples
3869 ~~~~~~~~~~~~~~~~~~~~
3870
3871 DS
3872 ~~
3873
3874 .. code-block:: nasm
3875
3876   ds_add_u32 v2, v4 offset:16
3877   ds_write_src2_b64 v2 offset0:4 offset1:8
3878   ds_cmpst_f32 v2, v4, v6
3879   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3880
3881
3882 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3883
3884 FLAT
3885 ++++
3886
3887 .. code-block:: nasm
3888
3889   flat_load_dword v1, v[3:4]
3890   flat_store_dwordx3 v[3:4], v[5:7]
3891   flat_atomic_swap v1, v[3:4], v5 glc
3892   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3893   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3894
3895 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3896
3897 MUBUF
3898 +++++
3899
3900 .. code-block:: nasm
3901
3902   buffer_load_dword v1, off, s[4:7], s1
3903   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3904   buffer_store_format_xy v[1:2], off, s[4:7], s1
3905   buffer_wbinvl1
3906   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3907
3908 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3909
3910 SMRD/SMEM
3911 +++++++++
3912
3913 .. code-block:: nasm
3914
3915   s_load_dword s1, s[2:3], 0xfc
3916   s_load_dwordx8 s[8:15], s[2:3], s4
3917   s_load_dwordx16 s[88:103], s[2:3], s4
3918   s_dcache_inv_vol
3919   s_memtime s[4:5]
3920
3921 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3922
3923 SOP1
3924 ++++
3925
3926 .. code-block:: nasm
3927
3928   s_mov_b32 s1, s2
3929   s_mov_b64 s[0:1], 0x80000000
3930   s_cmov_b32 s1, 200
3931   s_wqm_b64 s[2:3], s[4:5]
3932   s_bcnt0_i32_b64 s1, s[2:3]
3933   s_swappc_b64 s[2:3], s[4:5]
3934   s_cbranch_join s[4:5]
3935
3936 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3937
3938 SOP2
3939 ++++
3940
3941 .. code-block:: nasm
3942
3943   s_add_u32 s1, s2, s3
3944   s_and_b64 s[2:3], s[4:5], s[6:7]
3945   s_cselect_b32 s1, s2, s3
3946   s_andn2_b32 s2, s4, s6
3947   s_lshr_b64 s[2:3], s[4:5], s6
3948   s_ashr_i32 s2, s4, s6
3949   s_bfm_b64 s[2:3], s4, s6
3950   s_bfe_i64 s[2:3], s[4:5], s6
3951   s_cbranch_g_fork s[4:5], s[6:7]
3952
3953 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3954
3955 SOPC
3956 ++++
3957
3958 .. code-block:: nasm
3959
3960   s_cmp_eq_i32 s1, s2
3961   s_bitcmp1_b32 s1, s2
3962   s_bitcmp0_b64 s[2:3], s4
3963   s_setvskip s3, s5
3964
3965 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3966
3967 SOPP
3968 ++++
3969
3970 .. code-block:: nasm
3971
3972   s_barrier
3973   s_nop 2
3974   s_endpgm
3975   s_waitcnt 0 ; Wait for all counters to be 0
3976   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3977   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3978   s_sethalt 9
3979   s_sleep 10
3980   s_sendmsg 0x1
3981   s_sendmsg sendmsg(MSG_INTERRUPT)
3982   s_trap 1
3983
3984 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3985
3986 Unless otherwise mentioned, little verification is performed on the operands
3987 of SOPP Instructions, so it is up to the programmer to be familiar with the
3988 range or acceptable values.
3989
3990 VALU
3991 ++++
3992
3993 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3994 the assembler will automatically use optimal encoding based on its operands.
3995 To force specific encoding, one can add a suffix to the opcode of the instruction:
3996
3997 * _e32 for 32-bit VOP1/VOP2/VOPC
3998 * _e64 for 64-bit VOP3
3999 * _dpp for VOP_DPP
4000 * _sdwa for VOP_SDWA
4001
4002 VOP1/VOP2/VOP3/VOPC examples:
4003
4004 .. code-block:: nasm
4005
4006   v_mov_b32 v1, v2
4007   v_mov_b32_e32 v1, v2
4008   v_nop
4009   v_cvt_f64_i32_e32 v[1:2], v2
4010   v_floor_f32_e32 v1, v2
4011   v_bfrev_b32_e32 v1, v2
4012   v_add_f32_e32 v1, v2, v3
4013   v_mul_i32_i24_e64 v1, v2, 3
4014   v_mul_i32_i24_e32 v1, -3, v3
4015   v_mul_i32_i24_e32 v1, -100, v3
4016   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4017   v_max_f16_e32 v1, v2, v3
4018
4019 VOP_DPP examples:
4020
4021 .. code-block:: nasm
4022
4023   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4024   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4025   v_mov_b32 v0, v0 wave_shl:1
4026   v_mov_b32 v0, v0 row_mirror
4027   v_mov_b32 v0, v0 row_bcast:31
4028   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4029   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4030   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4031
4032 VOP_SDWA examples:
4033
4034 .. code-block:: nasm
4035
4036   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4037   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4038   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4039   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4040   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4041
4042 For full list of supported instructions, refer to "Vector ALU instructions".
4043
4044 HSA Code Object Directives
4045 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4046
4047 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4048 one can specify them with assembler directives.
4049
4050 .hsa_code_object_version major, minor
4051 +++++++++++++++++++++++++++++++++++++
4052
4053 *major* and *minor* are integers that specify the version of the HSA code
4054 object that will be generated by the assembler.
4055
4056 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4057 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4058
4059
4060 *major*, *minor*, and *stepping* are all integers that describe the instruction
4061 set architecture (ISA) version of the assembly program.
4062
4063 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4064 "AMD" and *arch* should always be equal to "AMDGPU".
4065
4066 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4067 from the value of the -mcpu option that is passed to the assembler.
4068
4069 .amdgpu_hsa_kernel (name)
4070 +++++++++++++++++++++++++
4071
4072 This directives specifies that the symbol with given name is a kernel entry point
4073 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4074
4075 .amd_kernel_code_t
4076 ++++++++++++++++++
4077
4078 This directive marks the beginning of a list of key / value pairs that are used
4079 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4080 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
4081 any amd_kernel_code_t values that are unspecified a default value will be
4082 used.  The default value for all keys is 0, with the following exceptions:
4083
4084 - *kernel_code_version_major* defaults to 1.
4085 - *machine_kind* defaults to 1.
4086 - *machine_version_major*, *machine_version_minor*, and
4087   *machine_version_stepping* are derived from the value of the -mcpu option
4088   that is passed to the assembler.
4089 - *kernel_code_entry_byte_offset* defaults to 256.
4090 - *wavefront_size* defaults to 6.
4091 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4092   *private_segment_alignment* default to 4. Note that alignments are specified
4093   as a power of two, so a value of **n** means an alignment of 2^ **n**.
4094
4095 The *.amd_kernel_code_t* directive must be placed immediately after the
4096 function label and before any instructions.
4097
4098 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4099 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4100
4101 Here is an example of a minimal amd_kernel_code_t specification:
4102
4103 .. code-block:: none
4104
4105    .hsa_code_object_version 1,0
4106    .hsa_code_object_isa
4107
4108    .hsatext
4109    .globl  hello_world
4110    .p2align 8
4111    .amdgpu_hsa_kernel hello_world
4112
4113    hello_world:
4114
4115       .amd_kernel_code_t
4116          enable_sgpr_kernarg_segment_ptr = 1
4117          is_ptr64 = 1
4118          compute_pgm_rsrc1_vgprs = 0
4119          compute_pgm_rsrc1_sgprs = 0
4120          compute_pgm_rsrc2_user_sgpr = 2
4121          kernarg_segment_byte_size = 8
4122          wavefront_sgpr_count = 2
4123          workitem_vgpr_count = 3
4124      .end_amd_kernel_code_t
4125
4126      s_load_dwordx2 s[0:1], s[0:1] 0x0
4127      v_mov_b32 v0, 3.14159
4128      s_waitcnt lgkmcnt(0)
4129      v_mov_b32 v1, s0
4130      v_mov_b32 v2, s1
4131      flat_store_dword v[1:2], v0
4132      s_endpgm
4133    .Lfunc_end0:
4134         .size   hello_world, .Lfunc_end0-hello_world
4135
4136 Additional Documentation
4137 ========================
4138
4139 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4140 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4141 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4142 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4143 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4144 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4145 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4146 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4147 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4148 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4149 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4150 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4151 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4152 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4153 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4154 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__