docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
  12 family through the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Architectures
  27      :name: amdgpu-architecture-table
  28
  29      ============ ==============================================================
  30      Architecture Description
  31      ============ ==============================================================
  32      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  33      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  34      ============ ==============================================================
  35
  36   .. table:: AMDGPU Vendors
  37      :name: amdgpu-vendor-table
  38
  39      ============ ==============================================================
  40      Vendor       Description
  41      ============ ==============================================================
  42      ``amd``      Can be used for all AMD GPU usage.
  43      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  44      ============ ==============================================================
  45
  46   .. table:: AMDGPU Operating Systems
  47      :name: amdgpu-os-table
  48
  49      ============== ============================================================
  50      OS             Description
  51      ============== ============================================================
  52      *<empty>*      Defaults to the *unknown* OS.
  53      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  54                     such as AMD's ROCm [AMD-ROCm]_.
  55      ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL
  56                     runtime.
  57      ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D
  58                     runtime.
  59      ============== ============================================================
  60
  61   .. table:: AMDGPU Environments
  62      :name: amdgpu-environment-table
  63
  64      ============ ==============================================================
  65      Environment  Description
  66      ============ ==============================================================
  67      *<empty>*    Default.
  68      ============ ==============================================================
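
As a rough illustration of how the triple components combine, the following
Python sketch joins values from the tables above into a triple string. The
helper name ``make_amdgpu_triple`` and its validation rules are hypothetical and
are not part of clang or LLVM. Dropping trailing empty components is an
assumption made for readability.

```python
# Component values taken from the tables above.
ARCHITECTURES = ("r600", "amdgcn")
VENDORS = ("amd", "mesa3d")
OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")  # "" is the unknown OS


def make_amdgpu_triple(arch, vendor="amd", os_name=""):
    """Hypothetical helper: build <Architecture>-<Vendor>-<OS>.

    The environment component is left empty, which is the only value
    listed in the environments table.
    """
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os_name!r}")
    # The vendor table says ``mesa3d`` can be used if the OS is ``mesa3d``.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    parts = [arch, vendor, os_name]
    # Assumption: trailing empty components are dropped instead of
    # leaving dangling '-' separators.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)
```

For example, ``make_amdgpu_triple("amdgcn", os_name="amdhsa")`` returns
``amdgcn-amd-amdhsa``, which could then be passed to the ``-target`` option.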
  69
  70 .. _amdgpu-processors:
  71
  72 Processors
  73 ----------
  74
  75 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
  76 Names from either the *Processor* or *Alternative Processor* column can be used.
  77
  78   .. table:: AMDGPU Processors
  79      :name: amdgpu-processor-table
  80
  81      =========== =============== ============ ===== ========= ======= ==================
  82      Processor   Alternative     Target       dGPU/ Target    ROCm    Example
  83                  Processor       Triple       APU   Features  Support Products
  84                                  Architecture       Supported
  85                                                     [Default]
  86      =========== =============== ============ ===== ========= ======= ==================
  87      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
  88      -----------------------------------------------------------------------------------
  89      ``r600``                    ``r600``     dGPU
  90      ``r630``                    ``r600``     dGPU
  91      ``rs880``                   ``r600``     dGPU
  92      ``rv670``                   ``r600``     dGPU
  93      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
  94      -----------------------------------------------------------------------------------
  95      ``rv710``                   ``r600``     dGPU
  96      ``rv730``                   ``r600``     dGPU
  97      ``rv770``                   ``r600``     dGPU
  98      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
  99      -----------------------------------------------------------------------------------
 100      ``cedar``                   ``r600``     dGPU
 101      ``cypress``                 ``r600``     dGPU
 102      ``juniper``                 ``r600``     dGPU
 103      ``redwood``                 ``r600``     dGPU
 104      ``sumo``                    ``r600``     dGPU
 105      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 106      -----------------------------------------------------------------------------------
 107      ``barts``                   ``r600``     dGPU
 108      ``caicos``                  ``r600``     dGPU
 109      ``cayman``                  ``r600``     dGPU
 110      ``turks``                   ``r600``     dGPU
 111      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 112      -----------------------------------------------------------------------------------
 113      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
 114      ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
 115                  - ``oland``
 116                  - ``pitcairn``
 117                  - ``verde``
 118      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 119      -----------------------------------------------------------------------------------
 120      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
 121                                                                       - A6 Pro-7050B
 122                                                                       - A8-7100
 123                                                                       - A8 Pro-7150B
 124                                                                       - A10-7300
 125                                                                       - A10 Pro-7350B
 126                                                                       - FX-7500
 127                                                                       - A8-7200P
 128                                                                       - A10-7400P
 129                                                                       - FX-7600P
 130      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
 131                                                                       - FirePro W9100
 132                                                                       - FirePro S9150
 133                                                                       - FirePro S9170
 134      ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
 135                                                                       - Radeon R9 290x
 136                                                                       - Radeon R9 390
 137                                                                       - Radeon R9 390x
 138      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
 139                  - ``mullins``                                        - E1-2200
 140                                                                       - E1-2500
 141                                                                       - E2-3000
 142                                                                       - E2-3800
 143                                                                       - A4-5000
 144                                                                       - A4-5100
 145                                                                       - A6-5200
 146                                                                       - A4 Pro-3340B
 147      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
 148                                                                       - Radeon HD 8770
 149                                                                       - R7 260
 150                                                                       - R7 260X
 151      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 152      -----------------------------------------------------------------------------------
 153      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
 154                                                       [on]            - Pro A6-8500B
 155                                                                       - A8-8600P
 156                                                                       - Pro A8-8600B
 157                                                                       - FX-8800P
 158                                                                       - Pro A12-8800B
 159      \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
 160                                                       [on]            - Pro A10-8700B
 161                                                                       - A10-8780P
 162      \                           ``amdgcn``   APU   - xnack           - A10-9600P
 163                                                       [on]            - A10-9630P
 164                                                                       - A12-9700P
 165                                                                       - A12-9730P
 166                                                                       - FX-9800P
 167                                                                       - FX-9830P
 168      \                           ``amdgcn``   APU   - xnack           - E2-9010
 169                                                       [on]            - A6-9210
 170                                                                       - A9-9410
 171      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
 172                  - ``tonga``                          [off]           - FirePro S7100
 173                                                                       - FirePro W7100
 174                                                                       - Radeon R285
 175                                                                       - Radeon R9 380
 176                                                                       - Radeon R9 385
 177                                                                       - Mobile FirePro
 178                                                                         M7170
 179      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
 180                                                       [off]           - Radeon R9 Fury
 181                                                                       - Radeon R9 FuryX
 182                                                                       - Radeon Pro Duo
 183                                                                       - FirePro S9300x2
 184                                                                       - Radeon Instinct MI8
 185      \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
 186                                                       [off]           - Radeon RX 480
 187                                                                       - Radeon Instinct MI6
 188      \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
 189                                                       [off]
 190      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
 191                                                       [on]
 192      **GCN GFX9** [AMD-GCN-GFX9]_
 193      -----------------------------------------------------------------------------------
 194      ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
 195                                                       [off]             Frontier Edition
 196                                                                       - Radeon RX Vega 56
 197                                                                       - Radeon RX Vega 64
 198                                                                       - Radeon RX Vega 64
 199                                                                         Liquid
 200                                                                       - Radeon Instinct MI25
 201      ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
 202                                                       [on]            - Ryzen 5 2400G
 203      ``gfx904``                  ``amdgcn``   dGPU  - xnack           *TBA*
 204                                                       [off]
 205                                                                       .. TODO
 206                                                                          Add product
 207                                                                          names.
 208      ``gfx906``                  ``amdgcn``   dGPU  - xnack           *TBA*
 209                                                       [off]
 210                                                                       .. TODO
 211                                                                          Add product
 212                                                                          names.
 213      =========== =============== ============ ===== ========= ======= ==================
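
The relationship between the *Processor* and *Alternative Processor* columns
for the ``amdgcn`` architecture can be sketched as a simple lookup. The mapping
below is transcribed from the table above. The function name
``canonical_processor`` is hypothetical and only illustrates that both spellings
select the same processor.

```python
# Alternative processor name -> processor name, for the amdgcn rows
# of the processor table.
ALTERNATIVE_PROCESSORS = {
    "tahiti": "gfx600",
    "hainan": "gfx601",
    "oland": "gfx601",
    "pitcairn": "gfx601",
    "verde": "gfx601",
    "kaveri": "gfx700",
    "hawaii": "gfx701",
    "kabini": "gfx703",
    "mullins": "gfx703",
    "bonaire": "gfx704",
    "carrizo": "gfx801",
    "iceland": "gfx802",
    "tonga": "gfx802",
    "fiji": "gfx803",
    "polaris10": "gfx803",
    "polaris11": "gfx803",
    "stoney": "gfx810",
}


def canonical_processor(name):
    """Hypothetical helper: map an alternative name to its gfx name.

    Names that are not alternative names (for example ``gfx900`` or the
    r600 family names) are returned unchanged.
    """
    return ALTERNATIVE_PROCESSORS.get(name, name)
```

With this sketch, ``canonical_processor("tonga")`` and
``canonical_processor("gfx802")`` both return ``gfx802``.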
 214
 215 .. _amdgpu-target-features:
 216
 217 Target Features
 218 ---------------
 219
 220 Target features control how code is generated to support certain
 221 processor-specific features. Not all target features are supported by
 222 all processors. The runtime must ensure that the features supported by
 223 the device used to execute the code match the features enabled when
 224 generating the code. A mismatch of features may result in incorrect
 225 execution, or a reduction in performance.
 226
 227 The target features supported by each processor, and the default value
 228 used if not specified explicitly, are listed in
 229 :ref:`amdgpu-processor-table`.
 230
 231 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
 232 target features.
 233
 234 For example:
 235
 236 ``-mxnack``
 237   Enable the ``xnack`` feature.
 238 ``-mno-xnack``
 239   Disable the ``xnack`` feature.
 240
 241   .. table:: AMDGPU Target Features
 242      :name: amdgpu-target-feature-table
 243
 244      ============== ==================================================
 245      Target Feature Description
 246      ============== ==================================================
 247      -m[no-]xnack   Enable/disable generating code that has
 248                     memory clauses that are compatible with
 249                     having XNACK replay enabled.
 250
 251                     This is used for demand paging and page
 252                     migration. If XNACK replay is enabled in
 253                     the device, then if a page fault occurs
 254                     the code may execute incorrectly if the
 255                     ``xnack`` feature is not enabled. Executing
 256                     code that has the feature enabled on a
 257                     device that does not have XNACK replay
 258                     enabled will execute correctly, but may
 259                     be less performant than code with the
 260                     feature disabled.
 261      ============== ==================================================
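
The interaction between the ``xnack`` feature of the generated code and the
XNACK replay setting of the device can be summarized in a small sketch. The
default values are transcribed from the processor table. The function names and
the returned strings are hypothetical and only restate the rules given in the
table above.

```python
# Default xnack setting per processor, from the processor table.
XNACK_DEFAULT = {
    "gfx801": True,
    "gfx802": False,
    "gfx803": False,
    "gfx810": True,
    "gfx900": False,
    "gfx902": True,
    "gfx904": False,
    "gfx906": False,
}


def xnack_option(enable):
    """Return the clang option spelling for the requested setting."""
    return "-mxnack" if enable else "-mno-xnack"


def effective_xnack(processor, explicit=None):
    """Hypothetical helper: explicit setting wins, else the table default."""
    if processor not in XNACK_DEFAULT:
        raise ValueError(f"{processor!r} does not list the xnack feature")
    return XNACK_DEFAULT[processor] if explicit is None else explicit


def xnack_compatibility(code_has_xnack, device_replay_enabled):
    """Restate the compatibility rules from the target feature table."""
    if device_replay_enabled and not code_has_xnack:
        # A page fault may cause the code to execute incorrectly.
        return "may execute incorrectly"
    if code_has_xnack and not device_replay_enabled:
        # Correct, but may be less performant than code without xnack.
        return "correct, possibly slower"
    return "matched"
```

For instance, code built for ``gfx900`` with default settings has ``xnack``
disabled, so running it on a device with XNACK replay enabled falls into the
"may execute incorrectly" case.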
 262
 263 .. _amdgpu-address-spaces:
 264
 265 Address Spaces
 266 --------------
 267
 268 The AMDGPU backend uses the following address space mappings.
 269
 270 The memory space names used in the table, aside from the region memory space,
 271 are from the OpenCL standard.
 272
 273 The LLVM address space number is used throughout LLVM (for example, in LLVM IR).
 274
 275   .. table:: Address Space Mapping
 276      :name: amdgpu-address-space-mapping-table
 277
 278      ================== =================
 279      LLVM Address Space Memory Space
 280      ================== =================
 281      0                  Generic (Flat)
 282      1                  Global
 283      2                  Region (GDS)
 284      3                  Local (group/LDS)
 285      4                  Constant
 286      5                  Private (Scratch)
 287      6                  Constant 32-bit
 288      ================== =================
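
For tooling that needs to refer to these numbers symbolically, the mapping can
be captured as an enumeration. This is only a sketch: the class and member names
are hypothetical, and the numeric values are copied from the table above.

```python
from enum import IntEnum


class AMDGPUAddressSpace(IntEnum):
    """Hypothetical names for the LLVM address space numbers above."""

    GENERIC = 0         # Generic (Flat)
    GLOBAL = 1          # Global
    REGION = 2          # Region (GDS)
    LOCAL = 3           # Local (group/LDS)
    CONSTANT = 4        # Constant
    PRIVATE = 5         # Private (Scratch)
    CONSTANT_32BIT = 6  # Constant 32-bit


def addrspace_qualifier(space):
    """Render the LLVM IR ``addrspace(N)`` qualifier for a pointer type.

    Address space 0 is the IR default and needs no qualifier.
    """
    space = AMDGPUAddressSpace(space)
    return "" if space == 0 else f"addrspace({int(space)})"
```

For example, ``addrspace_qualifier(AMDGPUAddressSpace.LOCAL)`` gives
``addrspace(3)``, the qualifier used on LLVM IR pointers into local memory.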
 289
 290 .. _amdgpu-memory-scopes:
 291
 292 Memory Scopes
 293 -------------
 294
 295 This section describes the LLVM memory synchronization scopes supported by the
 296 AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
 297 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 298
 299 The memory model supported is based on the HSA memory model [HSA]_ which is
 300 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 301 relation is transitive over the synchronizes-with relation independent of scope,
 302 and synchronizes-with allows the memory scope instances to be inclusive (see
 303 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 304
 305 This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
 306 inclusion and requires the memory scopes to match exactly. However, the AMDGPU
 307 model is conservatively correct for OpenCL.
 308
 309   .. table:: AMDHSA LLVM Sync Scopes
 310      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 311
 312      ================ ==========================================================
 313      LLVM Sync Scope  Description
 314      ================ ==========================================================
 315      *none*           The default: ``system``.
 316
 317                       Synchronizes with, and participates in modification and
 318                       seq_cst total orderings with, other operations (except
 319                       image operations) for all address spaces (except private,
 320                       or generic that accesses private) provided the other
 321                       operation's sync scope is:
 322
 323                       - ``system``.
 324                       - ``agent`` and executed by a thread on the same agent.
 325                       - ``workgroup`` and executed by a thread in the same
 326                         workgroup.
 327                       - ``wavefront`` and executed by a thread in the same
 328                         wavefront.
 329
 330      ``agent``        Synchronizes with, and participates in modification and
 331                       seq_cst total orderings with, other operations (except
 332                       image operations) for all address spaces (except private,
 333                       or generic that accesses private) provided the other
 334                       operation's sync scope is:
 335
 336                       - ``system`` or ``agent`` and executed by a thread on the
 337                         same agent.
 338                       - ``workgroup`` and executed by a thread in the same
 339                         workgroup.
 340                       - ``wavefront`` and executed by a thread in the same
 341                         wavefront.
 342
 343      ``workgroup``    Synchronizes with, and participates in modification and
 344                       seq_cst total orderings with, other operations (except
 345                       image operations) for all address spaces (except private,
 346                       or generic that accesses private) provided the other
 347                       operation's sync scope is:
 348
 349                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 350                         thread in the same workgroup.
 351                       - ``wavefront`` and executed by a thread in the same
 352                         wavefront.
 353
 354      ``wavefront``    Synchronizes with, and participates in modification and
 355                       seq_cst total orderings with, other operations (except
 356                       image operations) for all address spaces (except private,
 357                       or generic that accesses private) provided the other
 358                       operation's sync scope is:
 359
 360                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 361                         and executed by a thread in the same wavefront.
 362
 363      ``singlethread`` Only synchronizes with, and participates in modification
 364                       and seq_cst total orderings with, other operations (except
 365                       image operations) running in the same thread for all
 366                       address spaces (for example, in signal handlers).
 367      ================ ==========================================================
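
The scope inclusion rule in the table above can be read as: two operations
synchronize when each operation's scope instance contains the thread that
executes the other operation. The following sketch encodes that reading. The
function name and the ``closest_shared`` parameter (the narrowest level the two
threads have in common) are hypothetical, and the exclusions for image
operations and the private address space are not modeled.

```python
# Scopes ordered from narrowest to widest.
SCOPE_RANK = {
    "singlethread": 0,
    "wavefront": 1,
    "workgroup": 2,
    "agent": 3,
    "system": 4,
}


def scopes_synchronize(scope_a, scope_b, closest_shared):
    """Hypothetical check for scope inclusion between two operations.

    scope_a, scope_b: sync scope of each operation; None means the
        default scope, which the table defines as ``system``.
    closest_shared: narrowest level shared by the two threads, e.g.
        "workgroup" for different wavefronts of one workgroup, or
        "system" for threads on different agents.
    """
    rank_a = SCOPE_RANK[scope_a or "system"]
    rank_b = SCOPE_RANK[scope_b or "system"]
    # Both scope instances must be wide enough to contain both threads.
    return min(rank_a, rank_b) >= SCOPE_RANK[closest_shared]
```

Under this reading, an ``agent`` operation and a ``workgroup`` operation
synchronize when both threads are in the same workgroup, but an ``agent``
operation and a ``wavefront`` operation do not when the threads are in
different wavefronts of that workgroup.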
 368
 369 AMDGPU Intrinsics
 370 -----------------
 371
 372 The AMDGPU backend implements the following intrinsics.
 373
 374 *This section is WIP.*
 375
 376 .. TODO
 377    List AMDGPU intrinsics
 378
 379 Code Object
 380 ===========
 381
 382 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 383 can be linked by ``lld`` to produce a standard ELF shared code object which can
 384 be loaded and executed on an AMDGPU target.
 385
 386 Header
 387 ------
 388
 389 The AMDGPU backend uses the following ELF header:
 390
 391   .. table:: AMDGPU ELF Header
 392      :name: amdgpu-elf-header-table
 393
 394      ========================== ===============================
 395      Field                      Value
 396      ========================== ===============================
 397      ``e_ident[EI_CLASS]``      ``ELFCLASS32`` or ``ELFCLASS64``
 398      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 399      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 400                                 - ``ELFOSABI_AMDGPU_HSA``
 401                                 - ``ELFOSABI_AMDGPU_PAL``
 402                                 - ``ELFOSABI_AMDGPU_MESA3D``
 403      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
 404                                 - ``ELFABIVERSION_AMDGPU_PAL``
 405                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
 406      ``e_type``                 - ``ET_REL``
 407                                 - ``ET_DYN``
 408      ``e_machine``              ``EM_AMDGPU``
 409      ``e_entry``                0
 410      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
 411      ========================== ===============================
 412
 413 ..
 414
 415   .. table:: AMDGPU ELF Header Enumeration Values
 416      :name: amdgpu-elf-header-enumeration-values-table
 417
 418      =============================== =====
 419      Name                            Value
 420      =============================== =====
 421      ``EM_AMDGPU``                   224
 422      ``ELFOSABI_NONE``               0
 423      ``ELFOSABI_AMDGPU_HSA``         64
 424      ``ELFOSABI_AMDGPU_PAL``         65
 425      ``ELFOSABI_AMDGPU_MESA3D``      66
 426      ``ELFABIVERSION_AMDGPU_HSA``    1
 427      ``ELFABIVERSION_AMDGPU_PAL``    0
 428      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 429      =============================== =====
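
As a minimal sketch of how these header fields might be inspected, the
following Python code decodes the first 20 bytes of a code object. The AMDGPU
values come from the two tables above. The field offsets and the ``ET_REL`` and
``ET_DYN`` values are those of the generic ELF format. The function name
``describe_amdgpu_header`` is hypothetical.

```python
import struct

ELFCLASS = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELFDATA2LSB = 1
OSABI = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
E_TYPE = {1: "ET_REL", 3: "ET_DYN"}
EM_AMDGPU = 224


def describe_amdgpu_header(header):
    """Hypothetical helper: decode e_ident, e_type and e_machine."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident bytes 4..8: class, data encoding, version, OS ABI, ABI version.
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = header[4:9]
    if ei_data != ELFDATA2LSB:
        raise ValueError("AMDGPU code objects are little-endian")
    # e_type and e_machine follow the 16-byte e_ident array.
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    return {
        "class": ELFCLASS.get(ei_class, "unknown"),
        "osabi": OSABI.get(ei_osabi, "unknown"),
        "abiversion": ei_abiversion,
        "type": E_TYPE.get(e_type, "unknown"),
    }
```

A caller would typically read the first bytes of the file in binary mode and
pass them to this function before deciding how to load the code object.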
 430
 431 ``e_ident[EI_CLASS]``
 432   The ELF class is:
 433
 434   * ``ELFCLASS32`` for ``r600`` architecture.
 435
 436   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
 437     bit applications.
 438
 439 ``e_ident[EI_DATA]``
 440   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 441
 442 ``e_ident[EI_OSABI]``
 443   One of the following AMD GPU architecture specific OS ABIs
 444   (see :ref:`amdgpu-os-table`):
 445
 446   * ``ELFOSABI_NONE`` for *unknown* OS.
 447
 448   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
 449
 450   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
 451
 452   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
 453
 454 ``e_ident[EI_ABIVERSION]``
 455   The ABI version of the AMD GPU architecture specific OS ABI to which the code
 456   object conforms:
 457
 458   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 459     runtime ABI.
 460
 461   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 462     runtime ABI.
 463
 464   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD Mesa
 465     3D runtime ABI.
 466
 467 ``e_type``
 468   Can be one of the following values:
 469
 470
 471   ``ET_REL``
 472     The type produced by the AMD GPU backend compiler as it is a relocatable
 473     code object.
 474
 475   ``ET_DYN``
 476     The type produced by the linker as it is a shared code object.
 477
 478   The AMD HSA runtime loader requires an ``ET_DYN`` code object.
 479
 480 ``e_machine``
 481   The value ``EM_AMDGPU`` is used for the machine for all processors supported
 482   by the ``r600`` and ``amdgcn`` architectures (see
 483   :ref:`amdgpu-processor-table`). The specific processor is specified in the
 484   ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
 485   :ref:`amdgpu-elf-header-e_flags-table`).
 486
 487 ``e_entry``
 488   The entry point is 0 as the entry points for individual kernels must be
 489   selected in order to invoke them through AQL packets.
 490
 491 ``e_flags``
 492   The AMDGPU backend uses the following ELF header flags:
 493
 494   .. table:: AMDGPU ELF Header ``e_flags``
 495      :name: amdgpu-elf-header-e_flags-table
 496
 497      ================================= ========== =============================
 498      Name                              Value      Description
 499      ================================= ========== =============================
 500      **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
 501      -------------------------------------------- -----------------------------
 502      ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
 503                                                   mask for
 504                                                   ``EF_AMDGPU_MACH_xxx`` values
 505                                                   defined in
 506                                                   :ref:`amdgpu-ef-amdgpu-mach-table`.
 507      ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
 508                                                   target feature is
 509                                                   enabled for all code
 510                                                   contained in the code object.
 511                                                   If the processor
 512                                                   does not support the
 513                                                   ``xnack`` target
 514                                                   feature then must
 515                                                   be 0.
 516                                                   See
 517                                                   :ref:`amdgpu-target-features`.
 518      ================================= ========== =============================
 519
 520   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
 521      :name: amdgpu-ef-amdgpu-mach-table
 522
 523      ================================= ========== =============================
 524      Name                              Value      Description (see
 525                                                   :ref:`amdgpu-processor-table`)
 526      ================================= ========== =============================
 527      ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
 528      ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
 529      ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
 530      ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
 531      ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
 532      ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
 533      ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
 534      ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
 535      ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
 536      ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
 537      ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
 538      ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
 539      ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
 540      ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
 541      ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
 542      ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
 543      ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
 544      *reserved*                        0x011 -    Reserved for ``r600``
 545                                        0x01f      architecture processors.
 546      ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
 547      ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
 548      ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
 549      ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
 550      ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
 551      ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
 552      ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
 553      *reserved*                        0x027      Reserved.
 554      ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
 555      ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
 556      ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
 557      ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
 558      ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
 559      ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
 560      ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
 561      ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
 562      *reserved*                        0x030      Reserved.
 563      ================================= ========== =============================
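
To make the bit layout concrete, the following sketch splits an ``e_flags``
value into the processor selected by the ``EF_AMDGPU_MACH`` mask and the
``EF_AMDGPU_XNACK`` bit. The masks and processor values are copied from the two
tables above. The function name ``decode_e_flags`` is hypothetical.

```python
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100

# r600 architecture processors occupy 0x001 through 0x010 in table order.
_R600_NAMES = [
    "r600", "r630", "rs880", "rv670", "rv710", "rv730", "rv770",
    "cedar", "cypress", "juniper", "redwood", "sumo",
    "barts", "caicos", "cayman", "turks",
]
MACH_NAMES = {0x001 + i: name for i, name in enumerate(_R600_NAMES)}
MACH_NAMES.update({
    0x020: "gfx600", 0x021: "gfx601",
    0x022: "gfx700", 0x023: "gfx701", 0x024: "gfx702",
    0x025: "gfx703", 0x026: "gfx704",
    0x028: "gfx801", 0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810",
    0x02C: "gfx900", 0x02D: "gfx902", 0x02E: "gfx904", 0x02F: "gfx906",
})


def decode_e_flags(e_flags):
    """Hypothetical helper: return (processor name or None, xnack flag).

    None is returned for EF_AMDGPU_MACH_NONE and for reserved values.
    """
    mach = e_flags & EF_AMDGPU_MACH
    xnack = bool(e_flags & EF_AMDGPU_XNACK)
    return MACH_NAMES.get(mach), xnack
```

For example, ``decode_e_flags(0x12c)`` yields ``("gfx900", True)`` and
``decode_e_flags(0x02a)`` yields ``("gfx803", False)``.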
 564
 565 Sections
 566 --------
 567
 568 An AMDGPU target ELF code object has the standard ELF sections which include:
 569
 570   .. table:: AMDGPU ELF Sections
 571      :name: amdgpu-elf-sections-table
 572
 573      ================== ================ =================================
 574      Name               Type             Attributes
 575      ================== ================ =================================
 576      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 577      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 578      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 579      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 580      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 581      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 582      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 583      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 584      ``.note``          ``SHT_NOTE``     *none*
 585      ``.rela``\ *name*  ``SHT_RELA``     *none*
 586      ``.rela.dyn``      ``SHT_RELA``     *none*
 587      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 588      ``.shstrtab``      ``SHT_STRTAB``   *none*
 589      ``.strtab``        ``SHT_STRTAB``   *none*
 590      ``.symtab``        ``SHT_SYMTAB``   *none*
 591      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 592      ================== ================ =================================
 593
 594 These sections have their standard meanings (see [ELF]_) and are only generated
 595 if needed.
 596
 597 ``.debug_``\ *\**
 598   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 599   DWARF produced by the AMDGPU backend.
 600
 601 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 602   The standard sections used by a dynamic loader.
 603
 604 ``.note``
 605   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 606   backend.
 607
 608 ``.rela``\ *name*, ``.rela.dyn``
 609   For relocatable code objects, *name* is the name of the section to which
 610   the relocation records apply. For example, ``.rela.text`` is the section name
 611   for relocation records associated with the ``.text`` section.
 612
 613   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 614   records from each of the relocatable code objects' ``.rela``\ *name* sections.
 615
 616   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 617   the AMDGPU backend.
 618
 619 ``.text``
 620   The executable machine code for the kernels and functions they call. Generated
 621   as position independent code. See :ref:`amdgpu-code-conventions` for
 622   information on conventions used in the ISA generation.
 623
 624 .. _amdgpu-note-records:
 625
 626 Note Records
 627 ------------
 628
 629 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
 630 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
 631 aligned. In addition, minimal zero byte padding must be generated to ensure the
 632 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
 633 ``.note`` section must be at least 4 to indicate at least 4 byte alignment.
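
As a rough illustration of the padding rules above, the following Python sketch
lays out a single note record. It assumes the standard ELF note header of three
4 byte words (``namesz``, ``descsz``, ``type``) and little-endian byte order.
The helper names ``pad4`` and ``build_note`` and the placeholder ``desc`` bytes
are hypothetical and are not part of the AMDGPU backend.

```python
import struct

# Value taken from the note record enumeration table in this section.
NT_AMD_AMDGPU_HSA_METADATA = 10


def pad4(data: bytes) -> bytes:
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return data + b"\x00" * (-len(data) % 4)


def build_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Lay out one note record: header, padded name, padded desc."""
    name_bytes = name.encode("ascii") + b"\x00"  # namesz counts the null
    # namesz and descsz record the unpadded sizes (assumed little-endian).
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)


# Placeholder desc contents; real metadata is the YAML string described later.
note = build_note("AMD", NT_AMD_AMDGPU_HSA_METADATA, b"metadata\x00")
```

Here the 9 byte ``desc`` is followed by 3 zero bytes, so the next record, if
any, starts on a 4 byte boundary.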
 634
 635 The AMDGPU backend code object uses the following ELF note records in the
 636 ``.note`` section. The *Description* column specifies the layout of the note
 637 record's ``desc`` field. All fields are consecutive bytes. Note records with
 638 variable size strings have a corresponding ``*_size`` field that specifies the
 639 number of bytes, including the terminating null character, in the string. The
 640 string(s) come immediately after the preceding fields.
 641
 642 Additional note records can be present.
 643
 644   .. table:: AMDGPU ELF Note Records
 645      :name: amdgpu-elf-note-records-table
 646
 647      ===== ============================== ======================================
 648      Name  Type                           Description
 649      ===== ============================== ======================================
 650      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 651      ===== ============================== ======================================
 652
 653 ..
 654
 655   .. table:: AMDGPU ELF Note Record Enumeration Values
 656      :name: amdgpu-elf-note-record-enumeration-values-table
 657
 658      ============================== =====
 659      Name                           Value
 660      ============================== =====
 661      *reserved*                       0-9
 662      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 663      *reserved*                        11
 664      ============================== =====
 665
 666 ``NT_AMD_AMDGPU_HSA_METADATA``
 667   Specifies extensible metadata associated with the code objects executed on HSA
 668   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 669   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 670   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
 671   object metadata string.
 672
 673 .. _amdgpu-symbols:
 674
 675 Symbols
 676 -------
 677
 678 Symbols include the following:
 679
 680   .. table:: AMDGPU ELF Symbols
 681      :name: amdgpu-elf-symbols-table
 682
 683      ===================== ============== ============= ==================
 684      Name                  Type           Section       Description
 685      ===================== ============== ============= ==================
 686      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
 687                                           - ``.rodata``
 688                                           - ``.bss``
 689      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
 690      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
 691      ===================== ============== ============= ==================
 692
 693 Global variable
 694   Global variables both used and defined by the compilation unit.
 695
 696   If the symbol is defined in the compilation unit then it is allocated in the
 697   appropriate section according to whether it has initialized data or is readonly.
 698
 699   If the symbol is external then its section is ``SHN_UNDEF`` and the loader
 700   will resolve relocations using the definition provided by another code object
 701   or explicitly defined by the runtime.
 702
 703   All global symbols, whether defined in the compilation unit or external, are
 704   accessed by the machine code indirectly through a global offset table (GOT)
 705   entry. This allows them to be preemptable. The GOT is only supported when the
 706   target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 707
 708   .. TODO
 709      Add description of linked shared object symbols. Seems undefined symbols
 710      are marked as STT_NOTYPE.
 711
 712 Kernel descriptor
 713   Every HSA kernel has an associated kernel descriptor. It is the address of the
 714   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 715   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 716   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 717
 718 Kernel entry point
 719   Every HSA kernel also has a symbol for its machine code entry point.
 720
 721 .. _amdgpu-relocation-records:
 722
 723 Relocation Records
 724 ------------------
 725
 726 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 727 relocatable fields are:
 728
 729 ``word32``
 730   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 731   alignment. These values use the same byte order as other word values in the
 732   AMD GPU architecture.
 733
 734 ``word64``
 735   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 736   alignment. These values use the same byte order as other word values in the
 737   AMD GPU architecture.
 738
 739 The following notations are used for specifying relocation calculations:
 740
 741 **A**
 742   Represents the addend used to compute the value of the relocatable field.
 743
 744 **G**
 745   Represents the offset into the global offset table at which the relocation
 746   entry's symbol will reside during execution.
 747
 748 **GOT**
 749   Represents the address of the global offset table.
 750
 751 **P**
 752   Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
 753   of the storage unit being relocated (computed using ``r_offset``).
 754
 755 **S**
 756   Represents the value of the symbol whose index resides in the relocation
 757   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
 758
 759 **B**
 760   Represents the base address of a loaded executable or shared object which is
 761   the difference between the ELF address and the actual load address. Relocations
 762   using this are only valid in executable or shared objects.
 763
 764 The following relocation types are supported:
 765
 766   .. table:: AMDGPU ELF Relocation Records
 767      :name: amdgpu-elf-relocation-records-table
 768
 769      ========================== ======= =====  ==========  ==============================
 770      Relocation Type            Kind    Value  Field       Calculation
 771      ========================== ======= =====  ==========  ==============================
 772      ``R_AMDGPU_NONE``                  0      *none*      *none*
 773      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
 774                                 Dynamic
 775      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
 776                                 Dynamic
 777      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
 778                                 Dynamic
 779      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
 780      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
 781      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
 782                                 Dynamic
 783      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
 784      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 785      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
 786      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
 787      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
 788      *reserved*                         12
 789      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
 790      ========================== ======= =====  ==========  ==============================
 791
 792 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
 793 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
 794
 795 There is no current OS loader support for 32 bit programs and so
 796 ``R_AMDGPU_ABS32`` is not used.
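
The calculations in the relocation table can be sketched in Python as follows.
This is only an illustration: the function name ``relocate`` is hypothetical,
intermediate results are assumed to wrap as 64-bit two's complement values, and
the overflow checks a real linker or loader would perform are omitted.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocate(rel_type, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the value to store in the relocated field for rel_type."""
    # Wrap intermediates to 64 bits so negative differences shift correctly.
    abs64 = (S + A) & MASK64
    rel64 = (S + A - P) & MASK64
    got64 = (G + GOT + A - P) & MASK64
    values = {
        "R_AMDGPU_ABS32_LO": abs64 & MASK32,
        "R_AMDGPU_ABS32_HI": abs64 >> 32,
        "R_AMDGPU_ABS64": abs64,
        "R_AMDGPU_REL32": rel64 & MASK32,
        "R_AMDGPU_REL64": rel64,
        "R_AMDGPU_ABS32": abs64 & MASK32,
        "R_AMDGPU_GOTPCREL": got64 & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": got64 & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": got64 >> 32,
        "R_AMDGPU_REL32_LO": rel64 & MASK32,
        "R_AMDGPU_REL32_HI": rel64 >> 32,
        "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
    }
    return values[rel_type]


# A symbol located 0x1000 bytes before the place being relocated.
low = relocate("R_AMDGPU_REL32_LO", S=0x1000, P=0x2000)
high = relocate("R_AMDGPU_REL32_HI", S=0x1000, P=0x2000)
```

The ``_LO`` and ``_HI`` pairs together carry the two halves of one 64-bit
result, which is why a negative difference yields an all-ones high word.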
 797
 798 .. _amdgpu-dwarf:
 799
 800 DWARF
 801 -----
 802
 803 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
 804 information that maps the code object executable code and data to the source
 805 language constructs. It can be used by tools such as debuggers and profilers.
 806
 807 Address Space Mapping
 808 ~~~~~~~~~~~~~~~~~~~~~
 809
 810 The following address space mapping is used:
 811
 812   .. table:: AMDGPU DWARF Address Space Mapping
 813      :name: amdgpu-dwarf-address-space-mapping-table
 814
 815      =================== =================
 816      DWARF Address Space Memory Space
 817      =================== =================
 818      1                   Private (Scratch)
 819      2                   Local (group/LDS)
 820      *omitted*           Global
 821      *omitted*           Constant
 822      *omitted*           Generic (Flat)
 823      *not supported*     Region (GDS)
 824      =================== =================
 825
 826 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
 827 used in the table.
 828
 829 An ``address_class`` attribute is generated on pointer type DIEs to specify the
 830 DWARF address space of the value of the pointer when it is in the *private* or
 831 *local* address space. Otherwise the attribute is omitted.
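
The mapping table and the ``address_class`` rule can be summarized with a small
Python sketch. The dictionary and function names are hypothetical, and ``None``
is used here only to represent an omitted attribute.

```python
# DWARF address space numbers from the mapping table above. None means the
# attribute is omitted. Region (GDS) is not supported, so it has no entry.
DWARF_ADDRESS_SPACE = {
    "private": 1,
    "local": 2,
    "global": None,
    "constant": None,
    "generic": None,
}


def address_class_for(memory_space: str):
    """Return the address_class value for a pointer type DIE, or None."""
    key = memory_space.lower()
    if key not in DWARF_ADDRESS_SPACE:
        raise ValueError(f"no DWARF address space mapping for {memory_space!r}")
    return DWARF_ADDRESS_SPACE[key]
```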
 832
 833 An ``XDEREF`` operation is generated in location list expressions for variables
 834 that are allocated in the *private* or *local* address space. Otherwise no
 835 ``XDEREF`` operation is generated.
 836
 837 Register Mapping
 838 ~~~~~~~~~~~~~~~~
 839
 840 *This section is WIP.*
 841
 842 .. TODO
 843    Define DWARF register enumeration.
 844
 845    If want to present a wavefront state then should expose vector registers as
 846    64 wide (rather than per work-item view that LLVM uses). Either as separate
 847    registers, or a 64x4 byte single register. In either case use a new LANE op
 848    (akin to XDEREF) to select the current lane usage in a location
 849    expression. This would also allow scalar register spilling to vector register
 850    lanes to be expressed (currently no debug information is being generated for
 851    spilling). If choose a wide single register approach then use LANE in
 852    conjunction with PIECE operation to select the dword part of the register for
 853    the current lane. If the separate register approach then use LANE to select
 854    the register.
 855
 856 Source Text
 857 ~~~~~~~~~~~
 858
 859 Source text for online-compiled programs (e.g. those compiled by the OpenCL
 860 runtime) may be embedded into the DWARF v5 line table using the ``clang
 861 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
 862
 863 The option has two forms:
 864
 865 ``-gembed-source``
 866   Enable the embedded source DWARF v5 extension.
 867 ``-gno-embed-source``
 868   Disable the embedded source DWARF v5 extension.
 869
 870   .. table:: AMDGPU Debug Options
 871      :name: amdgpu-debug-options
 872
 873      ==================== ==================================================
 874      Debug Flag           Description
 875      ==================== ==================================================
 876      -g[no-]embed-source  Enable/disable embedding source text in DWARF
 877                           debug sections. Useful for environments where
 878                           source cannot be written to disk, such as
 879                           when performing online compilation.
 880      ==================== ==================================================
 881
 882 This option enables one extended content type in the DWARF v5 Line Number
 883 Program Header, which is used to encode embedded source.
 884
 885   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
 886      :name: amdgpu-dwarf-extended-content-types
 887
 888      ============================  ======================
 889      Content Type                  Form
 890      ============================  ======================
 891      ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
 892      ============================  ======================
 893
 894 The source field will contain the UTF-8 encoded, null-terminated source text
 895 with ``'\n'`` line endings. When the source field is present, consumers can use
 896 the embedded source instead of attempting to discover the source on disk. When
 897 the source field is absent, consumers can access the file to get the source
 898 text.
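
A consumer's fallback behavior might look like the following Python sketch. The
function name and the use of ``None`` to represent an absent source field are
assumptions made for illustration; an actual consumer would obtain both values
from the decoded line table.

```python
from pathlib import Path
from typing import Optional


def resolve_source(embedded_source: Optional[str], file_path: str) -> str:
    """Prefer the embedded source text, else read the file from disk."""
    if embedded_source is not None:
        # The source field is present, so no disk lookup is needed.
        return embedded_source
    return Path(file_path).read_text(encoding="utf-8")
```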
 899
 900 The above content type appears in the ``file_name_entry_format`` field of the
 901 line table prologue, and its corresponding value appears in the ``file_names``
 902 field. The current encoding of the content type is documented in table
 903 :ref:`amdgpu-dwarf-extended-content-types-encoding`.
 904
 905   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
 906      :name: amdgpu-dwarf-extended-content-types-encoding
 907
 908      ============================  ====================
 909      Content Type                  Value
 910      ============================  ====================
 911      ``DW_LNCT_LLVM_source``       0x2001
 912      ============================  ====================
 913
 914 .. _amdgpu-code-conventions:
 915
 916 Code Conventions
 917 ================
 918
 919 This section provides code conventions used for each supported target triple OS
 920 (see :ref:`amdgpu-target-triples`).
 921
 922 AMDHSA
 923 ------
 924
 925 This section provides code conventions used when the target triple OS is
 926 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 927
 928 .. _amdgpu-amdhsa-hsa-code-object-metadata:
 929
 930 Code Object Target Identification
 931 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 932
 933 The AMDHSA OS uses the following syntax to specify the code object
 934 target as a single string:
 935
 936   ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
 937
 938 Where:
 939
 940   - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
 941     are the same as the *Target Triple* (see
 942     :ref:`amdgpu-target-triples`).
 943
 944   - ``<Processor>`` is the same as the *Processor* (see
 945     :ref:`amdgpu-processors`).
 946
 947   - ``<Target Features>`` is a list of the enabled *Target Features*
 948     (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
 949     apply to *Processor*. The list must be in the same order as listed
 950     in the table :ref:`amdgpu-target-feature-table`. Note that *Target
 951     Features* must be included in the list if they are enabled even if
 952     that is the default for *Processor*.
 953
 954 For example:
 955
 956   ``"amdgcn-amd-amdhsa--gfx902+xnack"``
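
The syntax can be assembled mechanically, as in this Python sketch. The
function name is hypothetical, and ``feature_order`` stands in for the order
given in the target feature table, which the caller is assumed to supply.

```python
def format_target_id(arch, vendor, os_name, environment, processor,
                     enabled_features, feature_order):
    """Build the code object target string from its components."""
    unknown = set(enabled_features) - set(feature_order)
    if unknown:
        raise ValueError(f"features missing from feature_order: {sorted(unknown)}")
    # Emit enabled features in table order, each prefixed by a plus.
    ordered = [name for name in feature_order if name in enabled_features]
    triple = "-".join([arch, vendor, os_name, environment])
    return f"{triple}-{processor}" + "".join(f"+{name}" for name in ordered)


# An empty environment produces the double hyphen seen in the example above.
example = format_target_id("amdgcn", "amd", "amdhsa", "", "gfx902",
                           enabled_features={"xnack"},
                           feature_order=["xnack"])
```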
 957
 958 Code Object Metadata
 959 ~~~~~~~~~~~~~~~~~~~~
 960
 961 The code object metadata specifies extensible metadata associated with the code
 962 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
 963 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
 964 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
 965 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
 966 information necessary to support the ROCm kernel queries, for example, the
 967 segment sizes needed in a dispatch packet. In addition, a high level language
 968 runtime may require other information to be included. For example, the AMD
 969 OpenCL runtime records kernel argument information.
 970
 971 The metadata is specified as a YAML formatted string (see [YAML]_ and
 972 :doc:`YamlIO`).
 973
 974 .. TODO
 975    Is the string null terminated? It probably should not if YAML allows it to
 976    contain null characters, otherwise it should be.
 977
 978 The metadata is represented as a single YAML document comprised of the mapping
 979 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 980 referenced tables.
 981
 982 For boolean values, the string values of ``false`` and ``true`` are used for
 983 false and true respectively.
 984
 985 Additional information can be added to the mappings. To avoid conflicts, any
 986 non-AMD key names should be prefixed by "*vendor-name*.".
 987
 988   .. table:: AMDHSA Code Object Metadata Mapping
 989      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 990
 991      ========== ============== ========= =======================================
 992      String Key Value Type     Required? Description
 993      ========== ============== ========= =======================================
 994      "Version"  sequence of    Required  - The first integer is the major
 995                 2 integers                 version. Currently 1.
 996                                          - The second integer is the minor
 997                                            version. Currently 0.
 998      "Printf"   sequence of              Each string is encoded information
 999                 strings                  about a printf function call. The
1000                                          encoded information is organized as
1001                                          fields separated by colons (':'):
1002
1003                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1004
1005                                          where:
1006
1007                                          ``ID``
1008                                            A 32 bit integer as a unique id for
1009                                            each printf function call
1010
1011                                          ``N``
1012                                            A 32 bit integer equal to the number
1013                                            of arguments of printf function call
1014                                            minus 1
1015
1016                                          ``S[i]`` (where i = 0, 1, ... , N-1)
1017                                            32 bit integers for the size in bytes
1018                                            of the i-th FormatString argument of
1019                                            the printf function call
1020
1021                                          FormatString
1022                                            The format string passed to the
1023                                            printf function call.
1024      "Kernels"  sequence of    Required  Sequence of the mappings for each
1025                 mapping                  kernel in the code object. See
1026                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1027                                          for the definition of the mapping.
1028      ========== ============== ========= =======================================
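
The "Printf" encoding described in the table can be decoded with a sketch like
the one below. The function name is hypothetical, and the sketch assumes the
fields carry no escaping beyond what the table describes. Because the format
string may itself contain colons, the split is limited to the ``N`` size
fields.

```python
from typing import List, Tuple


def parse_printf_entry(entry: str) -> Tuple[int, List[int], str]:
    """Split ID:N:S[0]:...:S[N-1]:FormatString into (ID, sizes, format)."""
    id_text, count_text, rest = entry.split(":", 2)
    count = int(count_text)
    # Split at most `count` times so colons in the format string survive.
    parts = rest.split(":", count)
    if len(parts) != count + 1:
        raise ValueError(f"expected {count} size fields in {entry!r}")
    sizes = [int(text) for text in parts[:count]]
    return int(id_text), sizes, parts[count]


# Two arguments of 4 and 8 bytes, and a format string containing a colon.
printf_id, arg_sizes, fmt = parse_printf_entry("1:2:4:8:x=%d y=%ld: done")
```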
1029
1030 ..
1031
1032   .. table:: AMDHSA Code Object Kernel Metadata Mapping
1033      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1034
1035      ================= ============== ========= ================================
1036      String Key        Value Type     Required? Description
1037      ================= ============== ========= ================================
1038      "Name"            string         Required  Source name of the kernel.
1039      "SymbolName"      string         Required  Name of the kernel
1040                                                 descriptor ELF symbol.
1041      "Language"        string                   Source language of the kernel.
1042                                                 Values include:
1043
1044                                                 - "OpenCL C"
1045                                                 - "OpenCL C++"
1046                                                 - "HCC"
1047                                                 - "OpenMP"
1048
1049      "LanguageVersion" sequence of              - The first integer is the major
1050                        2 integers                 version.
1051                                                 - The second integer is the
1052                                                   minor version.
1053      "Attrs"           mapping                  Mapping of kernel attributes.
1054                                                 See
1055                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1056                                                 for the mapping definition.
1057      "Args"            sequence of              Sequence of mappings of the
1058                        mapping                  kernel arguments. See
1059                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1060                                                 for the definition of the mapping.
1061      "CodeProps"       mapping                  Mapping of properties related to
1062                                                 the kernel code. See
1063                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1064                                                 for the mapping definition.
1065      ================= ============== ========= ================================
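
As an illustration of the *Required?* columns in the two tables above, the
following Python sketch checks an already parsed metadata document, represented
as plain dictionaries and lists. The function name and the sample kernel are
hypothetical, and only the top-level and kernel mappings are checked.

```python
def missing_required_keys(metadata: dict) -> list:
    """List required keys absent from the top-level and kernel mappings."""
    missing = [key for key in ("Version", "Kernels") if key not in metadata]
    for index, kernel in enumerate(metadata.get("Kernels", [])):
        for key in ("Name", "SymbolName"):
            if key not in kernel:
                missing.append(f"Kernels[{index}].{key}")
    return missing


# Hypothetical document for a single OpenCL kernel named "foo".
sample = {
    "Version": [1, 0],
    "Kernels": [
        {
            "Name": "foo",
            "SymbolName": "foo@kd",  # kernel descriptor symbol
            "Language": "OpenCL C",
            "LanguageVersion": [2, 0],
        }
    ],
}
```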
1066
1067 ..
1068
1069   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1070      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1071
1072      =================== ============== ========= ==============================
1073      String Key          Value Type     Required? Description
1074      =================== ============== ========= ==============================
1075      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1076                          3 integers               must be >=1 and the dispatch
1077                                                   work-group size X, Y, Z must
1078                                                   correspond to the specified
1079                                                   values. Defaults to 0, 0, 0.
1080
1081                                                   Corresponds to the OpenCL
1082                                                   ``reqd_work_group_size``
1083                                                   attribute.
1084      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1085                          3 integers               X, Y, Z is likely to be the
1086                                                   specified values.
1087
1088                                                   Corresponds to the OpenCL
1089                                                   ``work_group_size_hint``
1090                                                   attribute.
1091      "VecTypeHint"       string                   The name of a scalar or vector
1092                                                   type.
1093
1094                                                   Corresponds to the OpenCL
1095                                                   ``vec_type_hint`` attribute.
1096
1097      "RuntimeHandle"     string                   The external symbol name
1098                                                   associated with a kernel.
1099                                                   OpenCL runtime allocates a
1100                                                   global buffer for the symbol
1101                                                   and saves the kernel's address
1102                                                   to it, which is used for
1103                                                   device side enqueueing. Only
1104                                                   available for device side
1105                                                   enqueued kernels.
1106      =================== ============== ========= ==============================
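
The "ReqdWorkGroupSize" rule can be expressed as a short check. This Python
sketch uses a hypothetical function name and simply mirrors the wording of the
table; it is not taken from any runtime.

```python
from typing import Sequence


def dispatch_matches_reqd_size(reqd: Sequence[int],
                               dispatch: Sequence[int]) -> bool:
    """Check a dispatch work-group size against "ReqdWorkGroupSize"."""
    reqd, dispatch = tuple(reqd), tuple(dispatch)
    if len(reqd) != 3 or len(dispatch) != 3:
        raise ValueError("expected X, Y, Z triples")
    if reqd == (0, 0, 0):
        return True  # default value: no required size
    if any(value < 1 for value in reqd):
        raise ValueError("a non-default ReqdWorkGroupSize needs all values >= 1")
    return dispatch == reqd
```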
1107
1108 ..
1109
1110   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1111      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1112
1113      ================= ============== ========= ================================
1114      String Key        Value Type     Required? Description
1115      ================= ============== ========= ================================
1116      "Name"            string                   Kernel argument name.
1117      "TypeName"        string                   Kernel argument type name.
1118      "Size"            integer        Required  Kernel argument size in bytes.
1119      "Align"           integer        Required  Kernel argument alignment in
1120                                                 bytes. Must be a power of two.
1121      "ValueKind"       string         Required  Kernel argument kind that
1122                                                 specifies how to set up the
1123                                                 corresponding argument.
1124                                                 Values include:
1125
1126                                                 "ByValue"
1127                                                   The argument is copied
1128                                                   directly into the kernarg.
1129
1130                                                 "GlobalBuffer"
1131                                                   A global address space pointer
1132                                                   to the buffer data is passed
1133                                                   in the kernarg.
1134
1135                                                 "DynamicSharedPointer"
1136                                                   A group address space pointer
1137                                                   to dynamically allocated LDS
1138                                                   is passed in the kernarg.
1139
1140                                                 "Sampler"
1141                                                   A global address space
1142                                                   pointer to a S# is passed in
1143                                                   the kernarg.
1144
1145                                                 "Image"
1146                                                   A global address space
1147                                                   pointer to a T# is passed in
1148                                                   the kernarg.
1149
1150                                                 "Pipe"
1151                                                   A global address space pointer
1152                                                   to an OpenCL pipe is passed in
1153                                                   the kernarg.
1154
1155                                                 "Queue"
1156                                                   A global address space pointer
1157                                                   to an OpenCL device enqueue
1158                                                   queue is passed in the
1159                                                   kernarg.
1160
1161                                                 "HiddenGlobalOffsetX"
1162                                                   The OpenCL grid dispatch
1163                                                   global offset for the X
1164                                                   dimension is passed in the
1165                                                   kernarg.
1166
1167                                                 "HiddenGlobalOffsetY"
1168                                                   The OpenCL grid dispatch
1169                                                   global offset for the Y
1170                                                   dimension is passed in the
1171                                                   kernarg.
1172
1173                                                 "HiddenGlobalOffsetZ"
1174                                                   The OpenCL grid dispatch
1175                                                   global offset for the Z
1176                                                   dimension is passed in the
1177                                                   kernarg.
1178
1179                                                 "HiddenNone"
1180                                                   An argument that is not used
1181                                                   by the kernel. Space needs to
1182                                                   be left for it, but it does
1183                                                   not need to be set up.
1184
1185                                                 "HiddenPrintfBuffer"
1186                                                   A global address space pointer
1187                                                   to the runtime printf buffer
1188                                                   is passed in kernarg.
1189
1190                                                 "HiddenDefaultQueue"
1191                                                   A global address space pointer
1192                                                   to the OpenCL device enqueue
1193                                                   queue that should be used by
1194                                                   the kernel by default is
1195                                                   passed in the kernarg.
1196
1197                                                 "HiddenCompletionAction"
1198                                                   A global address space pointer
1199                                                   to help link enqueued kernels into
1200                                                   the ancestor tree for determining
1201                                                   when the parent kernel has finished.
1202
1203      "ValueType"       string         Required  Kernel argument value type. Only
1204                                                 present if "ValueKind" is
1205                                                 "ByValue". For vector data
1206                                                 types, the value is for the
1207                                                 element type. Values include:
1208
1209                                                 - "Struct"
1210                                                 - "I8"
1211                                                 - "U8"
1212                                                 - "I16"
1213                                                 - "U16"
1214                                                 - "F16"
1215                                                 - "I32"
1216                                                 - "U32"
1217                                                 - "F32"
1218                                                 - "I64"
1219                                                 - "U64"
1220                                                 - "F64"
1221
1222                                                 .. TODO
1223                                                    How can it be determined if a
1224                                                    vector type, and what size
1225                                                    vector?
1226      "PointeeAlign"    integer                  Alignment in bytes of pointee
1227                                                 type for pointer type kernel
1228                                                 argument. Must be a power
1229                                                 of 2. Only present if
1230                                                 "ValueKind" is
1231                                                 "DynamicSharedPointer".
1232      "AddrSpaceQual"   string                   Kernel argument address space
1233                                                 qualifier. Only present if
1234                                                 "ValueKind" is "GlobalBuffer" or
1235                                                 "DynamicSharedPointer". Values
1236                                                 are:
1237
1238                                                 - "Private"
1239                                                 - "Global"
1240                                                 - "Constant"
1241                                                 - "Local"
1242                                                 - "Generic"
1243                                                 - "Region"
1244
1245                                                 .. TODO
1246                                                    Is GlobalBuffer only Global
1247                                                    or Constant? Is
1248                                                    DynamicSharedPointer always
1249                                                    Local? Can HCC allow Generic?
1250                                                    How can Private or Region
1251                                                    ever happen?
1252      "AccQual"         string                   Kernel argument access
1253                                                 qualifier. Only present if
1254                                                 "ValueKind" is "Image" or
1255                                                 "Pipe". Values
1256                                                 are:
1257
1258                                                 - "ReadOnly"
1259                                                 - "WriteOnly"
1260                                                 - "ReadWrite"
1261
1262                                                 .. TODO
1263                                                    Does this apply to
1264                                                    GlobalBuffer?
1265      "ActualAccQual"   string                   The actual memory accesses
1266                                                 performed by the kernel on the
1267                                                 kernel argument. Only present if
1268                                                 "ValueKind" is "GlobalBuffer",
1269                                                 "Image", or "Pipe". This may be
1270                                                 more restrictive than indicated
1271                                                 by "AccQual" to reflect what the
1272                                                 kernel actually does. If not
1273                                                 present then the runtime must
1274                                                 assume what is implied by
1275                                                 "AccQual" and "IsConst". Values
1276                                                 are:
1277
1278                                                 - "ReadOnly"
1279                                                 - "WriteOnly"
1280                                                 - "ReadWrite"
1281
1282      "IsConst"         boolean                  Indicates if the kernel argument
1283                                                 is const qualified. Only present
1284                                                 if "ValueKind" is
1285                                                 "GlobalBuffer".
1286
1287      "IsRestrict"      boolean                  Indicates if the kernel argument
1288                                                 is restrict qualified. Only
1289                                                 present if "ValueKind" is
1290                                                 "GlobalBuffer".
1291
1292      "IsVolatile"      boolean                  Indicates if the kernel argument
1293                                                 is volatile qualified. Only
1294                                                 present if "ValueKind" is
1295                                                 "GlobalBuffer".
1296
1297      "IsPipe"          boolean                  Indicates if the kernel argument
1298                                                 is pipe qualified. Only present
1299                                                 if "ValueKind" is "Pipe".
1300
1301                                                 .. TODO
1302                                                    Can GlobalBuffer be pipe
1303                                                    qualified?
1304      ================= ============== ========= ================================
1305
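The "Only present if" rules in the kernel argument metadata table above can be
checked mechanically. The following is a minimal sketch, assuming the metadata
for one argument has already been parsed into a Python dictionary. The helper
name ``check_arg_metadata`` and the dictionary representation are hypothetical
and are not part of any LLVM or ROCm API.

```python
# Hypothetical helper. Each key maps to the "ValueKind" values for which the
# table above says the key may be present.
ALLOWED_VALUE_KINDS = {
    "ValueType": {"ByValue"},
    "PointeeAlign": {"DynamicSharedPointer"},
    "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
    "AccQual": {"Image", "Pipe"},
    "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
    "IsConst": {"GlobalBuffer"},
    "IsRestrict": {"GlobalBuffer"},
    "IsVolatile": {"GlobalBuffer"},
    "IsPipe": {"Pipe"},
}


def check_arg_metadata(arg):
    """Return a list of rule violations for one kernel argument dictionary."""
    errors = []
    kind = arg.get("ValueKind")
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            errors.append(f"{key} is not valid for ValueKind {kind}")
    # "PointeeAlign" must be a power of 2 when it is present.
    align = arg.get("PointeeAlign")
    if align is not None and (align <= 0 or (align & (align - 1)) != 0):
        errors.append("PointeeAlign must be a power of 2")
    return errors
```

An empty list means none of the presence rules were violated. The sketch does
not check the permitted string values of each key.
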
1306 ..
1307
1308   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1309      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1310
1311      ============================ ============== ========= =====================
1312      String Key                   Value Type     Required? Description
1313      ============================ ============== ========= =====================
1314      "KernargSegmentSize"         integer        Required  The size in bytes of
1315                                                            the kernarg segment
1316                                                            that holds the values
1317                                                            of the arguments to
1318                                                            the kernel.
1319      "GroupSegmentFixedSize"      integer        Required  The amount of group
1320                                                            segment memory
1321                                                            required by a
1322                                                            work-group in
1323                                                            bytes. This does not
1324                                                            include any
1325                                                            dynamically allocated
1326                                                            group segment memory
1327                                                            that may be added
1328                                                            when the kernel is
1329                                                            dispatched.
1330      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1331                                                            private address space
1332                                                            memory required for a
1333                                                            work-item in
1334                                                            bytes. If the kernel
1335                                                            uses a dynamic call
1336                                                            stack then additional
1337                                                            space must be added
1338                                                            to this value for the
1339                                                            call stack.
1340      "KernargSegmentAlign"        integer        Required  The maximum byte
1341                                                            alignment of
1342                                                            arguments in the
1343                                                            kernarg segment. Must
1344                                                            be a power of 2.
1345      "WavefrontSize"              integer        Required  Wavefront size. Must
1346                                                            be a power of 2.
1347      "NumSGPRs"                   integer        Required  Number of scalar
1348                                                            registers used by a
1349                                                            wavefront for
1350                                                            GFX6-GFX9. This
1351                                                            includes the special
1352                                                            SGPRs for VCC, Flat
1353                                                            Scratch (GFX7-GFX9)
1354                                                            and XNACK (for
1355                                                            GFX8-GFX9). It does
1356                                                            not include the 16
1357                                                            SGPR added if a trap
1358                                                            handler is
1359                                                            enabled. It is not
1360                                                            rounded up to the
1361                                                            allocation
1362                                                            granularity.
1363      "NumVGPRs"                   integer        Required  Number of vector
1364                                                            registers used by
1365                                                            each work-item for
1366                                                            GFX6-GFX9.
1367      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1368                                                            work-group size
1369                                                            supported by the
1370                                                            kernel in work-items.
1371                                                            Must be >=1 and
1372                                                            consistent with
1373                                                            ReqdWorkGroupSize if
1374                                                            not 0, 0, 0.
1375      "NumSpilledSGPRs"            integer                  Number of stores from
1376                                                            a scalar register to
1377                                                            a register allocator
1378                                                            created spill
1379                                                            location.
1380      "NumSpilledVGPRs"            integer                  Number of stores from
1381                                                            a vector register to
1382                                                            a register allocator
1383                                                            created spill
1384                                                            location.
1385      ============================ ============== ========= =====================
1386
1387 ..
1388
1389 Kernel Dispatch
1390 ~~~~~~~~~~~~~~~
1391
1392 The HSA architected queuing language (AQL) defines a user space memory interface
1393 that can be used to control the dispatch of kernels, in an agent independent
1394 way. An agent can have zero or more AQL queues created for it using the ROCm
1395 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1396 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1397 mechanics and packet layouts.
1398
1399 The packet processor of a kernel agent is responsible for detecting and
1400 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1401 packet processor is implemented by the hardware command processor (CP),
1402 asynchronous dispatch controller (ADC) and shader processor input controller
1403 (SPI).
1404
1405 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1406 mode driver to initialize and register the AQL queue with CP.
1407
1408 To dispatch a kernel the following actions are performed. This can occur in the
1409 CPU host program, or from an HSA kernel executing on a GPU.
1410
1411 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1412    executed is obtained.
1413 2. A pointer to the kernel descriptor (see
1414    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1415    obtained. It must be for a kernel that is contained in a code object that
1416    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1417    associated.
1418 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1419    for a memory region with the kernarg property for the kernel agent that will
1420    execute the kernel. It must be at least 16 byte aligned.
1421 4. Kernel argument values are assigned to the kernel argument memory
1422    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1423    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1424    memory in the same way constant memory is accessed. (Note that the HSA
1425    specification allows an implementation to copy the kernel argument contents to
1426    another location that is accessed by the kernel.)
1427 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1428    API uses 64 bit atomic operations to reserve space in the AQL queue for the
1429    packet. The packet must be set up, and the final write must use an atomic
1430    store release to set the packet kind to ensure the packet contents are
1431    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1432    notify the kernel agent that the AQL queue has been updated. These rules, and
1433    the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
1434    System Architecture Specification* [HSA]_.
1435 6. A kernel dispatch packet includes information about the actual dispatch,
1436    such as grid and work-group size, together with information from the code
1437    object about the kernel, such as segment sizes. The ROCm runtime queries on
1438    the kernel symbol can be used to obtain the code object values which are
1439    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1440 7. CP executes micro-code and is responsible for detecting and setting up the
1441    GPU to execute the wavefronts of a kernel dispatch.
1442 8. CP ensures that when a wavefront starts executing the kernel machine
1443    code, the scalar general purpose registers (SGPR) and vector general purpose
1444    registers (VGPR) are set up as required by the machine code. The required
1445    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1446    register state is defined in
1447    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1448 9. The prolog of the kernel machine code (see
1449    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1450    before continuing executing the machine code that corresponds to the kernel.
1451 10. When the kernel dispatch has completed execution, CP signals the completion
1452     signal specified in the kernel dispatch packet if not 0.
1453
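The ordering requirement in step 5 (reserve a slot, fill in the packet, write
the packet kind last, then ring the doorbell) can be illustrated with a toy
model. This is only a sketch in Python under stated assumptions: the class
``ToyAqlQueue``, its field names, and the value written to the doorbell are
hypothetical and do not reflect the ROCm runtime API or the real packet layout.
A lock stands in for the 64 bit atomic operation, and the queue-full check is
omitted.

```python
import threading

PACKET_INVALID = "invalid"                  # slot not yet published
PACKET_KERNEL_DISPATCH = "kernel_dispatch"  # slot holds a dispatch packet


class ToyAqlQueue:
    """Single-process model of an AQL ring buffer (not the ROCm API)."""

    def __init__(self, size):
        self.slots = [{"kind": PACKET_INVALID} for _ in range(size)]
        self.write_index = 0
        self.doorbell = -1
        self._lock = threading.Lock()  # stands in for the 64 bit atomic add

    def reserve(self):
        # Atomically claim the next packet index.
        with self._lock:
            index = self.write_index
            self.write_index += 1
        return index

    def publish(self, index, body):
        slot = self.slots[index % len(self.slots)]
        # Fill in every field except the packet kind first.
        slot.update(body)
        # The final write sets the packet kind. On real hardware this must be
        # an atomic store release so the body is visible before the kind.
        slot["kind"] = PACKET_KERNEL_DISPATCH
        # Notify the agent that the queue has been updated (assumed value).
        self.doorbell = index
```

A consumer that sees a slot whose kind is still ``PACKET_INVALID`` must treat
the packet as not yet ready, which is why the kind is written last.
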
1454 .. _amdgpu-amdhsa-memory-spaces:
1455
1456 Memory Spaces
1457 ~~~~~~~~~~~~~
1458
1459 The memory space properties are:
1460
1461   .. table:: AMDHSA Memory Spaces
1462      :name: amdgpu-amdhsa-memory-spaces-table
1463
1464      ================= =========== ======== ======= ==================
1465      Memory Space Name HSA Segment Hardware Address NULL Value
1466                        Name        Name     Size
1467      ================= =========== ======== ======= ==================
1468      Private           private     scratch  32      0x00000000
1469      Local             group       LDS      32      0xFFFFFFFF
1470      Global            global      global   64      0x0000000000000000
1471      Constant          constant    *same as 64      0x0000000000000000
1472                                    global*
1473      Generic           flat        flat     64      0x0000000000000000
1474      Region            N/A         GDS      32      *not implemented
1475                                                     for AMDHSA*
1476      ================= =========== ======== ======= ==================
1477
1478 The global and constant memory spaces both use global virtual addresses, which
1479 are the same virtual address space used by the CPU. However, some virtual
1480 addresses may only be accessible to the CPU, some only accessible by the GPU,
1481 and some by both.
1482
1483 Using the constant memory space indicates that the data will not change during
1484 the execution of the kernel. This allows scalar read instructions to be
1485 used. The vector and scalar L1 caches are invalidated of volatile data before
1486 each kernel dispatch execution to allow constant memory to change values between
1487 kernel dispatches.
1488
1489 The local memory space uses the hardware Local Data Store (LDS) which is
1490 automatically allocated when the hardware creates work-groups of wavefronts, and
1491 freed when all the wavefronts of a work-group have terminated. The data store
1492 (DS) instructions can be used to access it.
1493
1494 The private memory space uses the hardware scratch memory support. If the kernel
1495 uses scratch, then the hardware allocates memory that is accessed using
1496 wavefront lane dword (4 byte) interleaving. The mapping used from private
1497 address to physical address is:
1498
1499   ``wavefront-scratch-base +
1500   (private-address * wavefront-size * 4) +
1501   (wavefront-lane-id * 4)``
1502
1503 There are different ways that the wavefront scratch base address is determined
1504 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1505 memory can be accessed in an interleaved manner using buffer instructions with
1506 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1507 instructions, or by flat instructions. If each lane of a wavefront accesses the
1508 same private address, the interleaving results in adjacent dwords being accessed
1509 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1510 supported except by flat and scratch instructions in GFX9.
1511
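As a small illustration of the interleaving, the mapping expression above can
be transcribed directly. This is a sketch only: the function name
``scratch_physical_address`` is hypothetical, the default wavefront size of 64
is an assumption, and the expression is applied exactly as written without
settling the unit of the private address.

```python
def scratch_physical_address(scratch_base, private_address, lane_id,
                             wavefront_size=64):
    """Literal transcription of the private-to-physical mapping above."""
    return (scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)


# Lanes 0..3 accessing the same private address land on adjacent dwords.
lane_addresses = [scratch_physical_address(0x1000, 0, lane)
                  for lane in range(4)]
```

With a base of ``0x1000`` the four lanes above map to addresses 4 bytes apart,
which is the adjacency the paragraph above describes.
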
1512 The generic address space uses the hardware flat address support available in
1513 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1514 local apertures), that are outside the range of addressable global memory, to
1515 map from a flat address to a private or local address.
1516
1517 FLAT instructions can take a flat address and access global, private (scratch)
1518 and group (LDS) memory depending on whether the address is within one of the
1519 aperture ranges. Flat access to scratch requires hardware aperture setup and
1520 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1521 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1522 (see :ref:`amdgpu-amdhsa-m0`).
1523
1524 To convert between a segment address and a flat address the base address of the
1525 apertures can be used. For GFX7-GFX8 these are available in the
1526 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1527 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1528 GFX9 the aperture base addresses are directly available as inline constant
1529 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1530 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
1531 which makes it easier to convert from flat to segment or segment to flat.
1532
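Because an aperture is 2^32 bytes and its base is aligned to 2^32, the
conversion reduces to bit operations on the upper and lower 32 bits. The
following is a minimal sketch. The function names and the example base value
are hypothetical, and NULL conversion (the local NULL value is ``0xFFFFFFFF``
while the flat NULL value is 0) is deliberately not handled.

```python
APERTURE_SIZE = 1 << 32  # 2^32 bytes in 64 bit address mode


def in_aperture(aperture_base, flat_address):
    """True if the flat address falls inside the aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(aperture_base, segment_address):
    """Place a 32 bit segment address inside the aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address from a flat address."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


# Made-up base for illustration only; real bases come from the queue or
# from the inline constant registers described above.
EXAMPLE_SHARED_BASE = 0x0001000000000000
```

For example, ``segment_to_flat(EXAMPLE_SHARED_BASE, 0x40)`` sets only the low
32 bits of the base, and ``flat_to_segment`` masks them back out.
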
1533 Image and Samplers
1534 ~~~~~~~~~~~~~~~~~~
1535
1536 Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
1537 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1538 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1539 enumeration values for the queries that are not trivially deducible from the S#
1540 representation.
1541
1542 HSA Signals
1543 ~~~~~~~~~~~
1544
1545 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1546 structure allocated in memory accessible from both the CPU and GPU. The
1547 structure is defined by the ROCm runtime and subject to change between releases
1548 (see [AMD-ROCm-github]_).
1549
1550 .. _amdgpu-amdhsa-hsa-aql-queue:
1551
1552 HSA AQL Queue
1553 ~~~~~~~~~~~~~
1554
1555 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1556 between releases (see [AMD-ROCm-github]_). For some processors it contains
1557 fields needed to implement certain language features such as the flat address
1558 aperture bases. It also contains fields used by CP, such as for managing the
1559 allocation of scratch memory.
1560
1561 .. _amdgpu-amdhsa-kernel-descriptor:
1562
1563 Kernel Descriptor
1564 ~~~~~~~~~~~~~~~~~
1565
1566 A kernel descriptor consists of the information needed by CP to initiate the
1567 execution of a kernel, including the entry point address of the machine code
1568 that implements the kernel.
1569
1570 Kernel Descriptor for GFX6-GFX9
1571 +++++++++++++++++++++++++++++++
1572
1573 CP microcode requires the kernel descriptor to be allocated on 64 byte alignment.
1574
1575   .. table:: Kernel Descriptor for GFX6-GFX9
1576      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1577
1578      ======= ======= =============================== ============================
1579      Bits    Size    Field Name                      Description
1580      ======= ======= =============================== ============================
1581      31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1582                                                      address space memory
1583                                                      required for a work-group
1584                                                      in bytes. This does not
1585                                                      include any dynamically
1586                                                      allocated local address
1587                                                      space memory that may be
1588                                                      added when the kernel is
1589                                                      dispatched.
1590      63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1591                                                      private address space
1592                                                      memory required for a
1593                                                      work-item in bytes. If
1594                                                      is_dynamic_callstack is 1
1595                                                      then additional space must
1596                                                      be added to this value for
1597                                                      the call stack.
1598      127:64  8 bytes                                 Reserved, must be 0.
1599      191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1600                                                      negative) from base
1601                                                      address of kernel
1602                                                      descriptor to kernel's
1603                                                      entry point instruction
1604                                                      which must be 256 byte
1605                                                      aligned.
1606      383:192 24                                      Reserved, must be 0.
1607              bytes
1608      415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1609                                                      program settings used by
1610                                                      CP to set up
1611                                                      ``COMPUTE_PGM_RSRC1``
1612                                                      configuration
1613                                                      register. See
1614                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1615      447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1616                                                      program settings used by
1617                                                      CP to set up
1618                                                      ``COMPUTE_PGM_RSRC2``
1619                                                      configuration
1620                                                      register. See
1621                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1622      448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1623                                                      SGPR user data registers
1624                                                      (see
1625                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1626
1627                                                      The total number of SGPR
1628                                                      user data registers
1629                                                      requested must not exceed
1630                                                      16 and must match the value in
1631                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1632                                                      Any requests beyond 16
1633                                                      will be ignored.
1634      449     1 bit   EnableSGPRDispatchPtr           *see above*
1635      450     1 bit   EnableSGPRQueuePtr              *see above*
1636      451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1637      452     1 bit   EnableSGPRDispatchID            *see above*
1638      453     1 bit   EnableSGPRFlatScratchInit       *see above*
1639      454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1640      455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1641                                                      should always be 0.
1642      456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1643                                                      should always be 0.
1644      457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1645                                                      should always be 0.
1646      463:458 6 bits                                  Reserved, must be 0.
1647      511:464 6                                       Reserved, must be 0.
1648              bytes
1649      512     **Total size 64 bytes.**
1650      ======= ====================================================================
1651
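The byte layout in the kernel descriptor table above can be sketched with
Python's ``struct`` module. This is an illustration only: the function name
``pack_kernel_descriptor`` and the dictionary of bit names are hypothetical,
little-endian byte order is an assumption, and no validation of the field
values is performed.

```python
import struct

# Bit positions within the 16 bit field that occupies bits 463:448.
ENABLE_SGPR_BITS = {
    "PrivateSegmentBuffer": 0,
    "DispatchPtr": 1,
    "QueuePtr": 2,
    "KernargSegmentPtr": 3,
    "DispatchID": 4,
    "FlatScratchInit": 5,
    "PrivateSegmentSize": 6,
}

# Assumed little-endian: two u32, 8 reserved bytes, one signed i64,
# 24 reserved bytes, two u32, one u16 of enable bits, 6 reserved bytes.
KERNEL_DESCRIPTOR = struct.Struct("<II8xq24xIIH6x")


def pack_kernel_descriptor(group_size, private_size, entry_offset,
                           rsrc1, rsrc2, enabled=()):
    """Pack the fields into a 64 byte descriptor; reserved bytes are 0."""
    flags = 0
    for name in enabled:
        flags |= 1 << ENABLE_SGPR_BITS[name]
    return KERNEL_DESCRIPTOR.pack(group_size, private_size, entry_offset,
                                  rsrc1, rsrc2, flags)
```

The format string adds up to 64 bytes, matching the total size in the table,
and the pad codes leave every reserved field as 0.
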
1652 ..
1653
1654   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1655      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1656
1657      ======= ======= =============================== ===========================================================================
1658      Bits    Size    Field Name                      Description
1659      ======= ======= =============================== ===========================================================================
1660      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1661                                                      used by each work-item,
1662                                                      granularity is device
1663                                                      specific:
1664
1665                                                      GFX6-GFX9
1666                                                        - max_vgpr 1..256
1667                                                        - roundup((max_vgpr + 1)
1668                                                          / 4) - 1
1669
1670                                                      Used by CP to set up
1671                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1672      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1673                                                      used by a wavefront,
1674                                                      granularity is device
1675                                                      specific:
1676
1677                                                      GFX6-GFX8
1678                                                        - max_sgpr 1..112
1679                                                        - roundup((max_sgpr + 1)
1680                                                          / 8) - 1
1681                                                      GFX9
1682                                                        - max_sgpr 1..112
1683                                                        - roundup((max_sgpr + 1)
1684                                                          / 16) - 1
1685
1686                                                      Includes the special SGPRs
1687                                                      for VCC, Flat Scratch (for
1688                                                      GFX7 onwards) and XNACK
1689                                                      (for GFX8 onwards). It does
1690                                                      not include the 16 SGPR
1691                                                      added if a trap handler is
1692                                                      enabled.
1693
1694                                                      Used by CP to set up
1695                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1696      11:10   2 bits  PRIORITY                        Must be 0.
1697
1698                                                      Start executing wavefront
1699                                                      at the specified priority.
1700
1701                                                      CP is responsible for
1702                                                      filling in
1703                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1704      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1705                                                      with specified rounding
1706                                                      mode for single (32
1707                                                      bit) floating point
1708                                                      precision
1709                                                      operations.
1710
1711                                                      Floating point rounding
1712                                                      mode values are defined in
1713                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1714
1715                                                      Used by CP to set up
1716                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1717      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1718                                                      with specified rounding
1719                                                      mode for half/double (16
1720                                                      and 64 bit) floating point
1721                                                      precision
1722                                                      operations.
1723
1724                                                      Floating point rounding
1725                                                      mode values are defined in
1726                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1727
1728                                                      Used by CP to set up
1729                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1730      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1731                                                      with specified denorm mode
1732                                                      for single (32
1733                                                      bit) floating point
1734                                                      precision
1735                                                      operations.
1736
1737                                                      Floating point denorm mode
1738                                                      values are defined in
1739                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1740
1741                                                      Used by CP to set up
1742                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1743      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1744                                                      with specified denorm mode
1745                                                      for half/double (16
1746                                                      and 64 bit) floating point
1747                                                      precision
1748                                                      operations.
1749
1750                                                      Floating point denorm mode
1751                                                      values are defined in
1752                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1753
1754                                                      Used by CP to set up
1755                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1756      20      1 bit   PRIV                            Must be 0.
1757
1758                                                      Start executing wavefront
1759                                                      in privilege trap handler
1760                                                      mode.
1761
1762                                                      CP is responsible for
1763                                                      filling in
1764                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1765      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1766                                                      with DX10 clamp mode
1767                                                      enabled. Used by the vector
1768                                                      ALU to force DX10 style
1769                                                      treatment of NaNs (when
1770                                                      set, clamp NaN to zero,
1771                                                      otherwise pass NaN
1772                                                      through).
1773
1774                                                      Used by CP to set up
1775                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1776      22      1 bit   DEBUG_MODE                      Must be 0.
1777
1778                                                      Start executing wavefront
1779                                                      in single step mode.
1780
1781                                                      CP is responsible for
1782                                                      filling in
1783                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1784      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1785                                                      with IEEE mode
1786                                                      enabled. Floating point
1787                                                      opcodes that support
1788                                                      exception flag gathering
1789                                                      will quiet and propagate
1790                                                      signaling-NaN inputs per
1791                                                      IEEE 754-2008. Min_dx10 and
1792                                                      max_dx10 become IEEE
1793                                                      754-2008 compliant due to
1794                                                      signaling-NaN propagation
1795                                                      and quieting.
1796
1797                                                      Used by CP to set up
1798                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1799      24      1 bit   BULKY                           Must be 0.
1800
1801                                                      Only one work-group allowed
1802                                                      to execute on a compute
1803                                                      unit.
1804
1805                                                      CP is responsible for
1806                                                      filling in
1807                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1808      25      1 bit   CDBG_USER                       Must be 0.
1809
1810                                                      Flag that can be used to
1811                                                      control debugging code.
1812
1813                                                      CP is responsible for
1814                                                      filling in
1815                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1816      26      1 bit   FP16_OVFL                       GFX6-GFX8
1817                                                        Reserved, must be 0.
1818                                                      GFX9
1819                                                        Wavefront starts execution
1820                                                        with specified fp16 overflow
1821                                                        mode.
1822
1823                                                        - If 0, fp16 overflow generates
1824                                                          +/-INF values.
1825                                                        - If 1, fp16 overflow that is the
1826                                                          result of a +/-INF input value
1827                                                          or divide by 0 produces a +/-INF,
1828                                                          otherwise clamps computed
1829                                                          overflow to +/-MAX_FP16 as
1830                                                          appropriate.
1831
1832                                                        Used by CP to set up
1833                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1834      31:27   5 bits                                  Reserved, must be 0.
1835      32      **Total size 4 bytes.**
1836      ======= ===================================================================================================================
1837
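The bit positions in the ``compute_pgm_rsrc1`` table above can be turned into a
small packing helper. The following is a minimal sketch, not code taken from
LLVM or the runtime: the names ``RSRC1_FIELDS``, ``pack_rsrc1`` and
``unpack_rsrc1`` are hypothetical, and the field at bits 5:0 is assumed to be
``GRANULATED_WORKITEM_VGPR_COUNT``. Fields that the table marks as "Must be 0"
are rejected when given a non-zero value.

```python
# Hypothetical helper: (low bit, width) for each compute_pgm_rsrc1 field,
# transcribed from the table above. Bits 31:27 are reserved and stay 0.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (0, 6),   # assumed name for bits 5:0
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
    "PRIORITY": (10, 2),
    "FLOAT_ROUND_MODE_32": (12, 2),
    "FLOAT_ROUND_MODE_16_64": (14, 2),
    "FLOAT_DENORM_MODE_32": (16, 2),
    "FLOAT_DENORM_MODE_16_64": (18, 2),
    "PRIV": (20, 1),
    "ENABLE_DX10_CLAMP": (21, 1),
    "DEBUG_MODE": (22, 1),
    "ENABLE_IEEE_MODE": (23, 1),
    "BULKY": (24, 1),
    "CDBG_USER": (25, 1),
    "FP16_OVFL": (26, 1),
}

# Fields the table marks "Must be 0" because CP fills them in.
MUST_BE_ZERO = {"PRIORITY", "PRIV", "DEBUG_MODE", "BULKY", "CDBG_USER"}

# Enumeration values from the rounding and denorm mode tables in this section.
AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN = 0
AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE = 3


def pack_rsrc1(**fields):
    """Pack named field values into a 32-bit compute_pgm_rsrc1 word."""
    word = 0
    for name, value in fields.items():
        low, width = RSRC1_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        if name in MUST_BE_ZERO and value != 0:
            raise ValueError(f"{name} must be 0")
        word |= value << low
    return word


def unpack_rsrc1(word):
    """Split a 32-bit compute_pgm_rsrc1 word back into its named fields."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in RSRC1_FIELDS.items()
    }


example = pack_rsrc1(
    GRANULATED_WORKITEM_VGPR_COUNT=5,
    GRANULATED_WAVEFRONT_SGPR_COUNT=2,
    FLOAT_ROUND_MODE_32=AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN,
    FLOAT_DENORM_MODE_16_64=AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE,
    ENABLE_DX10_CLAMP=1,
    ENABLE_IEEE_MODE=1,
)
print(hex(example))  # 0xac0085
```

Keeping the layout in one dictionary means packing and unpacking share the same
bit positions, so a mistake in one position shows up as a round-trip mismatch.
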
1838 ..
1839
1840   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1841      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1842
1843      ======= ======= =============================== ===========================================================================
1844      Bits    Size    Field Name                      Description
1845      ======= ======= =============================== ===========================================================================
1846      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1847                      _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
1848                                                      system register (see
1849                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851                                                      Used by CP to set up
1852                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1853      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1854                                                      user data registers
1855                                                      requested. This number must
1856                                                      match the number of user
1857                                                      data registers enabled.
1858
1859                                                      Used by CP to set up
1860                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1861      6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1862                                                      TRAP instruction which
1863                                                      requires a trap handler to
1864                                                      be enabled.
1865
1866                                                      CP sets
1867                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1868                                                      if the runtime has
1869                                                      installed a trap handler
1870                                                      regardless of the setting
1871                                                      of this field.
1872      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1873                                                      system SGPR register for
1874                                                      the work-group id in the X
1875                                                      dimension (see
1876                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1877
1878                                                      Used by CP to set up
1879                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1880      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1881                                                      system SGPR register for
1882                                                      the work-group id in the Y
1883                                                      dimension (see
1884                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1885
1886                                                      Used by CP to set up
1887                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1888      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1889                                                      system SGPR register for
1890                                                      the work-group id in the Z
1891                                                      dimension (see
1892                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1893
1894                                                      Used by CP to set up
1895                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1896      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1897                                                      system SGPR register for
1898                                                      work-group information (see
1899                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1900
1901                                                      Used by CP to set up
1902                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1903      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1904                                                      VGPR system registers used
1905                                                      for the work-item ID.
1906                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1907                                                      defines the values.
1908
1909                                                      Used by CP to set up
1910                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1911      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1912
1913                                                      Wavefront starts execution
1914                                                      with address watch
1915                                                      exceptions enabled which
1916                                                      are generated when L1 has
1917                                                      witnessed a thread access
1918                                                      an *address of
1919                                                      interest*.
1920
1921                                                      CP is responsible for
1922                                                      filling in the address
1923                                                      watch bit in
1924                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1925                                                      according to what the
1926                                                      runtime requests.
1927      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1928
1929                                                      Wavefront starts execution
1930                                                      with memory violation
1931                                                      exceptions
1932                                                      enabled which are generated
1933                                                      when a memory violation has
1934                                                      occurred for this wavefront from
1935                                                      L1 or LDS
1936                                                      (write-to-read-only-memory,
1937                                                      mis-aligned atomic, LDS
1938                                                      address out of range,
1939                                                      illegal address, etc.).
1940
1941                                                      CP sets the memory
1942                                                      violation bit in
1943                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1944                                                      according to what the
1945                                                      runtime requests.
1946      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1947
1948                                                      CP uses the rounded value
1949                                                      from the dispatch packet,
1950                                                      not this value, as the
1951                                                      dispatch may contain
1952                                                      dynamically allocated group
1953                                                      segment memory. CP writes
1954                                                      directly to
1955                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1956
1957                                                      Amount of group segment
1958                                                      (LDS) to allocate for each
1959                                                      work-group. Granularity is
1960                                                      device specific:
1961
1962                                                      GFX6:
1963                                                        roundup(lds-size / (64 * 4))
1964                                                      GFX7-GFX9:
1965                                                        roundup(lds-size / (128 * 4))
1966
1967      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1968                      _INVALID_OPERATION              with specified exceptions
1969                                                      enabled.
1970
1971                                                      Used by CP to set up
1972                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1973                                                      (set from bits 0..6).
1974
1975                                                      IEEE 754 FP Invalid
1976                                                      Operation
1977      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
1978                      _SOURCE                         input operands is a
1979                                                      denormal number
1980      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1981                      _DIVISION_BY_ZERO               Zero
1982      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
1983                      _OVERFLOW
1984      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1985                      _UNDERFLOW
1986      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1987                      _INEXACT
1988      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1989                      _ZERO                           (rcp_iflag_f32 instruction
1990                                                      only)
1991      31      1 bit                                   Reserved, must be 0.
1992      32      **Total size 4 bytes.**
1993      ======= ===================================================================================================================
1994
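As an illustration of the ``compute_pgm_rsrc2`` layout above, the sketch below
decodes a 32-bit word and computes the ``GRANULATED_LDS_SIZE`` granule count.
It is a hedged example only: ``RSRC2_FLAGS``, ``decode_rsrc2`` and
``granulated_lds_size`` are hypothetical names, ``lds-size`` is assumed to be a
byte count, and ``roundup`` is assumed to mean rounding up to the next integer.

```python
# Single-bit fields of compute_pgm_rsrc2 keyed by bit position, transcribed
# from the table above. Bit 31 is reserved.
RSRC2_FLAGS = {
    0: "ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET",
    6: "ENABLE_TRAP_HANDLER",
    7: "ENABLE_SGPR_WORKGROUP_ID_X",
    8: "ENABLE_SGPR_WORKGROUP_ID_Y",
    9: "ENABLE_SGPR_WORKGROUP_ID_Z",
    10: "ENABLE_SGPR_WORKGROUP_INFO",
    13: "ENABLE_EXCEPTION_ADDRESS_WATCH",
    14: "ENABLE_EXCEPTION_MEMORY",
    24: "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION",
    25: "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE",
    26: "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO",
    27: "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW",
    28: "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW",
    29: "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT",
    30: "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO",
}


def decode_rsrc2(word):
    """Return the multi-bit fields and the names of all set single-bit fields."""
    return {
        "USER_SGPR_COUNT": (word >> 1) & 0x1F,          # bits 5:1
        "ENABLE_VGPR_WORKITEM_ID": (word >> 11) & 0x3,  # bits 12:11
        "GRANULATED_LDS_SIZE": (word >> 15) & 0x1FF,    # bits 23:15
        "flags": [
            name for bit, name in sorted(RSRC2_FLAGS.items()) if word & (1 << bit)
        ],
    }


def granulated_lds_size(lds_size_bytes, gfx_level):
    """Round an LDS size up to whole granules (GFX6: 64*4, GFX7-GFX9: 128*4)."""
    if gfx_level not in (6, 7, 8, 9):
        raise ValueError("the table only covers GFX6-GFX9")
    granule = 64 * 4 if gfx_level == 6 else 128 * 4
    return -(-lds_size_bytes // granule)  # ceiling division


decoded = decode_rsrc2(0x1091)
print(decoded["USER_SGPR_COUNT"], decoded["ENABLE_VGPR_WORKITEM_ID"])  # 8 2
print(decoded["flags"])
print(granulated_lds_size(1000, 6), granulated_lds_size(1000, 9))  # 4 2
```

Note that the kernel descriptor itself must leave ``GRANULATED_LDS_SIZE`` at 0;
the granule calculation only illustrates the rounding that the table describes.
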
1995 ..
1996
1997   .. table:: Floating Point Rounding Mode Enumeration Values
1998      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1999
2000      ====================================== ===== ==============================
2001      Enumeration Name                       Value Description
2002      ====================================== ===== ==============================
2003      AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
2004      AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
2005      AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
2006      AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
2007      ====================================== ===== ==============================
2008
2009 ..
2010
2011   .. table:: Floating Point Denorm Mode Enumeration Values
2012      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2013
2014      ====================================== ===== ==============================
2015      Enumeration Name                       Value Description
2016      ====================================== ===== ==============================
2017      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
2018                                                   Denorms
2019      AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
2020      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
2021      AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
2022      ====================================== ===== ==============================
2023
2024 ..
2025
2026   .. table:: System VGPR Work-Item ID Enumeration Values
2027      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2028
2029      ======================================== ===== ============================
2030      Enumeration Name                         Value Description
2031      ======================================== ===== ============================
2032      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
2033                                                     ID.
2034      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
2035                                                     dimensions ID.
2036      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
2037                                                     dimensions ID.
2038      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
2039      ======================================== ===== ============================
2040
2041 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2042
2043 Initial Kernel Execution State
2044 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2045
2046 This section defines the register state that will be set up by the packet
2047 processor prior to the start of execution of every wavefront. This is limited by
2048 the constraints of the hardware controllers of CP/ADC/SPI.
2049
2050 The order of the SGPR registers is defined, but the compiler can specify which
2051 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
2052 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2053 for enabled registers are dense starting at SGPR0: the first enabled register is
2054 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2055 an SGPR number.
2056
2057 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
2058 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2059 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2060 initialized. These are then immediately followed by the System SGPRs that are
2061 set up by ADC/SPI and can have different values for each wavefront of the grid
2062 dispatch.
2063
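The dense numbering rule described above can be sketched as follows. This is an
illustrative model only: ``USER_SGPR_ORDER`` and ``assign_user_sgprs`` are
hypothetical names, and only the first six rows of the SGPR Register Set Up
Order table are modelled, with the SGPR counts taken from that table.

```python
# (kernel descriptor enable field, number of SGPRs), in set-up order.
# Only the first six rows of the set-up order table are modelled here.
USER_SGPR_ORDER = [
    ("enable_sgpr_private_segment_buffer", 4),
    ("enable_sgpr_dispatch_ptr", 2),
    ("enable_sgpr_queue_ptr", 2),
    ("enable_sgpr_kernarg_segment_ptr", 2),
    ("enable_sgpr_dispatch_id", 2),
    ("enable_sgpr_flat_scratch_init", 2),
]

# At most this many User SGPRs are actually initialized.
MAX_INITIALIZED_USER_SGPRS = 16


def assign_user_sgprs(enabled):
    """Assign dense SGPR numbers, starting at SGPR0, to the enabled fields.

    Returns (layout, requested, initialized): layout maps each enabled field
    to its SGPR numbers; disabled fields get no SGPR number at all.
    """
    layout = {}
    next_sgpr = 0
    for field, count in USER_SGPR_ORDER:
        if field in enabled:
            layout[field] = list(range(next_sgpr, next_sgpr + count))
            next_sgpr += count
    initialized = min(next_sgpr, MAX_INITIALIZED_USER_SGPRS)
    return layout, next_sgpr, initialized


layout, requested, initialized = assign_user_sgprs({
    "enable_sgpr_private_segment_buffer",
    "enable_sgpr_kernarg_segment_ptr",
    "enable_sgpr_flat_scratch_init",
})
print(layout["enable_sgpr_kernarg_segment_ptr"])  # [4, 5]
print(requested, initialized)  # 8 8
```

With only these six rows the total never exceeds 14, so the 16-register cap
becomes relevant only when rows not covered by this sketch are also enabled.
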
2064 SGPR register initial state is defined in
2065 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2066
2067   .. table:: SGPR Register Set Up Order
2068      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2069
2070      ========== ========================== ====== ==============================
2071      SGPR Order Name                       Number Description
2072                 (kernel descriptor enable  of
2073                 field)                     SGPRs
2074      ========== ========================== ====== ==============================
2075      First      Private Segment Buffer     4      V# that can be used, together
2076                 (enable_sgpr_private              with Scratch Wavefront Offset
2077                 _segment_buffer)                  as an offset, to access the
2078                                                   private memory space using a
2079                                                   segment address.
2080
2081                                                   CP uses the value provided by
2082                                                   the runtime.
2083      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2084                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2085                                                   actually executing.
2086      then       Queue Ptr                  2      64 bit address of amd_queue_t
2087                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2088                                                   the dispatch packet was
2089                                                   queued.
2090      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2091                 (enable_sgpr_kernarg              segment. This is directly
2092                 _segment_ptr)                     copied from the
2093                                                   kernarg_address in the kernel
2094                                                   dispatch packet.
2095
2096                                                   Having CP load it once avoids
2097                                                   loading it at the beginning of
2098                                                   every wavefront.
2099      then       Dispatch Id                2      64 bit Dispatch ID of the
2100                 (enable_sgpr_dispatch_id)         dispatch packet being
2101                                                   executed.
2102      then       Flat Scratch Init          2      This is 2 SGPRs:
2103                 (enable_sgpr_flat_scratch
2104                 _init)                            GFX6
2105                                                     Not supported.
2106                                                   GFX7-GFX8
2107                                                     The first SGPR is a 32 bit
2108                                                     byte offset from
2109                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2110                                                     to per SPI base of memory
2111                                                     for scratch for the queue
2112                                                     executing the kernel
2113                                                     dispatch. CP obtains this
2114                                                     from the runtime. (The
2115                                                     Scratch Segment Buffer base
2116                                                     address is
2117                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2118                                                     plus this offset.) The value
2119                                                     of Scratch Wavefront Offset must
2120                                                     be added to this offset by
2121                                                     the kernel machine code,
2122                                                     right shifted by 8, and
2123                                                     moved to the FLAT_SCRATCH_HI
2124                                                     SGPR register.
2125                                                     FLAT_SCRATCH_HI corresponds
2126                                                     to SGPRn-4 on GFX7, and
2127                                                     SGPRn-6 on GFX8 (where SGPRn
2128                                                     is the highest numbered SGPR
2129                                                     allocated to the wavefront).
2130                                                     FLAT_SCRATCH_HI is
2131                                                     multiplied by 256 (as it is
2132                                                     in units of 256 bytes) and
2133                                                     added to
2134                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2135                                                     to calculate the per wavefront
2136                                                     FLAT SCRATCH BASE in flat
2137                                                     memory instructions that
2138                                                     access the scratch
2139                                                     aperture.
2140
2141                                                     The second SGPR is the 32 bit
2142                                                     byte size of a single
2143                                                     work-item's scratch memory
2144                                                     usage. CP obtains this from
2145                                                     the runtime, and it is
2146                                                     always a multiple of DWORD.
2147                                                     CP checks that the value in
2148                                                     the kernel dispatch packet
2149                                                     Private Segment Byte Size is
2150                                                     not larger, and requests the
2151                                                     runtime to increase the
2152                                                     queue's scratch size if
2153                                                     necessary. The kernel code
2154                                                     must move it to
2155                                                     FLAT_SCRATCH_LO which is
2156                                                     SGPRn-3 on GFX7 and SGPRn-5
2157                                                     on GFX8. FLAT_SCRATCH_LO is
2158                                                     used as the FLAT SCRATCH
2159                                                     SIZE in flat memory
2160                                                     instructions. Having CP load
2161                                                     it once avoids loading it at
2162                                                     the beginning of every
2163                                                     wavefront.
2164                                                   GFX9
2165                                                     This is the
2166                                                     64 bit base address of the
2167                                                     per SPI scratch backing
2168                                                     memory managed by SPI for
2169                                                     the queue executing the
2170                                                     kernel dispatch. CP obtains
2171                                                     this from the runtime (and
2172                                                     divides it if there are
2173                                                     multiple Shader Arrays each
2174                                                     with its own SPI). The value
2175                                                     of Scratch Wavefront Offset must
2176                                                     be added by the kernel
2177                                                     machine code and the result
2178                                                     moved to the FLAT_SCRATCH
2179                                                     SGPR which is SGPRn-6 and
2180                                                     SGPRn-5. It is used as the
2181                                                     FLAT SCRATCH BASE in flat
2182                                                     memory instructions.
2183      then       Private Segment Size       1      The 32 bit byte size of a
2184                 (enable_sgpr_private              single
2185                 _segment_size)                    work-item's
2186                                                   scratch memory
2187                                                   allocation. This is the
2188                                                   value from the kernel
2189                                                   dispatch packet Private
2190                                                   Segment Byte Size rounded up
2191                                                   by CP to a multiple of
2192                                                   DWORD.
2193
2194                                                   Having CP load it once avoids
2195                                                   loading it at the beginning of
2196                                                   every wavefront.
2197
2198                                                   This is not used for
2199                                                   GFX7-GFX8 since it is the same
2200                                                   value as the second SGPR of
2201                                                   Flat Scratch Init. However, it
2202                                                   may be needed for GFX9 which
2203                                                   changes the meaning of the
2204                                                   Flat Scratch Init value.
2205      then       Grid Work-Group Count X    1      32 bit count of the number of
2206                 (enable_sgpr_grid                 work-groups in the X dimension
2207                 _workgroup_count_X)               for the grid being
2208                                                   executed. Computed from the
2209                                                   fields in the kernel dispatch
2210                                                   packet as ((grid_size.x +
2211                                                   workgroup_size.x - 1) /
2212                                                   workgroup_size.x).
2213      then       Grid Work-Group Count Y    1      32 bit count of the number of
2214                 (enable_sgpr_grid                 work-groups in the Y dimension
2215                 _workgroup_count_Y &&             for the grid being
2216                 less than 16 previous             executed. Computed from the
2217                 SGPRs)                            fields in the kernel dispatch
2218                                                   packet as ((grid_size.y +
2219                                                   workgroup_size.y - 1) /
2220                                                   workgroup_size.y).
2221
2222                                                   Only initialized if <16
2223                                                   previous SGPRs initialized.
2224      then       Grid Work-Group Count Z    1      32 bit count of the number of
2225                 (enable_sgpr_grid                 work-groups in the Z dimension
2226                 _workgroup_count_Z &&             for the grid being
2227                 less than 16 previous             executed. Computed from the
2228                 SGPRs)                            fields in the kernel dispatch
2229                                                   packet as ((grid_size.z +
2230                                                   workgroup_size.z - 1) /
2231                                                   workgroup_size.z).
2232
2233                                                   Only initialized if <16
2234                                                   previous SGPRs initialized.
2235      then       Work-Group Id X            1      32 bit work-group id in X
2236                 (enable_sgpr_workgroup_id         dimension of grid for
2237                 _X)                               wavefront.
2238      then       Work-Group Id Y            1      32 bit work-group id in Y
2239                 (enable_sgpr_workgroup_id         dimension of grid for
2240                 _Y)                               wavefront.
2241      then       Work-Group Id Z            1      32 bit work-group id in Z
2242                 (enable_sgpr_workgroup_id         dimension of grid for
2243                 _Z)                               wavefront.
2244      then       Work-Group Info            1      {first_wavefront, 14'b0000,
2245                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2246                 _info)                            threadgroup_size_in_wavefronts[5:0]}
2247      then       Scratch Wavefront Offset   1      32 bit byte offset from the
2248                 (enable_sgpr_private              scratch base of the queue
2249                 _segment_wavefront_offset)        executing the kernel
2250                                                   dispatch. Must be used as an
2251                                                   offset with Private
2252                                                   segment address when using
2253                                                   Scratch Segment Buffer. It
2254                                                   must be used to set up FLAT
2255                                                   SCRATCH for flat addressing
2256                                                   (see
2257                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2258      ========== ========================== ====== ==============================
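
     The Grid Work-Group Count expressions in the table above are integer ceiling
     divisions. A minimal Python sketch, in which the function name
     ``grid_workgroup_count`` is hypothetical and the arguments stand in for one
     dimension of the kernel dispatch packet fields:

     .. code-block:: python

        def grid_workgroup_count(grid_size, workgroup_size):
            """Number of work-groups needed to cover grid_size work-items."""
            if workgroup_size <= 0:
                raise ValueError("workgroup_size must be positive")
            # Integer form of ceil(grid_size / workgroup_size), matching
            # (grid_size + workgroup_size - 1) / workgroup_size in the table.
            return (grid_size + workgroup_size - 1) // workgroup_size

     A grid that is not an exact multiple of the work-group size therefore gets one
     extra, partially filled, work-group in that dimension.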
2259
2260 The order of the VGPR registers is defined, but the compiler can specify which
2261 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
2262 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2263 for enabled registers are dense starting at VGPR0: the first enabled register is
2264 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2265 VGPR number.
2266
2267 VGPR register initial state is defined in
2268 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2269
2270   .. table:: VGPR Register Set Up Order
2271      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2272
2273      ========== ========================== ====== ==============================
2274      VGPR Order Name                       Number Description
2275                 (kernel descriptor enable  of
2276                 field)                     VGPRs
2277      ========== ========================== ====== ==============================
2278      First      Work-Item Id X             1      32 bit work item id in X
2279                 (Always initialized)              dimension of work-group for
2280                                                   wavefront lane.
2281      then       Work-Item Id Y             1      32 bit work item id in Y
2282                 (enable_vgpr_workitem_id          dimension of work-group for
2283                 > 0)                              wavefront lane.
2284      then       Work-Item Id Z             1      32 bit work item id in Z
2285                 (enable_vgpr_workitem_id          dimension of work-group for
2286                 > 1)                              wavefront lane.
2287      ========== ========================== ====== ==============================
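
     The dense numbering of enabled VGPRs can be illustrated with a short Python
     sketch. The helper ``assign_vgprs`` is hypothetical; it takes the
     ``enable_vgpr_workitem_id`` value and returns the register that each enabled
     work-item id would occupy under the rules in the table above:

     .. code-block:: python

        def assign_vgprs(enable_vgpr_workitem_id):
            """Map enabled work-item id registers to dense VGPR numbers."""
            order = [
                ("Work-Item Id X", True),  # always initialized
                ("Work-Item Id Y", enable_vgpr_workitem_id > 0),
                ("Work-Item Id Z", enable_vgpr_workitem_id > 1),
            ]
            assignment = {}
            for name, enabled in order:
                if enabled:
                    # Enabled registers take the next free number;
                    # disabled registers get no VGPR number at all.
                    assignment[name] = "VGPR%d" % len(assignment)
            return assignment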
2288
2289 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
2290
2291 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2292    registers.
2293 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2294    combination including none.
2295 3. Scratch Wavefront Offset is set by SPI on a per wavefront basis, which is why
2296    its value cannot be included in the Flat Scratch Init value, which is per queue.
2297 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2298    or (X, Y, Z).
2299
2300 The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit
2301 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2302
2303 The global segment can be accessed either using buffer instructions (GFX6 which
2304 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2305 instructions (GFX9).
2306
2307 If buffer operations are used then the compiler can generate a V# with the
2308 following properties:
2309
2310 * base address of 0
2311 * no swizzle
2312 * ATC: 1 if IOMMU present (such as APU)
2313 * ptr64: 1
2314 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2315   APU and NC for dGPU).
2316
2317 .. _amdgpu-amdhsa-kernel-prolog:
2318
2319 Kernel Prolog
2320 ~~~~~~~~~~~~~
2321
2322 .. _amdgpu-amdhsa-m0:
2323
2324 M0
2325 ++
2326
2327 GFX6-GFX8
2328   The M0 register must be initialized with a value at least the total LDS size
2329   if the kernel may access LDS via DS or flat operations. Total LDS size is
2330   available in the dispatch packet. Alternatively, M0 can be set to the maximum
2331   possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF for
2332   GFX7-GFX8).
2333 GFX9
2334   The M0 register is not used for range checking LDS accesses and so does not
2335   need to be initialized in the prolog.
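
     The choice of M0 value can be summarized in a small Python sketch. The names
     ``MAX_LDS_M0`` and ``m0_prolog_value`` are hypothetical, and the sketch takes
     the simple option of always using the per-target maximum rather than the
     total LDS size from the dispatch packet:

     .. code-block:: python

        # Maximum LDS values quoted above for targets that range check via M0.
        MAX_LDS_M0 = {"GFX6": 0x7FFF, "GFX7": 0xFFFF, "GFX8": 0xFFFF}

        def m0_prolog_value(target, may_access_lds):
            """Return the value to place in M0, or None if none is needed."""
            if target == "GFX9" or not may_access_lds:
                # GFX9 does not use M0 for LDS range checking.
                return None
            return MAX_LDS_M0[target]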
2336
2337 .. _amdgpu-amdhsa-flat-scratch:
2338
2339 Flat Scratch
2340 ++++++++++++
2341
2342 If the kernel may use flat operations to access scratch memory, the prolog code
2343 must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2344 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2345 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2346
2347 GFX6
2348   Flat scratch is not supported.
2349
2350 GFX7-GFX8
2351   1. The low word of Flat Scratch Init is the 32 bit byte offset from
2352      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2353      being managed by SPI for the queue executing the kernel dispatch. This is
2354      the same value used in the Scratch Segment Buffer V# base address. The
2355      prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2356      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2357      FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted
2358      by 8 before moving into FLAT_SCRATCH_HI.
2359   2. The second word of Flat Scratch Init is the 32 bit byte size of a single
2360      work-item's scratch memory usage. This is directly loaded from the kernel
2361      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2362      DWORD. Having CP load it once avoids loading it at the beginning of every
2363      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2364      SIZE.
2365
2366 GFX9
2367   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2368   memory being managed by SPI for the queue executing the kernel dispatch. The
2369   prolog must add the value of Scratch Wavefront Offset and move the result to the
2370   FLAT_SCRATCH pair for use as the flat scratch base in flat memory instructions.
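
     The arithmetic performed by the prolog can be sketched in Python. This models
     only the computation, not the instructions that a real prolog would emit; the
     function names, the argument names and the returned tuple layout are
     assumptions made for illustration:

     .. code-block:: python

        MASK32 = 0xFFFFFFFF
        MASK64 = 0xFFFFFFFFFFFFFFFF

        def flat_scratch_gfx7_gfx8(init_lo, init_hi, wave_offset):
            """Return (base in 256 byte units, per work-item size in bytes)."""
            # Byte offset of this wavefront's scratch backing memory from
            # SH_HIDDEN_PRIVATE_BASE_VIMID, wrapped to 32 bits.
            byte_offset = (init_lo + wave_offset) & MASK32
            # The base is held in units of 256 bytes, hence the shift by 8.
            base_256 = byte_offset >> 8
            # The second word of Flat Scratch Init is used unchanged as the size.
            return base_256, init_hi

        def flat_scratch_gfx9(init_addr, wave_offset):
            """Return the 64 bit flat scratch base address for the wavefront."""
            return (init_addr + wave_offset) & MASK64

     For example, ``flat_scratch_gfx7_gfx8(0x10000, 64, 0x400)`` yields a base of
     ``0x104`` (in 256 byte units) and a size of 64 bytes.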
2371
2372 .. _amdgpu-amdhsa-memory-model:
2373
2374 Memory Model
2375 ~~~~~~~~~~~~
2376
2377 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2378 (see :ref:`memmodel`). *The implementation is WIP.*
2379
2380 .. TODO
2381    Update when implementation complete.
2382
2383 The AMDGPU backend supports the memory synchronization scopes specified in
2384 :ref:`amdgpu-memory-scopes`.
2385
2386 The code sequences used to implement the memory model are defined in table
2387 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2388
2389 The sequences specify the order of instructions that a single thread must
2390 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2391 to other memory instructions executed by the same thread. This allows them to be
2392 moved earlier or later which can allow them to be combined with other instances
2393 of the same instruction, or hoisted/sunk out of loops to improve
2394 performance. Only the instructions related to the memory model are given;
2395 additional ``s_waitcnt`` instructions are required to ensure registers are
2396 defined before being used. These may be able to be combined with the memory
2397 model ``s_waitcnt`` instructions as described above.
2398
2399 The AMDGPU backend supports the following memory models:
2400
2401   HSA Memory Model [HSA]_
2402     The HSA memory model uses a single happens-before relation for all address
2403     spaces (see :ref:`amdgpu-address-spaces`).
2404   OpenCL Memory Model [OpenCL]_
2405     The OpenCL memory model has separate happens-before relations for the
2406     global and local address spaces. Only a fence specifying both global and
2407     local address space, and seq_cst instructions join the relationships. Since
2408     the LLVM ``fence`` instruction does not allow an address space to be
2409     specified the OpenCL fence has to conservatively assume both local and
2410     global address space was specified. However, optimizations can often be
2411     done to eliminate the additional ``s_waitcnt`` instructions when there are
2412     no intervening memory instructions which access the corresponding address
2413     space. The code sequences in the table indicate what can be omitted for the
2414     OpenCL memory model. The target triple environment is used to determine if the
2415     source language is OpenCL (see :ref:`amdgpu-opencl`).
2416
2417 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2418 operations.
2419
2420 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2421 termed vector memory operations.
2422
2423 For GFX6-GFX9:
2424
2425 * Each agent has multiple compute units (CU).
2426 * Each CU has multiple SIMDs that execute wavefronts.
2427 * The wavefronts for a single work-group are executed in the same CU but may be
2428   executed by different SIMDs.
2429 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2430   executing on it.
2431 * All LDS operations of a CU are performed as wavefront wide operations in a
2432   global order and involve no caching. Completion is reported to a wavefront in
2433   execution order.
2434 * The LDS memory has multiple request queues shared by the SIMDs of a
2435   CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2436   can be reordered relative to each other, which can result in reordering the
2437   visibility of vector memory operations with respect to LDS operations of other
2438   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2439   ensure synchronization between LDS operations and vector memory operations
2440   between wavefronts of a work-group, but not between operations performed by the
2441   same wavefront.
2442 * The vector memory operations are performed as wavefront wide operations and
2443   completion is reported to a wavefront in execution order. The exception is
2444   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2445   vector memory order if they access LDS memory, and out of LDS operation order
2446   if they access global memory.
2447 * The vector memory operations access a single vector L1 cache shared by all
2448   SIMDs of a CU. Therefore, no special action is required for coherence between the
2449   lanes of a single wavefront, or for coherence between wavefronts in the same
2450   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2451   executing in different work-groups as they may be executing on different CUs.
2452 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2453   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2454   scalar operations are used in a restricted way so do not impact the memory
2455   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2456 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2457   the same agent.
2458 * The L2 cache has independent channels to service disjoint ranges of virtual
2459   addresses.
2460 * Each CU has a separate request queue per channel. Therefore, the vector and
2461   scalar memory operations performed by wavefronts executing in different work-groups
2462   (which may be executing on different CUs) of an agent can be reordered
2463   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2464   synchronization between vector memory operations of different CUs. It ensures a
2465   previous vector memory operation has completed before executing a subsequent
2466   vector memory or LDS operation and so can be used to meet the requirements of
2467   acquire and release.
2468 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2469   of virtual addresses can be set up to bypass it to ensure system coherence.
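
     As a rough summary of the bullets above, the following Python sketch lists
     which instructions the text calls out for each relationship between the
     communicating wavefronts. The function name and the relationship labels are
     hypothetical, and the sketch is a simplification: the code sequence table
     remains the authoritative description of what must be emitted.

     .. code-block:: python

        def required_sync(relation):
            """List the synchronization instructions named for a relation."""
            if relation == "same-wavefront":
                # Completion is reported to a wavefront in execution order.
                return []
            if relation == "same-work-group":
                # LDS and vector memory operations of different wavefronts
                # in a work-group can be reordered relative to each other.
                return ["s_waitcnt lgkmcnt(0)"]
            if relation == "different-work-group":
                # May run on different CUs: wait for vector memory operations
                # to complete, then invalidate the vector L1 cache.
                return ["s_waitcnt vmcnt(0)", "buffer_wbinvl1_vol"]
            raise ValueError("unknown relation: %r" % (relation,))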
2470
2471 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2472 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2473 memory, atomic memory orderings are not meaningful and all accesses are treated
2474 as non-atomic.
2475
2476 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2477 scalar memory instructions). Since the constant address space contents do not
2478 change during the execution of a kernel dispatch, it is not legal to perform
2479 stores. Atomic memory orderings are not meaningful and all accesses are
2480 treated as non-atomic.
2481
2482 A memory synchronization scope wider than work-group is not meaningful for the
2483 group (LDS) address space and is treated as work-group.
2484
2485 The memory model does not support the region address space which is treated as
2486 non-atomic.
2487
2488 Acquire memory ordering is not meaningful on store atomic instructions and is
2489 treated as non-atomic.
2490
2491 Release memory ordering is not meaningful on load atomic instructions and is
2492 treated as non-atomic.
2493
2494 Acquire-release memory ordering is not meaningful on load or store atomic
2495 instructions and is treated as acquire and release respectively.
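
     The normalization rules in the preceding paragraphs can be written as two
     small Python functions. The function names and the ``"non-atomic"`` label
     are hypothetical; the ordering and scope strings follow the spellings used
     in the code sequence table:

     .. code-block:: python

        def effective_ordering(instr, ordering):
            """Ordering actually applied to a load atomic or store atomic."""
            if instr == "load atomic":
                if ordering == "release":
                    return "non-atomic"  # release is not meaningful on loads
                if ordering == "acq_rel":
                    return "acquire"
            elif instr == "store atomic":
                if ordering == "acquire":
                    return "non-atomic"  # acquire is not meaningful on stores
                if ordering == "acq_rel":
                    return "release"
            return ordering

        def effective_scope(address_space, scope):
            """Scopes wider than work-group collapse for the LDS address space."""
            if address_space == "local" and scope in ("agent", "system"):
                return "workgroup"
            return scope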
2496
2497 The AMDGPU backend only uses scalar memory operations to access memory that is
2498 proven to not change during the execution of the kernel dispatch. This includes
2499 constant address space and global address space for program scope const
2500 variables. Therefore the kernel machine code does not have to maintain the
2501 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2502 and vector L1 caches are invalidated between kernel dispatches by CP since
2503 constant address space data may change between kernel dispatch executions. See
2504 :ref:`amdgpu-amdhsa-memory-spaces`.
2505
2506 The one exception is if scalar writes are used to spill SGPR registers. In this
2507 case the AMDGPU backend ensures the memory location used to spill is never
2508 accessed by vector memory operations at the same time. If scalar writes are used
2509 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2510 return since the locations may be used for vector memory instructions by a
2511 future wavefront that uses the same scratch area, or a function call that creates a
2512 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2513 as all scalar writes are write-before-read in the same thread.
2514
2515 Scratch backing memory (which is used for the private address space)
2516 is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
2517 address space is only accessed by a single thread, and is always
2518 write-before-read, there is never a need to invalidate these entries from the L1
2519 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2520 volatile cache lines.
2521
2522 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2523 to invalidate the L2 cache. This also causes it to be treated as
2524 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2525 (cache coherent) and so the L2 cache will be coherent with the CPU and other
2526 agents.
2527
2528   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2529      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2530
2531      ============ ============ ============== ========== ===============================
2532      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2533                   Ordering     Sync Scope     Address
2534                                               Space
2535      ============ ============ ============== ========== ===============================
2536      **Non-Atomic**
2537      -----------------------------------------------------------------------------------
2538      load         *none*       *none*         - global   - !volatile & !nontemporal
2539                                               - generic
2540                                               - private    1. buffer/global/flat_load
2541                                               - constant
2542                                                          - volatile & !nontemporal
2543
2544                                                            1. buffer/global/flat_load
2545                                                               glc=1
2546
2547                                                          - nontemporal
2548
2549                                                            1. buffer/global/flat_load
2550                                                               glc=1 slc=1
2551
2552      load         *none*       *none*         - local    1. ds_load
2553      store        *none*       *none*         - global   - !nontemporal
2554                                               - generic
2555                                               - private    1. buffer/global/flat_store
2556                                               - constant
2557                                                          - nontemporal
2558
2559                                                            1. buffer/global/flat_store
2560                                                               glc=1 slc=1
2561
2562      store        *none*       *none*         - local    1. ds_store
2563      **Unordered Atomic**
2564      -----------------------------------------------------------------------------------
2565      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2566      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2567      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2568                                                          atomic*.
2569      **Monotonic Atomic**
2570      -----------------------------------------------------------------------------------
2571      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2572                                - wavefront    - generic
2573                                - workgroup
2574      load atomic  monotonic    - singlethread - local    1. ds_load
2575                                - wavefront
2576                                - workgroup
2577      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2578                                - system       - generic     glc=1
2579      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2580                                - wavefront    - generic
2581                                - workgroup
2582                                - agent
2583                                - system
2584      store atomic monotonic    - singlethread - local    1. ds_store
2585                                - wavefront
2586                                - workgroup
2587      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2588                                - wavefront    - generic
2589                                - workgroup
2590                                - agent
2591                                - system
2592      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2593                                - wavefront
2594                                - workgroup
2595      **Acquire Atomic**
2596      -----------------------------------------------------------------------------------
2597      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2598                                - wavefront    - local
2599                                               - generic
2600      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2601      load atomic  acquire      - workgroup    - local    1. ds_load
2602                                                          2. s_waitcnt lgkmcnt(0)
2603
2604                                                            - If OpenCL, omit.
2605                                                            - Must happen before
2606                                                              any following
2607                                                              global/generic
2608                                                              load/load
2609                                                              atomic/store/store
2610                                                              atomic/atomicrmw.
2611                                                            - Ensures any
2612                                                              following global
2613                                                              data read is no
2614                                                              older than the load
2615                                                              atomic value being
2616                                                              acquired.
2617      load atomic  acquire      - workgroup    - generic  1. flat_load
2618                                                          2. s_waitcnt lgkmcnt(0)
2619
2620                                                            - If OpenCL, omit.
2621                                                            - Must happen before
2622                                                              any following
2623                                                              global/generic
2624                                                              load/load
2625                                                              atomic/store/store
2626                                                              atomic/atomicrmw.
2627                                                            - Ensures any
2628                                                              following global
2629                                                              data read is no
2630                                                              older than the load
2631                                                              atomic value being
2632                                                              acquired.
2633      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2634                                - system                     glc=1
2635                                                          2. s_waitcnt vmcnt(0)
2636
2637                                                            - Must happen before
2638                                                              following
2639                                                              buffer_wbinvl1_vol.
2640                                                            - Ensures the load
2641                                                              has completed
2642                                                              before invalidating
2643                                                              the cache.
2644
2645                                                          3. buffer_wbinvl1_vol
2646
2647                                                            - Must happen before
2648                                                              any following
2649                                                              global/generic
2650                                                              load/load
2651                                                              atomic/atomicrmw.
2652                                                            - Ensures that
2653                                                              following
2654                                                              loads will not see
2655                                                              stale global data.
2656
2657      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2658                                - system                  2. s_waitcnt vmcnt(0) &
2659                                                             lgkmcnt(0)
2660
2661                                                            - If OpenCL omit
2662                                                              lgkmcnt(0).
2663                                                            - Must happen before
2664                                                              following
2665                                                              buffer_wbinvl1_vol.
2666                                                            - Ensures the flat_load
2667                                                              has completed
2668                                                              before invalidating
2669                                                              the cache.
2670
2671                                                          3. buffer_wbinvl1_vol
2672
2673                                                            - Must happen before
2674                                                              any following
2675                                                              global/generic
2676                                                              load/load
2677                                                              atomic/atomicrmw.
2678                                                            - Ensures that
2679                                                              following loads
2680                                                              will not see stale
2681                                                              global data.
2682
2683      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2684                                - wavefront    - local
2685                                               - generic
2686      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2687      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2688                                                          2. s_waitcnt lgkmcnt(0)
2689
2690                                                            - If OpenCL, omit.
2691                                                            - Must happen before
2692                                                              any following
2693                                                              global/generic
2694                                                              load/load
2695                                                              atomic/store/store
2696                                                              atomic/atomicrmw.
2697                                                            - Ensures any
2698                                                              following global
2699                                                              data read is no
2700                                                              older than the
2701                                                              atomicrmw value
2702                                                              being acquired.
2703
2704      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2705                                                          2. s_waitcnt lgkmcnt(0)
2706
2707                                                            - If OpenCL, omit.
2708                                                            - Must happen before
2709                                                              any following
2710                                                              global/generic
2711                                                              load/load
2712                                                              atomic/store/store
2713                                                              atomic/atomicrmw.
2714                                                            - Ensures any
2715                                                              following global
2716                                                              data read is no
2717                                                              older than the
2718                                                              atomicrmw value
2719                                                              being acquired.
2720
2721      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2722                                - system                  2. s_waitcnt vmcnt(0)
2723
2724                                                            - Must happen before
2725                                                              following
2726                                                              buffer_wbinvl1_vol.
2727                                                            - Ensures the
2728                                                              atomicrmw has
2729                                                              completed before
2730                                                              invalidating the
2731                                                              cache.
2732
2733                                                          3. buffer_wbinvl1_vol
2734
2735                                                            - Must happen before
2736                                                              any following
2737                                                              global/generic
2738                                                              load/load
2739                                                              atomic/atomicrmw.
2740                                                            - Ensures that
2741                                                              following loads
2742                                                              will not see stale
2743                                                              global data.
2744
2745      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2746                                - system                  2. s_waitcnt vmcnt(0) &
2747                                                             lgkmcnt(0)
2748
2749                                                            - If OpenCL, omit
2750                                                              lgkmcnt(0).
2751                                                            - Must happen before
2752                                                              following
2753                                                              buffer_wbinvl1_vol.
2754                                                            - Ensures the
2755                                                              atomicrmw has
2756                                                              completed before
2757                                                              invalidating the
2758                                                              cache.
2759
2760                                                          3. buffer_wbinvl1_vol
2761
2762                                                            - Must happen before
2763                                                              any following
2764                                                              global/generic
2765                                                              load/load
2766                                                              atomic/atomicrmw.
2767                                                            - Ensures that
2768                                                              following loads
2769                                                              will not see stale
2770                                                              global data.
2771
2772      fence        acquire      - singlethread *none*     *none*
2773                                - wavefront
2774      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2775
2776                                                            - If OpenCL and
2777                                                              address space is
2778                                                              not generic, omit.
2779                                                            - However, since LLVM
2780                                                              currently has no
2781                                                              address space on
2782                                                              the fence, always
2783                                                              generate it
2784                                                              conservatively. If
2785                                                              the fence had an
2786                                                              address space, it
2787                                                              would be set to the
2788                                                              address space of the
2789                                                              OpenCL fence flag,
2790                                                              or to generic if
2791                                                              both local and
2792                                                              global flags are
2793                                                              specified.
2794                                                            - Must happen after
2795                                                              any preceding
2796                                                              local/generic load
2797                                                              atomic/atomicrmw
2798                                                              with an equal or
2799                                                              wider sync scope
2800                                                              and memory ordering
2801                                                              stronger than
2802                                                              unordered (this is
2803                                                              termed the
2804                                                              fence-paired-atomic).
2805                                                            - Must happen before
2806                                                              any following
2807                                                              global/generic
2808                                                              load/load
2809                                                              atomic/store/store
2810                                                              atomic/atomicrmw.
2811                                                            - Ensures any
2812                                                              following global
2813                                                              data read is no
2814                                                              older than the
2815                                                              value read by the
2816                                                              fence-paired-atomic.
2817
2818      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2819                                - system                     vmcnt(0)
2820
2821                                                            - If OpenCL and
2822                                                              address space is
2823                                                              not generic, omit
2824                                                              lgkmcnt(0).
2825                                                            - However, since LLVM
2826                                                              currently has no
2827                                                              address space on
2828                                                              the fence, always
2829                                                              generate it
2830                                                              conservatively
2831                                                              (see comment for
2832                                                              previous fence).
2833                                                            - Could be split into
2834                                                              separate s_waitcnt
2835                                                              vmcnt(0) and
2836                                                              s_waitcnt
2837                                                              lgkmcnt(0) to allow
2838                                                              them to be
2839                                                              independently moved
2840                                                              according to the
2841                                                              following rules.
2842                                                            - s_waitcnt vmcnt(0)
2843                                                              must happen after
2844                                                              any preceding
2845                                                              global/generic load
2846                                                              atomic/atomicrmw
2847                                                              with an equal or
2848                                                              wider sync scope
2849                                                              and memory ordering
2850                                                              stronger than
2851                                                              unordered (this is
2852                                                              termed the
2853                                                              fence-paired-atomic).
2854                                                            - s_waitcnt lgkmcnt(0)
2855                                                              must happen after
2856                                                              any preceding
2857                                                              local/generic load
2858                                                              atomic/atomicrmw
2859                                                              with an equal or
2860                                                              wider sync scope
2861                                                              and memory ordering
2862                                                              stronger than
2863                                                              unordered (this is
2864                                                              termed the
2865                                                              fence-paired-atomic).
2866                                                            - Must happen before
2867                                                              the following
2868                                                              buffer_wbinvl1_vol.
2869                                                            - Ensures that the
2870                                                              fence-paired-atomic
2871                                                              has completed
2872                                                              before invalidating
2873                                                              the cache.
2874                                                              Therefore, any
2875                                                              following
2876                                                              locations read must
2877                                                              be no older than
2878                                                              the value read by
2879                                                              the
2880                                                              fence-paired-atomic.
2881
2882                                                          2. buffer_wbinvl1_vol
2883
2884                                                            - Must happen before any
2885                                                              following global/generic
2886                                                              load/load
2887                                                              atomic/store/store
2888                                                              atomic/atomicrmw.
2889                                                            - Ensures that
2890                                                              following loads
2891                                                              will not see stale
2892                                                              global data.
2893
2894      **Release Atomic**
2895      -----------------------------------------------------------------------------------
2896      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2897                                - wavefront    - local
2898                                               - generic
2899      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2900
2901                                                            - If OpenCL, omit.
2902                                                            - Must happen after
2903                                                              any preceding
2904                                                              local/generic
2905                                                              load/store/load
2906                                                              atomic/store
2907                                                              atomic/atomicrmw.
2908                                                            - Must happen before
2909                                                              the following
2910                                                              store.
2911                                                            - Ensures that all
2912                                                              memory operations
2913                                                              to local have
2914                                                              completed before
2915                                                              performing the
2916                                                              store that is being
2917                                                              released.
2918
2919                                                          2. buffer/global/flat_store
2920      store atomic release      - workgroup    - local    1. ds_store
2921      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2922
2923                                                            - If OpenCL, omit.
2924                                                            - Must happen after
2925                                                              any preceding
2926                                                              local/generic
2927                                                              load/store/load
2928                                                              atomic/store
2929                                                              atomic/atomicrmw.
2930                                                            - Must happen before
2931                                                              the following
2932                                                              store.
2933                                                            - Ensures that all
2934                                                              memory operations
2935                                                              to local have
2936                                                              completed before
2937                                                              performing the
2938                                                              store that is being
2939                                                              released.
2940
2941                                                          2. flat_store
2942      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2943                                - system       - generic     vmcnt(0)
2944
2945                                                            - If OpenCL, omit
2946                                                              lgkmcnt(0).
2947                                                            - Could be split into
2948                                                              separate s_waitcnt
2949                                                              vmcnt(0) and
2950                                                              s_waitcnt
2951                                                              lgkmcnt(0) to allow
2952                                                              them to be
2953                                                              independently moved
2954                                                              according to the
2955                                                              following rules.
2956                                                            - s_waitcnt vmcnt(0)
2957                                                              must happen after
2958                                                              any preceding
2959                                                              global/generic
2960                                                              load/store/load
2961                                                              atomic/store
2962                                                              atomic/atomicrmw.
2963                                                            - s_waitcnt lgkmcnt(0)
2964                                                              must happen after
2965                                                              any preceding
2966                                                              local/generic
2967                                                              load/store/load
2968                                                              atomic/store
2969                                                              atomic/atomicrmw.
2970                                                            - Must happen before
2971                                                              the following
2972                                                              store.
2973                                                            - Ensures that all
2974                                                              memory operations
2975                                                              to global and local
2976                                                              have completed
2977                                                              before performing
2978                                                              the store that is
2979                                                              being released.
2980
2981                                                          2. buffer/global/flat_store
2982      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2983                                - wavefront    - local
2984                                               - generic
2985      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2986
2987                                                            - If OpenCL, omit.
2988                                                            - Must happen after
2989                                                              any preceding
2990                                                              local/generic
2991                                                              load/store/load
2992                                                              atomic/store
2993                                                              atomic/atomicrmw.
2994                                                            - Must happen before
2995                                                              the following
2996                                                              atomicrmw.
2997                                                            - Ensures that all
2998                                                              memory operations
2999                                                              to local have
3000                                                              completed before
3001                                                              performing the
3002                                                              atomicrmw that is
3003                                                              being released.
3004
3005                                                          2. buffer/global/flat_atomic
3006      atomicrmw    release      - workgroup    - local    1. ds_atomic
3007      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3008
3009                                                            - If OpenCL, omit.
3010                                                            - Must happen after
3011                                                              any preceding
3012                                                              local/generic
3013                                                              load/store/load
3014                                                              atomic/store
3015                                                              atomic/atomicrmw.
3016                                                            - Must happen before
3017                                                              the following
3018                                                              atomicrmw.
3019                                                            - Ensures that all
3020                                                              memory operations
3021                                                              to local have
3022                                                              completed before
3023                                                              performing the
3024                                                              atomicrmw that is
3025                                                              being released.
3026
3027                                                          2. flat_atomic
3028      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3029                                - system       - generic     vmcnt(0)
3030
3031                                                            - If OpenCL, omit
3032                                                              lgkmcnt(0).
3033                                                            - Could be split into
3034                                                              separate s_waitcnt
3035                                                              vmcnt(0) and
3036                                                              s_waitcnt
3037                                                              lgkmcnt(0) to allow
3038                                                              them to be
3039                                                              independently moved
3040                                                              according to the
3041                                                              following rules.
3042                                                            - s_waitcnt vmcnt(0)
3043                                                              must happen after
3044                                                              any preceding
3045                                                              global/generic
3046                                                              load/store/load
3047                                                              atomic/store
3048                                                              atomic/atomicrmw.
3049                                                            - s_waitcnt lgkmcnt(0)
3050                                                              must happen after
3051                                                              any preceding
3052                                                              local/generic
3053                                                              load/store/load
3054                                                              atomic/store
3055                                                              atomic/atomicrmw.
3056                                                            - Must happen before
3057                                                              the following
3058                                                              atomicrmw.
3059                                                            - Ensures that all
3060                                                              memory operations
3061                                                              to global and local
3062                                                              have completed
3063                                                              before performing
3064                                                              the atomicrmw that
3065                                                              is being released.
3066
3067                                                          2. buffer/global/flat_atomic
3068      fence        release      - singlethread *none*     *none*
3069                                - wavefront
3070      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3071
3072                                                            - If OpenCL and
3073                                                              address space is
3074                                                              not generic, omit.
3075                                                            - However, since LLVM
3076                                                              currently has no
3077                                                              address space on
3078                                                              the fence, always
3079                                                              generate it
3080                                                              conservatively. If
3081                                                              the fence had an
3082                                                              address space, it
3083                                                              would be set to the
3084                                                              address space of the
3085                                                              OpenCL fence flag,
3086                                                              or to generic if
3087                                                              both local and
3088                                                              global flags are
3089                                                              specified.
3090                                                            - Must happen after
3091                                                              any preceding
3092                                                              local/generic
3093                                                              load/load
3094                                                              atomic/store/store
3095                                                              atomic/atomicrmw.
3096                                                            - Must happen before
3097                                                              any following store
3098                                                              atomic/atomicrmw
3099                                                              with an equal or
3100                                                              wider sync scope
3101                                                              and memory ordering
3102                                                              stronger than
3103                                                              unordered (this is
3104                                                              termed the
3105                                                              fence-paired-atomic).
3106                                                            - Ensures that all
3107                                                              memory operations
3108                                                              to local have
3109                                                              completed before
3110                                                              performing the
3111                                                              following
3112                                                              fence-paired-atomic.
3113
3114      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3115                                - system                     vmcnt(0)
3116
3117                                                            - If OpenCL and
3118                                                              address space is
3119                                                              not generic, omit
3120                                                              lgkmcnt(0).
3121                                                            - If OpenCL and
3122                                                              address space is
3123                                                              local, omit
3124                                                              vmcnt(0).
3125                                                            - However, since LLVM
3126                                                              currently has no
3127                                                              address space on
3128                                                              the fence, this must
3129                                                              conservatively
3130                                                              always be generated.
3131                                                              If the fence had an
3132                                                              address space, it
3133                                                              would be set to the
3134                                                              address space of the
3135                                                              OpenCL fence flag, or to
3136                                                              generic if both
3137                                                              local and global
3138                                                              flags are
3139                                                              specified.
3140                                                            - Could be split into
3141                                                              separate s_waitcnt
3142                                                              vmcnt(0) and
3143                                                              s_waitcnt
3144                                                              lgkmcnt(0) to allow
3145                                                              them to be
3146                                                              independently moved
3147                                                              according to the
3148                                                              following rules.
3149                                                            - s_waitcnt vmcnt(0)
3150                                                              must happen after
3151                                                              any preceding
3152                                                              global/generic
3153                                                              load/store/load
3154                                                              atomic/store
3155                                                              atomic/atomicrmw.
3156                                                            - s_waitcnt lgkmcnt(0)
3157                                                              must happen after
3158                                                              any preceding
3159                                                              local/generic
3160                                                              load/store/load
3161                                                              atomic/store
3162                                                              atomic/atomicrmw.
3163                                                            - Must happen before
3164                                                              any following store
3165                                                              atomic/atomicrmw
3166                                                              with an equal or
3167                                                              wider sync scope
3168                                                              and memory ordering
3169                                                              stronger than
3170                                                              unordered (this is
3171                                                              termed the
3172                                                              fence-paired-atomic).
3173                                                            - Ensures that all
3174                                                              memory operations
3175                                                              have
3176                                                              completed before
3177                                                              performing the
3178                                                              following
3179                                                              fence-paired-atomic.
3180
3181      **Acquire-Release Atomic**
3182      -----------------------------------------------------------------------------------
3183      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3184                                - wavefront    - local
3185                                               - generic
3186      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3187
3188                                                            - If OpenCL, omit.
3189                                                            - Must happen after
3190                                                              any preceding
3191                                                              local/generic
3192                                                              load/store/load
3193                                                              atomic/store
3194                                                              atomic/atomicrmw.
3195                                                            - Must happen before
3196                                                              the following
3197                                                              atomicrmw.
3198                                                            - Ensures that all
3199                                                              memory operations
3200                                                              to local have
3201                                                              completed before
3202                                                              performing the
3203                                                              atomicrmw that is
3204                                                              being released.
3205
3206                                                          2. buffer/global/flat_atomic
3207      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3208                                                          2. s_waitcnt lgkmcnt(0)
3209
3210                                                            - If OpenCL, omit.
3211                                                            - Must happen before
3212                                                              any following
3213                                                              global/generic
3214                                                              load/load
3215                                                              atomic/store/store
3216                                                              atomic/atomicrmw.
3217                                                            - Ensures any
3218                                                              following global
3219                                                              data read is no
3220                                                              older than the
3221                                                              atomicrmw value being
3222                                                              acquired.
3223
3224      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3225
3226                                                            - If OpenCL, omit.
3227                                                            - Must happen after
3228                                                              any preceding
3229                                                              local/generic
3230                                                              load/store/load
3231                                                              atomic/store
3232                                                              atomic/atomicrmw.
3233                                                            - Must happen before
3234                                                              the following
3235                                                              atomicrmw.
3236                                                            - Ensures that all
3237                                                              memory operations
3238                                                              to local have
3239                                                              completed before
3240                                                              performing the
3241                                                              atomicrmw that is
3242                                                              being released.
3243
3244                                                          2. flat_atomic
3245                                                          3. s_waitcnt lgkmcnt(0)
3246
3247                                                            - If OpenCL, omit.
3248                                                            - Must happen before
3249                                                              any following
3250                                                              global/generic
3251                                                              load/load
3252                                                              atomic/store/store
3253                                                              atomic/atomicrmw.
3254                                                            - Ensures any
3255                                                              following global
3256                                                              data read is no
3257                                                              older than the
3258                                                              atomicrmw value being
3259                                                              acquired.
3260
3261      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3262                                - system                     vmcnt(0)
3263
3264                                                            - If OpenCL, omit
3265                                                              lgkmcnt(0).
3266                                                            - Could be split into
3267                                                              separate s_waitcnt
3268                                                              vmcnt(0) and
3269                                                              s_waitcnt
3270                                                              lgkmcnt(0) to allow
3271                                                              them to be
3272                                                              independently moved
3273                                                              according to the
3274                                                              following rules.
3275                                                            - s_waitcnt vmcnt(0)
3276                                                              must happen after
3277                                                              any preceding
3278                                                              global/generic
3279                                                              load/store/load
3280                                                              atomic/store
3281                                                              atomic/atomicrmw.
3282                                                            - s_waitcnt lgkmcnt(0)
3283                                                              must happen after
3284                                                              any preceding
3285                                                              local/generic
3286                                                              load/store/load
3287                                                              atomic/store
3288                                                              atomic/atomicrmw.
3289                                                            - Must happen before
3290                                                              the following
3291                                                              atomicrmw.
3292                                                            - Ensures that all
3293                                                              memory operations
3294                                                              to global have
3295                                                              completed before
3296                                                              performing the
3297                                                              atomicrmw that is
3298                                                              being released.
3299
3300                                                          2. buffer/global/flat_atomic
3301                                                          3. s_waitcnt vmcnt(0)
3302
3303                                                            - Must happen before
3304                                                              following
3305                                                              buffer_wbinvl1_vol.
3306                                                            - Ensures the
3307                                                              atomicrmw has
3308                                                              completed before
3309                                                              invalidating the
3310                                                              cache.
3311
3312                                                          4. buffer_wbinvl1_vol
3313
3314                                                            - Must happen before
3315                                                              any following
3316                                                              global/generic
3317                                                              load/load
3318                                                              atomic/atomicrmw.
3319                                                            - Ensures that
3320                                                              following loads
3321                                                              will not see stale
3322                                                              global data.
3323
3324      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3325                                - system                     vmcnt(0)
3326
3327                                                            - If OpenCL, omit
3328                                                              lgkmcnt(0).
3329                                                            - Could be split into
3330                                                              separate s_waitcnt
3331                                                              vmcnt(0) and
3332                                                              s_waitcnt
3333                                                              lgkmcnt(0) to allow
3334                                                              them to be
3335                                                              independently moved
3336                                                              according to the
3337                                                              following rules.
3338                                                            - s_waitcnt vmcnt(0)
3339                                                              must happen after
3340                                                              any preceding
3341                                                              global/generic
3342                                                              load/store/load
3343                                                              atomic/store
3344                                                              atomic/atomicrmw.
3345                                                            - s_waitcnt lgkmcnt(0)
3346                                                              must happen after
3347                                                              any preceding
3348                                                              local/generic
3349                                                              load/store/load
3350                                                              atomic/store
3351                                                              atomic/atomicrmw.
3352                                                            - Must happen before
3353                                                              the following
3354                                                              atomicrmw.
3355                                                            - Ensures that all
3356                                                              memory operations
3357                                                              to global have
3358                                                              completed before
3359                                                              performing the
3360                                                              atomicrmw that is
3361                                                              being released.
3362
3363                                                          2. flat_atomic
3364                                                          3. s_waitcnt vmcnt(0) &
3365                                                             lgkmcnt(0)
3366
3367                                                            - If OpenCL, omit
3368                                                              lgkmcnt(0).
3369                                                            - Must happen before
3370                                                              following
3371                                                              buffer_wbinvl1_vol.
3372                                                            - Ensures the
3373                                                              atomicrmw has
3374                                                              completed before
3375                                                              invalidating the
3376                                                              cache.
3377
3378                                                          4. buffer_wbinvl1_vol
3379
3380                                                            - Must happen before
3381                                                              any following
3382                                                              global/generic
3383                                                              load/load
3384                                                              atomic/atomicrmw.
3385                                                            - Ensures that
3386                                                              following loads
3387                                                              will not see stale
3388                                                              global data.
3389
3390      fence        acq_rel      - singlethread *none*     *none*
3391                                - wavefront
3392      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3393
3394                                                            - If OpenCL and
3395                                                              address space is
3396                                                              not generic, omit.
3397                                                            - However,
3398                                                              since LLVM
3399                                                              currently has no
3400                                                              address space on
3401                                                              the fence, this must
3402                                                              conservatively
3403                                                              always be generated
3404                                                              (see comment for
3405                                                              previous fence).
3406                                                            - Must happen after
3407                                                              any preceding
3408                                                              local/generic
3409                                                              load/load
3410                                                              atomic/store/store
3411                                                              atomic/atomicrmw.
3412                                                            - Must happen before
3413                                                              any following
3414                                                              global/generic
3415                                                              load/load
3416                                                              atomic/store/store
3417                                                              atomic/atomicrmw.
3418                                                            - Ensures that all
3419                                                              memory operations
3420                                                              to local have
3421                                                              completed before
3422                                                              performing any
3423                                                              following global
3424                                                              memory operations.
3425                                                            - Ensures that the
3426                                                              preceding
3427                                                              local/generic load
3428                                                              atomic/atomicrmw
3429                                                              with an equal or
3430                                                              wider sync scope
3431                                                              and memory ordering
3432                                                              stronger than
3433                                                              unordered (this is
3434                                                              termed the
3435                                                              acquire-fence-paired-atomic)
3436                                                              has completed
3437                                                              before following
3438                                                              global memory
3439                                                              operations. This
3440                                                              satisfies the
3441                                                              requirements of
3442                                                              acquire.
3443                                                            - Ensures that all
3444                                                              previous memory
3445                                                              operations have
3446                                                              completed before a
3447                                                              following
3448                                                              local/generic store
3449                                                              atomic/atomicrmw
3450                                                              with an equal or
3451                                                              wider sync scope
3452                                                              and memory ordering
3453                                                              stronger than
3454                                                              unordered (this is
3455                                                              termed the
3456                                                              release-fence-paired-atomic).
3457                                                              This satisfies the
3458                                                              requirements of
3459                                                              release.
3460
3461      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3462                                - system                     vmcnt(0)
3463
3464                                                            - If OpenCL and
3465                                                              address space is
3466                                                              not generic, omit
3467                                                              lgkmcnt(0).
3468                                                            - However, since LLVM
3469                                                              currently has no
3470                                                              address space on
3471                                                              the fence, this must
3472                                                              conservatively
3473                                                              always be generated
3474                                                              (see comment for
3475                                                              previous fence).
3476                                                            - Could be split into
3477                                                              separate s_waitcnt
3478                                                              vmcnt(0) and
3479                                                              s_waitcnt
3480                                                              lgkmcnt(0) to allow
3481                                                              them to be
3482                                                              independently moved
3483                                                              according to the
3484                                                              following rules.
3485                                                            - s_waitcnt vmcnt(0)
3486                                                              must happen after
3487                                                              any preceding
3488                                                              global/generic
3489                                                              load/store/load
3490                                                              atomic/store
3491                                                              atomic/atomicrmw.
3492                                                            - s_waitcnt lgkmcnt(0)
3493                                                              must happen after
3494                                                              any preceding
3495                                                              local/generic
3496                                                              load/store/load
3497                                                              atomic/store
3498                                                              atomic/atomicrmw.
3499                                                            - Must happen before
3500                                                              the following
3501                                                              buffer_wbinvl1_vol.
3502                                                            - Ensures that the
3503                                                              preceding
3504                                                              global/local/generic
3505                                                              load
3506                                                              atomic/atomicrmw
3507                                                              with an equal or
3508                                                              wider sync scope
3509                                                              and memory ordering
3510                                                              stronger than
3511                                                              unordered (this is
3512                                                              termed the
3513                                                              acquire-fence-paired-atomic
3514                                                              ) has completed
3515                                                              before invalidating
3516                                                              the cache. This
3517                                                              satisfies the
3518                                                              requirements of
3519                                                              acquire.
3520                                                            - Ensures that all
3521                                                              previous memory
3522                                                              operations have
3523                                                              completed before a
3524                                                              following
3525                                                              global/local/generic
3526                                                              store
3527                                                              atomic/atomicrmw
3528                                                              with an equal or
3529                                                              wider sync scope
3530                                                              and memory ordering
3531                                                              stronger than
3532                                                              unordered (this is
3533                                                              termed the
3534                                                              release-fence-paired-atomic
3535                                                              ). This satisfies the
3536                                                              requirements of
3537                                                              release.
3538
3539                                                          2. buffer_wbinvl1_vol
3540
3541                                                            - Must happen before
3542                                                              any following
3543                                                              global/generic
3544                                                              load/load
3545                                                              atomic/store/store
3546                                                              atomic/atomicrmw.
3547                                                            - Ensures that
3548                                                              following loads
3549                                                              will not see stale
3550                                                              global data. This
3551                                                              satisfies the
3552                                                              requirements of
3553                                                              acquire.
3554
3555      **Sequentially Consistent Atomic**
3556      -----------------------------------------------------------------------------------
3557      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3558                                - wavefront    - local    load atomic acquire,
3559                                               - generic  except must generate
3560                                                          all instructions even
3561                                                          for OpenCL.*
3562      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3563                                               - generic
3564                                                            - Must
3565                                                              happen after
3566                                                              preceding
3567                                                              global/generic load
3568                                                              atomic/store
3569                                                              atomic/atomicrmw
3570                                                              with memory
3571                                                              ordering of seq_cst
3572                                                              and with equal or
3573                                                              wider sync scope.
3574                                                              (Note that seq_cst
3575                                                              fences have their
3576                                                              own s_waitcnt
3577                                                              lgkmcnt(0) and so do
3578                                                              not need to be
3579                                                              considered.)
3580                                                            - Ensures any
3581                                                              preceding
3582                                                              sequentially
3583                                                              consistent local
3584                                                              memory instructions
3585                                                              have completed
3586                                                              before executing
3587                                                              this sequentially
3588                                                              consistent
3589                                                              instruction. This
3590                                                              prevents reordering
3591                                                              a seq_cst store
3592                                                              followed by a
3593                                                              seq_cst load. (Note
3594                                                              that seq_cst is
3595                                                              stronger than
3596                                                              acquire/release as
3597                                                              the reordering of
3598                                                              load acquire
3599                                                              followed by a store
3600                                                              release is
3601                                                              prevented by the
3602                                                              waitcnt of
3603                                                              the release, but
3604                                                              there is nothing
3605                                                              preventing a store
3606                                                              release followed by
3607                                                              load acquire from
3608                                                              completing out of
3609                                                              order.)
3610
3611                                                          2. *Following
3612                                                             instructions same as
3613                                                             corresponding load
3614                                                             atomic acquire,
3615                                                             except must generate
3616                                                             all instructions even
3617                                                             for OpenCL.*
3618      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3619                                                          load atomic acquire,
3620                                                          except must generate
3621                                                          all instructions even
3622                                                          for OpenCL.*
3623      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3624                                - system       - generic     vmcnt(0)
3625
3626                                                            - Could be split into
3627                                                              separate s_waitcnt
3628                                                              vmcnt(0)
3629                                                              and s_waitcnt
3630                                                              lgkmcnt(0) to allow
3631                                                              them to be
3632                                                              independently moved
3633                                                              according to the
3634                                                              following rules.
3635                                                            - s_waitcnt lgkmcnt(0)
3636                                                              must happen after
3637                                                              preceding
3638                                                              global/generic load
3639                                                              atomic/store
3640                                                              atomic/atomicrmw
3641                                                              with memory
3642                                                              ordering of seq_cst
3643                                                              and with equal or
3644                                                              wider sync scope.
3645                                                              (Note that seq_cst
3646                                                              fences have their
3647                                                              own s_waitcnt
3648                                                              lgkmcnt(0) and so do
3649                                                              not need to be
3650                                                              considered.)
3651                                                            - s_waitcnt vmcnt(0)
3652                                                              must happen after
3653                                                              preceding
3654                                                              global/generic load
3655                                                              atomic/store
3656                                                              atomic/atomicrmw
3657                                                              with memory
3658                                                              ordering of seq_cst
3659                                                              and with equal or
3660                                                              wider sync scope.
3661                                                              (Note that seq_cst
3662                                                              fences have their
3663                                                              own s_waitcnt
3664                                                              vmcnt(0) and so do
3665                                                              not need to be
3666                                                              considered.)
3667                                                            - Ensures any
3668                                                              preceding
3669                                                              sequentially
3670                                                              consistent global
3671                                                              memory instructions
3672                                                              have completed
3673                                                              before executing
3674                                                              this sequentially
3675                                                              consistent
3676                                                              instruction. This
3677                                                              prevents reordering
3678                                                              a seq_cst store
3679                                                              followed by a
3680                                                              seq_cst load. (Note
3681                                                              that seq_cst is
3682                                                              stronger than
3683                                                              acquire/release as
3684                                                              the reordering of
3685                                                              load acquire
3686                                                              followed by a store
3687                                                              release is
3688                                                              prevented by the
3689                                                              waitcnt of
3690                                                              the release, but
3691                                                              there is nothing
3692                                                              preventing a store
3693                                                              release followed by
3694                                                              load acquire from
3695                                                              completing out of
3696                                                              order.)
3697
3698                                                          2. *Following
3699                                                             instructions same as
3700                                                             corresponding load
3701                                                             atomic acquire,
3702                                                             except must generate
3703                                                             all instructions even
3704                                                             for OpenCL.*
3705      store atomic seq_cst      - singlethread - global   *Same as corresponding
3706                                - wavefront    - local    store atomic release,
3707                                - workgroup    - generic  except must generate
3708                                                          all instructions even
3709                                                          for OpenCL.*
3710      store atomic seq_cst      - agent        - global   *Same as corresponding
3711                                - system       - generic  store atomic release,
3712                                                          except must generate
3713                                                          all instructions even
3714                                                          for OpenCL.*
3715      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3716                                - wavefront    - local    atomicrmw acq_rel,
3717                                - workgroup    - generic  except must generate
3718                                                          all instructions even
3719                                                          for OpenCL.*
3720      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3721                                - system       - generic  atomicrmw acq_rel,
3722                                                          except must generate
3723                                                          all instructions even
3724                                                          for OpenCL.*
3725      fence        seq_cst      - singlethread *none*     *Same as corresponding
3726                                - wavefront               fence acq_rel,
3727                                - workgroup               except must generate
3728                                - agent                   all instructions even
3729                                - system                  for OpenCL.*
3730      ============ ============ ============== ========== ===============================
3731
3732 The memory order also adds the single thread optimization constraints defined
3733 in table
3734 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3735
3736   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3737      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3738
3739      ============ ==============================================================
3740      LLVM Memory  Optimization Constraints
3741      Ordering
3742      ============ ==============================================================
3743      unordered    *none*
3744      monotonic    *none*
3745      acquire      - If a load atomic/atomicrmw then no following load/load
3746                     atomic/store/store atomic/atomicrmw/fence instruction can
3747                     be moved before the acquire.
3748                   - If a fence then same as load atomic, plus no preceding
3749                     associated fence-paired-atomic can be moved after the fence.
3750      release      - If a store atomic/atomicrmw then no preceding load/load
3751                     atomic/store/store atomic/atomicrmw/fence instruction can
3752                     be moved after the release.
3753                   - If a fence then same as store atomic, plus no following
3754                     associated fence-paired-atomic can be moved before the
3755                     fence.
3756      acq_rel      Same constraints as both acquire and release.
3757      seq_cst      - If a load atomic then same constraints as acquire, plus no
3758                     preceding sequentially consistent load atomic/store
3759                     atomic/atomicrmw/fence instruction can be moved after the
3760                     seq_cst.
3761                   - If a store atomic then the same constraints as release, plus
3762                     no following sequentially consistent load atomic/store
3763                     atomic/atomicrmw/fence instruction can be moved before the
3764                     seq_cst.
3765                   - If an atomicrmw/fence then same constraints as acq_rel.
3766      ============ ==============================================================
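
The constraints above can be read as two questions about an atomic
instruction: may a following memory instruction be moved before it, and may a
preceding one be moved after it. The sketch below encodes the table rows in C++
purely as an illustration. It is not part of the AMDGPU backend, the names
``Kind``, ``Ordering``, ``blocksHoist`` and ``blocksSink`` are hypothetical, and
the fence rows (including the fence-paired-atomic rule) are left out for
brevity.

.. code-block:: c++

  enum class Kind { LoadAtomic, StoreAtomic, AtomicRMW };
  enum class Ordering { Unordered, Monotonic, Acquire, Release, AcqRel, SeqCst };

  // True if a following memory instruction may NOT be moved before an atomic
  // of kind k with ordering o. otherIsSeqCst says whether that following
  // instruction is itself sequentially consistent.
  bool blocksHoist(Kind k, Ordering o, bool otherIsSeqCst) {
    switch (o) {
    case Ordering::Acquire:
    case Ordering::AcqRel:
      return k != Kind::StoreAtomic;  // acquire covers load atomic/atomicrmw
    case Ordering::SeqCst:
      // A seq_cst store atomic only pins following seq_cst instructions.
      return k == Kind::StoreAtomic ? otherIsSeqCst : true;
    default:
      return false;  // unordered, monotonic, release
    }
  }

  // True if a preceding memory instruction may NOT be moved after an atomic
  // of kind k with ordering o. otherIsSeqCst describes the preceding one.
  bool blocksSink(Kind k, Ordering o, bool otherIsSeqCst) {
    switch (o) {
    case Ordering::Release:
    case Ordering::AcqRel:
      return k != Kind::LoadAtomic;  // release covers store atomic/atomicrmw
    case Ordering::SeqCst:
      // A seq_cst load atomic only pins preceding seq_cst instructions.
      return k == Kind::LoadAtomic ? otherIsSeqCst : true;
    default:
      return false;  // unordered, monotonic, acquire
    }
  }

For example, ``blocksHoist(Kind::StoreAtomic, Ordering::SeqCst, true)`` is true
while the same call with ``Ordering::Release`` is false, which is the extra
store-then-load restriction that seq_cst adds over release.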
3767
3768 Trap Handler ABI
3769 ~~~~~~~~~~~~~~~~
3770
3771 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
3772 runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
3773 supports the ``s_trap`` instruction with the following usage:
3774
3775   .. table:: AMDGPU Trap Handler for AMDHSA OS
3776      :name: amdgpu-trap-handler-for-amdhsa-os-table
3777
3778      =================== =============== =============== =======================
3779      Usage               Code Sequence   Trap Handler    Description
3780                                          Inputs
3781      =================== =============== =============== =======================
3782      reserved            ``s_trap 0x00``                 Reserved by hardware.
3783      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3784                                            ``queue_ptr`` ``debugtrap``
3785                                          ``VGPR0``:      intrinsic (not
3786                                            ``arg``       implemented).
3787      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3788                                            ``queue_ptr`` terminated and its
3789                                                          associated queue put
3790                                                          into the error state.
3791      ``llvm.debugtrap``  ``s_trap 0x03``                 - If the debugger is
3792                                                            not installed,
3793                                                            behaves as a
3794                                                            no-operation. The
3795                                                            trap handler is
3796                                                            entered and
3797                                                            immediately returns
3798                                                            to continue
3799                                                            execution of the
3800                                                            wavefront.
3801                                                          - If the debugger is
3802                                                            installed, causes
3803                                                            the debug trap to be
3804                                                            reported by the
3805                                                            debugger and the
3806                                                            wavefront is put in
3807                                                            the halt state until
3808                                                            resumed by the
3809                                                            debugger.
3810      reserved            ``s_trap 0x04``                 Reserved.
3811      reserved            ``s_trap 0x05``                 Reserved.
3812      reserved            ``s_trap 0x06``                 Reserved.
3813      debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
3814                                                          breakpoints.
3815      reserved            ``s_trap 0x08``                 Reserved.
3816      reserved            ``s_trap 0xfe``                 Reserved.
3817      reserved            ``s_trap 0xff``                 Reserved.
3818      =================== =============== =============== =======================
3819
3820 AMDPAL
3821 ------
3822
3823 This section provides code conventions used when the target triple OS is
3824 ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3825 from the application/runtime to each invocation of a hardware shader. These
3826 parameters include both generic, application-controlled parameters called
3827 *user data* as well as system-generated parameters that are a product of the
3828 draw or dispatch execution.
3829
3830 User Data
3831 ~~~~~~~~~
3832
3833 Each hardware stage has a set of 32-bit *user data registers* which can be
3834 written from a command buffer and then loaded into SGPRs when waves are launched
3835 via a subsequent dispatch or draw operation. This is the way most arguments are
3836 passed from the application/runtime to a hardware shader.
3837
3838 Compute User Data
3839 ~~~~~~~~~~~~~~~~~
3840
3841 Compute shader user data mappings are simpler than those of graphics shaders
3842 and are fixed.
3843
3844 Note that there are always 10 available *user data entries* in registers -
3845 entries beyond that limit must be fetched from memory (via the spill table
3846 pointer) by the shader.
3847
3848   .. table:: PAL Compute Shader User Data Registers
3849      :name: pal-compute-user-data-registers
3850
3851      ============= ================================
3852      User Register Description
3853      ============= ================================
3854      0             Global Internal Table (32-bit pointer)
3855      1             Per-Shader Internal Table (32-bit pointer)
3856      2 - 11        Application-Controlled User Data (10 32-bit values)
3857      12            Spill Table (32-bit pointer)
3858      13 - 14       Thread Group Count (64-bit pointer)
3859      15            GDS Range
3860      ============= ================================
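
As a reading aid, the fixed mapping above can be written down as a small
lookup. This is a sketch that assumes *entry* ``i`` refers to the ``i``-th
application-controlled 32-bit value. ``EntryLocation`` and
``locateComputeEntry`` are hypothetical names, and the layout of spilled
entries behind the spill table pointer is not described here, so it is not
modeled.

.. code-block:: c++

  #include <cstdint>

  // User data registers taken from the PAL compute table above.
  constexpr uint32_t kFirstAppUserReg = 2;  // registers 2 - 11
  constexpr uint32_t kNumAppUserRegs = 10;  // 10 32-bit values
  constexpr uint32_t kSpillTableReg = 12;   // 32-bit spill table pointer

  struct EntryLocation {
    bool inRegister;   // true: the value arrives in a user data register
    uint32_t userReg;  // that register, or the spill table pointer register
  };

  // Maps a user data entry index to where the shader finds it.
  EntryLocation locateComputeEntry(uint32_t entry) {
    if (entry < kNumAppUserRegs)
      return {true, kFirstAppUserReg + entry};
    return {false, kSpillTableReg};  // must be fetched from memory
  }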
3861
3862 Graphics User Data
3863 ~~~~~~~~~~~~~~~~~~
3864
3865 Graphics pipelines support a much more flexible user data mapping:
3866
3867   .. table:: PAL Graphics Shader User Data Registers
3868      :name: pal-graphics-user-data-registers
3869
3870      ============= ================================
3871      User Register Description
3872      ============= ================================
3873      0             Global Internal Table (32-bit pointer)
3874      +             Per-Shader Internal Table (32-bit pointer)
3875      + 1-15        Application Controlled User Data
3876                    (1-15 Contiguous 32-bit Values in Registers)
3877      +             Spill Table (32-bit pointer)
3878      +             Draw Index (First Stage Only)
3879      +             Vertex Offset (First Stage Only)
3880      +             Instance Offset (First Stage Only)
3881      ============= ================================
3882
3883   The placement of the global internal table remains fixed in the first *user
3884   data SGPR register*. Otherwise all parameters are optional, and can be mapped
3885   to any desired *user data SGPR register*, with the following restrictions:
3886
3887   * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3888     active hardware stage in a graphics pipeline (i.e. where the API vertex
3889     shader runs).
3890
3891   * Application-controlled user data must be mapped into a contiguous range of
3892     user data registers.
3893
3894   * The application-controlled user data range supports compaction remapping, so
3895     only *entries* that are actually consumed by the shader must be assigned to
3896     corresponding *registers*. Note that in order to support an efficient runtime
3897     implementation, the remapping must pack *registers* in the same order as
3898     *entries*, with unused *entries* removed.
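
The compaction rule in the last bullet can be sketched as follows. The names
``compactUserData`` and ``kUnmapped`` are hypothetical, and the first register
of the range is taken as an input because the text above leaves it to the
mapping.

.. code-block:: c++

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Marker for an entry that the shader does not consume.
  constexpr uint32_t kUnmapped = UINT32_MAX;

  // For each application-controlled entry, returns the user data register it
  // is packed into, or kUnmapped if the entry is unused. Registers are handed
  // out contiguously and in entry order, with unused entries removed.
  std::vector<uint32_t> compactUserData(const std::vector<bool> &entryUsed,
                                        uint32_t firstReg) {
    std::vector<uint32_t> regs(entryUsed.size(), kUnmapped);
    uint32_t next = firstReg;
    for (std::size_t i = 0; i < entryUsed.size(); ++i)
      if (entryUsed[i])
        regs[i] = next++;
    return regs;
  }

With entries 0, 2 and 3 used and ``firstReg`` set to 2, the entries land in
registers 2, 3 and 4, and entry 1 stays unmapped.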
3899
3900 .. _pal_global_internal_table:
3901
3902 Global Internal Table
3903 ~~~~~~~~~~~~~~~~~~~~~
3904
3905 The global internal table is a table of *shader resource descriptors* (SRDs) that
3906 define how certain engine-wide, runtime-managed resources should be accessed
3907 from a shader. The majority of these resources have HW-defined formats, and it
3908 is up to the compiler to write/read data as required by the target hardware.
3909
3910 The following table illustrates the required format:
3911
3912   .. table:: PAL Global Internal Table
3913      :name: pal-git-table
3914
3915      ============= ================================
3916      Offset        Description
3917      ============= ================================
3918      0-3           Graphics Scratch SRD
3919      4-7           Compute Scratch SRD
3920      8-11          ES/GS Ring Output SRD
3921      12-15         ES/GS Ring Input SRD
3922      16-19         GS/VS Ring Output #0
3923      20-23         GS/VS Ring Output #1
3924      24-27         GS/VS Ring Output #2
3925      28-31         GS/VS Ring Output #3
3926      32-35         GS/VS Ring Input SRD
3927      36-39         Tessellation Factor Buffer SRD
3928      40-43         Off-Chip LDS Buffer SRD
3929      44-47         Off-Chip Param Cache Buffer SRD
3930      48-51         Sample Position Buffer SRD
3931      52            vaRange::ShadowDescriptorTable High Bits
3932      ============= ================================
3933
3934   The pointer to the global internal table passed to the shader as user data
3935   is a 32-bit pointer. The top 32 bits should be assumed to be the same as
3936   the top 32 bits of the pipeline, so the shader may use the program
3937   counter's top 32 bits.
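
A minimal sketch of that address reconstruction, assuming the current 64-bit
program counter value is available to the code doing the expansion.
``expandTablePointer`` is a hypothetical helper name.

.. code-block:: c++

  #include <cstdint>

  // Combines the 32-bit table pointer received as user data with the top
  // 32 bits of the program counter to form a full 64-bit address.
  constexpr uint64_t expandTablePointer(uint32_t lowBits,
                                        uint64_t programCounter) {
    return (programCounter & 0xFFFFFFFF00000000ull) | lowBits;
  }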
3938
3939 Unspecified OS
3940 --------------
3941
3942 This section provides code conventions used when the target triple OS is
3943 empty (see :ref:`amdgpu-target-triples`).
3944
3945 Trap Handler ABI
3946 ~~~~~~~~~~~~~~~~
3947
3948 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
3949 runtime does not install a trap handler. The ``llvm.trap`` and
3950 ``llvm.debugtrap`` instructions are handled as follows:
3951
3952   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3953      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3954
3955      =============== =============== ===========================================
3956      Usage           Code Sequence   Description
3957      =============== =============== ===========================================
3958      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3959      llvm.debugtrap  *none*          Compiler warning given that there is no
3960                                      trap handler installed.
3961      =============== =============== ===========================================
3962
3963 Source Languages
3964 ================
3965
3966 .. _amdgpu-opencl:
3967
3968 OpenCL
3969 ------
3970
3971 When the language is OpenCL the following differences occur:
3972
3973 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3974 2. The AMDGPU backend appends additional arguments to the kernel's explicit
3975    arguments for the AMDHSA OS (see
3976    :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3977 3. Additional metadata is generated
3978    (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3979
3980   .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
3981      :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
3982
3983      ======== ==== ========= ===========================================
3984      Position Byte Byte      Description
3985               Size Alignment
3986      ======== ==== ========= ===========================================
3987      1        8    8         OpenCL Global Offset X
3988      2        8    8         OpenCL Global Offset Y
3989      3        8    8         OpenCL Global Offset Z
3990      4        8    8         OpenCL address of printf buffer
3991      5        8    8         OpenCL address of virtual queue used by
3992                              enqueue_kernel.
3993      6        8    8         OpenCL address of AqlWrap struct used by
3994                              enqueue_kernel.
3995      ======== ==== ========= ===========================================
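
Because every implicit argument is 8 bytes with 8-byte alignment, the offsets
follow directly from the table. The sketch below assumes the implicit arguments
start at the end of the explicit argument block rounded up to 8 bytes. That is
the usual consequence of the stated alignment, but it is not spelled out above.
``implicitArgOffset`` is a hypothetical helper.

.. code-block:: c++

  #include <cstdint>

  // position is the 1-based "Position" column of the table (1 to 6).
  // Returns the byte offset of that implicit argument from the start of the
  // kernel argument block.
  constexpr uint64_t implicitArgOffset(uint64_t explicitArgBytes,
                                       uint32_t position) {
    const uint64_t base = (explicitArgBytes + 7) & ~uint64_t{7};  // align to 8
    return base + uint64_t{position - 1} * 8;
  }

For a kernel with 20 bytes of explicit arguments, this places Global Offset X
at byte 24 and the printf buffer address at byte 48.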
3996
3997 .. _amdgpu-hcc:
3998
3999 HCC
4000 ---
4001
4002 When the language is HCC the following differences occur:
4003
4004 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4005
4006 Assembler
4007 ---------
4008
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX9.

This section describes the general syntax for instructions and operands.
4013
4014 Instructions
4015 ~~~~~~~~~~~~
4016
4017 .. toctree::
4018    :hidden:
4019
4020    AMDGPUAsmGFX7
4021    AMDGPUAsmGFX8
4022    AMDGPUAsmGFX9
4023    AMDGPUOperandSyntax
4024
4025 An instruction has the following syntax:
4026
4027     *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4028
4029 Note that operands are normally comma-separated while modifiers are space-separated.
4030
4031 The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
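
The comma/space convention can be illustrated with a small Python sketch. It
is not the assembler's parser: the function name ``split_instruction`` is
hypothetical, and the sketch assumes that every instruction with modifiers
also has at least one operand and that operands are always comma-separated.
Commas inside brackets or parentheses are treated as part of a single token.

.. code-block:: python

  def split_instruction(text):
      """Split one line into (opcode, operands, modifiers)."""
      text = text.split(";", 1)[0]          # drop a trailing comment
      parts = text.split(None, 1)
      if not parts:
          return None
      opcode = parts[0]
      rest = parts[1] if len(parts) > 1 else ""

      tokens = []                           # [token, followed_by_comma]
      current = ""
      depth = 0
      for ch in rest:
          if ch in "[(":
              depth += 1
          elif ch in "])":
              depth -= 1
          if depth == 0 and (ch == "," or ch.isspace()):
              if current:
                  tokens.append([current, False])
                  current = ""
              if ch == "," and tokens:
                  tokens[-1][1] = True
          else:
              current += ch
      if current:
          tokens.append([current, False])

      # Operands run up to the first token that is not followed by a comma;
      # everything after that token is treated as a modifier.
      operands = []
      index = 0
      while index < len(tokens):
          operands.append(tokens[index][0])
          index += 1
          if not tokens[index - 1][1]:
              break
      modifiers = [token for token, _ in tokens[index:]]
      return opcode, operands, modifiers

For ``ds_add_u32 v2, v4 offset:16`` the sketch yields the opcode
``ds_add_u32``, the operands ``v2`` and ``v4``, and the modifier ``offset:16``.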
4032
See the detailed instruction syntax descriptions for :doc:`GFX7<AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4035
4036 Note that features under development are not included in this description.
4037
For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals:
[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
4041
4042 Operands
4043 ~~~~~~~~
4044
4045 The following syntax for register operands is supported:
4046
4047 * SGPR registers: s0, ... or s[0], ...
4048 * VGPR registers: v0, ... or v[0], ...
4049 * TTMP registers: ttmp0, ... or ttmp[0], ...
4050 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4051 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4052 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4053 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4054 * Register index expressions: v[2*2], s[1-1:2-1]
4055 * 'off' indicates that an operand is not enabled
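
To make the indexed and range forms concrete, the following Python sketch
expands a register operand into the individual registers it names. It only
covers the plain, indexed and range forms of SGPR, VGPR and TTMP registers.
Special registers, register lists and index expressions such as ``v[2*2]`` are
deliberately left out. The helper name ``expand_register`` is an assumption
made for this example.

.. code-block:: python

  import re

  # Matches s0, s[0], s[2:3] and the same forms for v and ttmp.
  _REGISTER = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])$")


  def expand_register(operand):
      """Return the list of single registers covered by an operand."""
      match = _REGISTER.match(operand.strip())
      if match is None:
          raise ValueError("unsupported register syntax: " + operand)
      kind, plain, low, high = match.groups()
      if plain is not None:
          first = last = int(plain)
      else:
          first = int(low)
          last = int(high) if high is not None else first
      if last < first:
          raise ValueError("descending register range: " + operand)
      return [kind + str(index) for index in range(first, last + 1)]

In this sketch ``s[2:3]`` expands to ``s2`` and ``s3``, while ``v0`` and
``v[0]`` both name the single register ``v0``.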
4056
4057 Modifiers
4058 ~~~~~~~~~
4059
A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
4061
4062 Instruction Examples
4063 ~~~~~~~~~~~~~~~~~~~~
4064
4065 DS
4066 ++
4067
4068 .. code-block:: nasm
4069
4070   ds_add_u32 v2, v4 offset:16
4071   ds_write_src2_b64 v2 offset0:4 offset1:8
4072   ds_cmpst_f32 v2, v4, v6
4073   ds_min_rtn_f64 v[8:9], v2, v[4:5]
4074
4075
For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.
4077
4078 FLAT
4079 ++++
4080
4081 .. code-block:: nasm
4082
4083   flat_load_dword v1, v[3:4]
4084   flat_store_dwordx3 v[3:4], v[5:7]
4085   flat_atomic_swap v1, v[3:4], v5 glc
4086   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4087   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4088
For the full list of supported instructions, refer to "FLAT instructions" in the ISA manual.
4090
4091 MUBUF
4092 +++++
4093
4094 .. code-block:: nasm
4095
4096   buffer_load_dword v1, off, s[4:7], s1
4097   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4098   buffer_store_format_xy v[1:2], off, s[4:7], s1
4099   buffer_wbinvl1
4100   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4101
For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.
4103
4104 SMRD/SMEM
4105 +++++++++
4106
4107 .. code-block:: nasm
4108
4109   s_load_dword s1, s[2:3], 0xfc
4110   s_load_dwordx8 s[8:15], s[2:3], s4
4111   s_load_dwordx16 s[88:103], s[2:3], s4
4112   s_dcache_inv_vol
4113   s_memtime s[4:5]
4114
For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.
4116
4117 SOP1
4118 ++++
4119
4120 .. code-block:: nasm
4121
4122   s_mov_b32 s1, s2
4123   s_mov_b64 s[0:1], 0x80000000
4124   s_cmov_b32 s1, 200
4125   s_wqm_b64 s[2:3], s[4:5]
4126   s_bcnt0_i32_b64 s1, s[2:3]
4127   s_swappc_b64 s[2:3], s[4:5]
4128   s_cbranch_join s[4:5]
4129
For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.
4131
4132 SOP2
4133 ++++
4134
4135 .. code-block:: nasm
4136
4137   s_add_u32 s1, s2, s3
4138   s_and_b64 s[2:3], s[4:5], s[6:7]
4139   s_cselect_b32 s1, s2, s3
4140   s_andn2_b32 s2, s4, s6
4141   s_lshr_b64 s[2:3], s[4:5], s6
4142   s_ashr_i32 s2, s4, s6
4143   s_bfm_b64 s[2:3], s4, s6
4144   s_bfe_i64 s[2:3], s[4:5], s6
4145   s_cbranch_g_fork s[4:5], s[6:7]
4146
For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.
4148
4149 SOPC
4150 ++++
4151
4152 .. code-block:: nasm
4153
4154   s_cmp_eq_i32 s1, s2
4155   s_bitcmp1_b32 s1, s2
4156   s_bitcmp0_b64 s[2:3], s4
4157   s_setvskip s3, s5
4158
For the full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.
4160
4161 SOPP
4162 ++++
4163
4164 .. code-block:: nasm
4165
  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait until the vmcnt counter is at most 1
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1
4177
For the full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
4183
4184 VALU
4185 ++++
4186
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the operands.
To force a specific encoding, add one of the following suffixes to the opcode:
4190
4191 * _e32 for 32-bit VOP1/VOP2/VOPC
4192 * _e64 for 64-bit VOP3
4193 * _dpp for VOP_DPP
4194 * _sdwa for VOP_SDWA
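
The suffix convention can be summarized in a few lines of Python. The sketch
only inspects the opcode text. It does not know which opcodes exist or which
encodings a given opcode supports, and the name ``forced_encoding`` is
hypothetical.

.. code-block:: python

  # Suffix-to-encoding mapping, copied from the list above.
  _ENCODING_SUFFIXES = {
      "_e32": "32-bit VOP1/VOP2/VOPC",
      "_e64": "64-bit VOP3",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }


  def forced_encoding(opcode):
      """Return (base opcode, forced encoding or None)."""
      for suffix, encoding in _ENCODING_SUFFIXES.items():
          if opcode.endswith(suffix):
              return opcode[: -len(suffix)], encoding
      # No suffix: the choice of encoding is left to the assembler.
      return opcode, None

Here ``v_mov_b32_e32`` is split into the base opcode ``v_mov_b32`` with a
forced 32-bit encoding, while plain ``v_mov_b32`` carries no forced encoding.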
4195
4196 VOP1/VOP2/VOP3/VOPC examples:
4197
4198 .. code-block:: nasm
4199
4200   v_mov_b32 v1, v2
4201   v_mov_b32_e32 v1, v2
4202   v_nop
4203   v_cvt_f64_i32_e32 v[1:2], v2
4204   v_floor_f32_e32 v1, v2
4205   v_bfrev_b32_e32 v1, v2
4206   v_add_f32_e32 v1, v2, v3
4207   v_mul_i32_i24_e64 v1, v2, 3
4208   v_mul_i32_i24_e32 v1, -3, v3
4209   v_mul_i32_i24_e32 v1, -100, v3
4210   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4211   v_max_f16_e32 v1, v2, v3
4212
4213 VOP_DPP examples:
4214
4215 .. code-block:: nasm
4216
4217   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4218   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4219   v_mov_b32 v0, v0 wave_shl:1
4220   v_mov_b32 v0, v0 row_mirror
4221   v_mov_b32 v0, v0 row_bcast:31
4222   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4223   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4224   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4225
4226 VOP_SDWA examples:
4227
4228 .. code-block:: nasm
4229
4230   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4231   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4232   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4233   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4234   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4235
For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.
4237
4238 HSA Code Object Directives
4239 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4240
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
4243
4244 .hsa_code_object_version major, minor
4245 +++++++++++++++++++++++++++++++++++++
4246
4247 *major* and *minor* are integers that specify the version of the HSA code
4248 object that will be generated by the assembler.
4249
4250 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4251 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4252
4253
4254 *major*, *minor*, and *stepping* are all integers that describe the instruction
4255 set architecture (ISA) version of the assembly program.
4256
4257 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
4258 "AMD" and *arch* should always be equal to "AMDGPU".
4259
4260 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4261 from the value of the -mcpu option that is passed to the assembler.
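
The two forms of the directive, bare or with all five arguments, can be
sketched with a small Python helper that produces the directive text. The
helper name ``hsa_code_object_isa_directive`` is hypothetical, and the version
numbers in the usage note are illustrative values only.

.. code-block:: python

  def hsa_code_object_isa_directive(major=None, minor=None, stepping=None):
      """Build the directive text; the arguments are all-or-nothing."""
      version = (major, minor, stepping)
      if all(part is None for part in version):
          # Bare form: the assembler derives everything from -mcpu.
          return ".hsa_code_object_isa"
      if any(part is None for part in version):
          raise ValueError("give major, minor and stepping together")
      # vendor and arch are quoted strings with fixed values.
      return '.hsa_code_object_isa {},{},{},"AMD","AMDGPU"'.format(*version)

Called with ``7, 0, 0`` the helper returns
``.hsa_code_object_isa 7,0,0,"AMD","AMDGPU"``. Called without arguments it
returns the bare directive.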
4262
4263 .amdgpu_hsa_kernel (name)
4264 +++++++++++++++++++++++++
4265
This directive specifies that the symbol with the given name is a kernel entry point
(label) and that the object should contain a corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4268
4269 .amd_kernel_code_t
4270 ++++++++++++++++++
4271
This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. Any
amd_kernel_code_t value that is left unspecified takes a default value. The
default value for all keys is 0, with the following exceptions:
4277
4278 - *kernel_code_version_major* defaults to 1.
4279 - *machine_kind* defaults to 1.
4280 - *machine_version_major*, *machine_version_minor*, and
4281   *machine_version_stepping* are derived from the value of the -mcpu option
4282   that is passed to the assembler.
4283 - *kernel_code_entry_byte_offset* defaults to 256.
4284 - *wavefront_size* defaults to 6.
4285 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4286   *private_segment_alignment* default to 4. Note that alignments are specified
4287   as a power of two, so a value of **n** means an alignment of 2^ **n**.
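
The defaulting rules and the power-of-two alignment encoding can be sketched
in Python as follows. The dictionary covers only the fixed defaults listed
above. The *machine_version_\** keys are omitted because they depend on the
-mcpu option. The names ``FIXED_DEFAULTS``, ``effective_value`` and
``alignment_in_bytes`` are assumptions made for this example.

.. code-block:: python

  # Non-zero defaults from the list above (excluding -mcpu-derived keys).
  FIXED_DEFAULTS = {
      "kernel_code_version_major": 1,
      "machine_kind": 1,
      "kernel_code_entry_byte_offset": 256,
      "wavefront_size": 6,
      "kernarg_segment_alignment": 4,
      "group_segment_alignment": 4,
      "private_segment_alignment": 4,
  }


  def effective_value(specified, key):
      """Value used for a key: explicit setting, listed default, or 0."""
      if key in specified:
          return specified[key]
      return FIXED_DEFAULTS.get(key, 0)


  def alignment_in_bytes(exponent):
      """Alignments are stored as a power of two: n means 2**n bytes."""
      return 1 << exponent

With nothing specified, *kernarg_segment_alignment* resolves to 4, which
corresponds to an alignment of 16 bytes.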
4288
4289 The *.amd_kernel_code_t* directive must be placed immediately after the
4290 function label and before any instructions.
4291
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4294
4295 Here is an example of a minimal amd_kernel_code_t specification:
4296
4297 .. code-block:: none
4298
   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world
4329
4330 Additional Documentation
4331 ========================
4332
4333 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4334 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4335 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4336 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4337 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
4339 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4340 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4341 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4342 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4343 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4344 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4345 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4346 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4347 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4348 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__