docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Target Triples
  27      :name: amdgpu-target-triples-table
  28
  29      ============ ======== ========= ===========
  30      Architecture Vendor   OS        Environment
  31      ============ ======== ========= ===========
  32      r600         amd      <empty>   <empty>
  33      amdgcn       amd      <empty>   <empty>
  34      amdgcn       amd      amdhsa    <empty>
  35      amdgcn       amd      amdhsa    opencl
  36      amdgcn       amd      amdhsa    amdgizcl
  37      amdgcn       amd      amdhsa    amdgiz
  38      amdgcn       amd      amdhsa    hcc
  39      ============ ======== ========= ===========
  40
  41 ``r600-amd--``
  42   Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
  43   the MESA runtime.
  44
  45 ``amdgcn-amd--``
  46   Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
  47   the MESA runtime.
  48
  49 ``amdgcn-amd-amdhsa-``
  50   Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
  51   compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
  52
  53 ``amdgcn-amd-amdhsa-opencl``
  54   Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
  55   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  56   :ref:`amdgpu-opencl`.
  57
  58 ``amdgcn-amd-amdhsa-amdgizcl``
  59   Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
  60   is used (see :ref:`amdgpu-address-spaces`).
  61
  62 ``amdgcn-amd-amdhsa-amdgiz``
  63   Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
  64   used (see :ref:`amdgpu-address-spaces`).
  65
  66 ``amdgcn-amd-amdhsa-hcc``
  67   Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
  68   executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  69   :ref:`amdgpu-hcc`.
  70
  71 .. _amdgpu-processors:
  72
  73 Processors
  74 ----------
  75
  76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  77 names from both the *Processor* and *Alternative Processor* can be used.
  78
  79   .. table:: AMDGPU Processors
  80      :name: amdgpu-processors-table
  81
  82      ========== =========== ============ ===== ======= ==================
  83      Processor  Alternative Target       dGPU/ Runtime Example
  84                 Processor   Triple       APU   Support Products
  85                             Architecture
  86      ========== =========== ============ ===== ======= ==================
  87      **R600** [AMD-R6xx]_
  88      --------------------------------------------------------------------
  89      r600                   r600         dGPU
  90      r630                   r600         dGPU
  91      rs880                  r600         dGPU
  92      rv670                  r600         dGPU
  93      **R700** [AMD-R7xx]_
  94      --------------------------------------------------------------------
  95      rv710                  r600         dGPU
  96      rv730                  r600         dGPU
  97      rv770                  r600         dGPU
  98      **Evergreen** [AMD-Evergreen]_
  99      --------------------------------------------------------------------
 100      cedar                  r600         dGPU
 101      redwood                r600         dGPU
 102      sumo                   r600         dGPU
 103      juniper                r600         dGPU
 104      cypress                r600         dGPU
 105      **Northern Islands** [AMD-Cayman-Trinity]_
 106      --------------------------------------------------------------------
 107      barts                  r600         dGPU
 108      turks                  r600         dGPU
 109      caicos                 r600         dGPU
 110      cayman                 r600         dGPU
 111      **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
 112      --------------------------------------------------------------------
 113      gfx600     - SI        amdgcn       dGPU
 114                 - tahiti
 115      gfx601     - pitcairn  amdgcn       dGPU
 116                 - verde
 117                 - oland
 118                 - hainan
 119      **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
 120      --------------------------------------------------------------------
 121      gfx700     - bonaire   amdgcn       dGPU          - Radeon HD 7790
 122                                                        - Radeon HD 8770
 123                                                        - R7 260
 124                                                        - R7 260X
 125      \          - kaveri    amdgcn       APU           - A6-7000
 126                                                        - A6 Pro-7050B
 127                                                        - A8-7100
 128                                                        - A8 Pro-7150B
 129                                                        - A10-7300
 130                                                        - A10 Pro-7350B
 131                                                        - FX-7500
 132                                                        - A8-7200P
 133                                                        - A10-7400P
 134                                                        - FX-7600P
 135      gfx701     - hawaii    amdgcn       dGPU  ROCm    - FirePro W8100
 136                                                        - FirePro W9100
 137                                                        - FirePro S9150
 138                                                        - FirePro S9170
 139      gfx702                              dGPU  ROCm    - Radeon R9 290
 140                                                        - Radeon R9 290x
 141                                                        - Radeon R390
 142                                                        - Radeon R390x
 143      gfx703     - kabini    amdgcn       APU           - E1-2100
 144                 - mullins                              - E1-2200
 145                                                        - E1-2500
 146                                                        - E2-3000
 147                                                        - E2-3800
 148                                                        - A4-5000
 149                                                        - A4-5100
 150                                                        - A6-5200
 151                                                        - A4 Pro-3340B
 152      **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
 153      --------------------------------------------------------------------
 154      gfx800     - iceland   amdgcn       dGPU          - FirePro S7150
 155                                                        - FirePro S7100
 156                                                        - FirePro W7100
 157                                                        - Radeon R285
 158                                                        - Radeon R9 380
 159                                                        - Radeon R9 385
 160                                                        - Mobile FirePro
 161                                                          M7170
 162      gfx801     - carrizo   amdgcn       APU           - A6-8500P
 163                                                        - Pro A6-8500B
 164                                                        - A8-8600P
 165                                                        - Pro A8-8600B
 166                                                        - FX-8800P
 167                                                        - Pro A12-8800B
 168      \                      amdgcn       APU   ROCm    - A10-8700P
 169                                                        - Pro A10-8700B
 170                                                        - A10-8780P
 171      \                      amdgcn       APU           - A10-9600P
 172                                                        - A10-9630P
 173                                                        - A12-9700P
 174                                                        - A12-9730P
 175                                                        - FX-9800P
 176                                                        - FX-9830P
 177      \                      amdgcn       APU           - E2-9010
 178                                                        - A6-9210
 179                                                        - A9-9410
 180      gfx802     - tonga     amdgcn       dGPU  ROCm    Same as gfx800
 181      gfx803     - fiji      amdgcn       dGPU  ROCm    - Radeon R9 Nano
 182                                                        - Radeon R9 Fury
 183                                                        - Radeon R9 FuryX
 184                                                        - Radeon Pro Duo
 185                                                        - FirePro S9300x2
 186      \          - polaris10 amdgcn       dGPU  ROCm    - Radeon RX 470
 187                                                        - Radeon RX 480
 188      \          - polaris11 amdgcn       dGPU  ROCm    - Radeon RX 460
 189      gfx804                 amdgcn       dGPU          Same as gfx803
 190      gfx810     - stoney    amdgcn       APU
 191      **GCN GFX9**
 192      --------------------------------------------------------------------
 193      gfx900                 amdgcn       dGPU          - FirePro W9500
 194                                                        - FirePro S9500
 195                                                        - FirePro S9500x2
 196      gfx901                 amdgcn       dGPU  ROCm    Same as gfx900
 197                                                        except XNACK is
 198                                                        enabled
 199      gfx902                 amdgcn       APU           *TBA*
 200
 201                                                        .. TODO
 202                                                           Add product
 203                                                           names.
 204      gfx903                 amdgcn       APU           Same as gfx902
 205                                                        except XNACK is
 206                                                        enabled
 207      ========== =========== ============ ===== ======= ==================
 208
 209 .. _amdgpu-address-spaces:
 210
 211 Address Spaces
 212 --------------
 213
 214 The AMDGPU backend uses the following address space mappings.
 215
 216 The memory space names used in the table, aside from the region memory space, is
 217 from the OpenCL standard.
 218
 219 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 220
 221   .. table:: Address Space Mapping
 222      :name: amdgpu-address-space-mapping-table
 223
 224      ================== ================= ================= ================= =================
 225      LLVM Address Space Memory Space
 226      ------------------ -----------------------------------------------------------------------
 227      \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
 228      ================== ================= ================= ================= =================
 229      0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
 230      1                  Global            Global            Global            Global
 231      2                  Constant          Constant          Constant          Region (GDS)
 232      3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
 233      4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
 234      5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
 235      ================== ================= ================= ================= =================
 236
 237 Current Default
 238   This is the current default address space mapping used for all languages
 239   except hcc. This will shortly be deprecated.
 240
 241 amdgiz/amdgizcl
 242   This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
 243   is specified as the target triple environment value.
 244
 245 hcc
 246   This is the current address space mapping used when ``hcc`` is specified as
 247   the target triple environment value.This will shortly be deprecated.
 248
 249 Future Default
 250   This will shortly be the only address space mapping for all languages using
 251   AMDGPU backend.
 252
 253 .. _amdgpu-memory-scopes:
 254
 255 Memory Scopes
 256 -------------
 257
 258 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 259 backend memory model when the target triple OS is ``amdhsa`` (see
 260 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 261
 262 The memory model supported is based on the HSA memory model [HSA]_ which is
 263 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 264 relation is transitive over the synchonizes-with relation independent of scope,
 265 and synchonizes-with allows the memory scope instances to be inclusive (see
 266 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
 267
 268 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 269 inclusion and requires the memory scopes to exactly match. However, this
 270 is conservatively correct for OpenCL.
 271
 272   .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
 273      :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
 274
 275      ================ ==========================================================
 276      LLVM Sync Scope  Description
 277      ================ ==========================================================
 278      *none*           The default: ``system``.
 279
 280                       Synchronizes with, and participates in modification and
 281                       seq_cst total orderings with, other operations (except
 282                       image operations) for all address spaces (except private,
 283                       or generic that accesses private) provided the other
 284                       operation's sync scope is:
 285
 286                       - ``system``.
 287                       - ``agent`` and executed by a thread on the same agent.
 288                       - ``workgroup`` and executed by a thread in the same
 289                         workgroup.
 290                       - ``wavefront`` and executed by a thread in the same
 291                         wavefront.
 292
 293      ``agent``        Synchronizes with, and participates in modification and
 294                       seq_cst total orderings with, other operations (except
 295                       image operations) for all address spaces (except private,
 296                       or generic that accesses private) provided the other
 297                       operation's sync scope is:
 298
 299                       - ``system`` or ``agent`` and executed by a thread on the
 300                         same agent.
 301                       - ``workgroup`` and executed by a thread in the same
 302                         workgroup.
 303                       - ``wavefront`` and executed by a thread in the same
 304                         wavefront.
 305
 306      ``workgroup``    Synchronizes with, and participates in modification and
 307                       seq_cst total orderings with, other operations (except
 308                       image operations) for all address spaces (except private,
 309                       or generic that accesses private) provided the other
 310                       operation's sync scope is:
 311
 312                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 313                         thread in the same workgroup.
 314                       - ``wavefront`` and executed by a thread in the same
 315                         wavefront.
 316
 317      ``wavefront``    Synchronizes with, and participates in modification and
 318                       seq_cst total orderings with, other operations (except
 319                       image operations) for all address spaces (except private,
 320                       or generic that accesses private) provided the other
 321                       operation's sync scope is:
 322
 323                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 324                         and executed by a thread in the same wavefront.
 325
 326      ``singlethread`` Only synchronizes with, and participates in modification
 327                       and seq_cst total orderings with, other operations (except
 328                       image operations) running in the same thread for all
 329                       address spaces (for example, in signal handlers).
 330      ================ ==========================================================
 331
 332 AMDGPU Intrinsics
 333 -----------------
 334
 335 The AMDGPU backend implements the following intrinsics.
 336
 337 *This section is WIP.*
 338
 339 .. TODO
 340    List AMDGPU intrinsics
 341
 342 Code Object
 343 ===========
 344
 345 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 346 can be linked by ``lld`` to produce a standard ELF shared code object which can
 347 be loaded and executed on an AMDGPU target.
 348
 349 Header
 350 ------
 351
 352 The AMDGPU backend uses the following ELF header:
 353
 354   .. table:: AMDGPU ELF Header
 355      :name: amdgpu-elf-header-table
 356
 357      ========================== =========================
 358      Field                      Value
 359      ========================== =========================
 360      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 361      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 362      ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA``
 363      ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
 364      ``e_type``                 ``ET_REL`` or ``ET_DYN``
 365      ``e_machine``              ``EM_AMDGPU``
 366      ``e_entry``                0
 367      ``e_flags``                0
 368      ========================== =========================
 369
 370 ..
 371
 372   .. table:: AMDGPU ELF Header Enumeration Values
 373      :name: amdgpu-elf-header-enumeration-values-table
 374
 375      ============================ =====
 376      Name                         Value
 377      ============================ =====
 378      ``EM_AMDGPU``                224
 379      ``ELFOSABI_AMDGPU_HSA``      64
 380      ``ELFABIVERSION_AMDGPU_HSA`` 1
 381      ============================ =====
 382
 383 ``e_ident[EI_CLASS]``
 384   The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
 385   applications.
 386
 387 ``e_ident[EI_DATA]``
 388   All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
 389
 390 ``e_ident[EI_OSABI]``
 391   The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
 392   specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
 393
 394 ``e_ident[EI_ABIVERSION]``
 395   The AMD GPU architecture specific OS ABI version of
 396   ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
 397   ABI to which the code object conforms.
 398
 399 ``e_type``
 400   Can be one of the following values:
 401
 402
 403   ``ET_REL``
 404     The type produced by the AMD GPU backend compiler as it is relocatable code
 405     object.
 406
 407   ``ET_DYN``
 408     The type produced by the linker as it is a shared code object.
 409
 410   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 411
 412 ``e_machine``
 413   The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
 414   architecture family. The specific member is specified in the
 415   ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
 416   :ref:`amdgpu-note-records`).
 417
 418 ``e_entry``
 419   The entry point is 0 as the entry points for individual kernels must be
 420   selected in order to invoke them through AQL packets.
 421
 422 ``e_flags``
 423   The value is 0 as no flags are used.
 424
 425 Sections
 426 --------
 427
 428 An AMDGPU target ELF code object has the standard ELF sections which include:
 429
 430   .. table:: AMDGPU ELF Sections
 431      :name: amdgpu-elf-sections-table
 432
 433      ================== ================ =================================
 434      Name               Type             Attributes
 435      ================== ================ =================================
 436      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 437      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 438      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 439      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 440      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 441      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 442      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 443      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 444      ``.note``          ``SHT_NOTE``     *none*
 445      ``.rela``\ *name*  ``SHT_RELA``     *none*
 446      ``.rela.dyn``      ``SHT_RELA``     *none*
 447      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 448      ``.shstrtab``      ``SHT_STRTAB``   *none*
 449      ``.strtab``        ``SHT_STRTAB``   *none*
 450      ``.symtab``        ``SHT_SYMTAB``   *none*
 451      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 452      ================== ================ =================================
 453
 454 These sections have their standard meanings (see [ELF]_) and are only generated
 455 if needed.
 456
 457 ``.debug``\ *\**
 458   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 459   DWARF produced by the AMDGPU backend.
 460
 461 ``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
 462   The standard sections used by a dynamic loader.
 463
 464 ``.note``
 465   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 466   backend.
 467
 468 ``.rela``\ *name*, ``.rela.dyn``
 469   For relocatable code objects, *name* is the name of the section that the
 470   relocation records apply. For example, ``.rela.text`` is the section name for
 471   relocation records associated with the ``.text`` section.
 472
 473   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 474   records from each of the relocatable code object's ``.rela``\ *name* sections.
 475
 476   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 477   the AMDGPU backend.
 478
 479 ``.text``
 480   The executable machine code for the kernels and functions they call. Generated
 481   as position independent code. See :ref:`amdgpu-code-conventions` for
 482   information on conventions used in the isa generation.
 483
 484 .. _amdgpu-note-records:
 485
 486 Note Records
 487 ------------
 488
 489 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
 490 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
 491 minimal zero byte padding must be generated to ensure the ``desc`` field size is
 492 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
 493 be at least 4 to indicate at least 8 byte alignment.
 494
 495 The AMDGPU backend code object uses the following ELF note records in the
 496 ``.note`` section. The *Description* column specifies the layout of the note
 497 record’s ``desc`` field. All fields are consecutive bytes. Note records with
 498 variable size strings have a corresponding ``*_size`` field that specifies the
 499 number of bytes, including the terminating null character, in the string. The
 500 string(s) come immediately after the preceding fields.
 501
 502 Additional note records can be present.
 503
 504   .. table:: AMDGPU ELF Note Records
 505      :name: amdgpu-elf-note-records-table
 506
 507      ===== ========================== ==========================================
 508      Name  Type                       Description
 509      ===== ========================== ==========================================
 510      "AMD" ``NT_AMD_AMDGPU_METADATA`` <metadata null terminated string>
 511      "AMD" ``NT_AMD_AMDGPU_ISA``      <isa name null terminated string>
 512      ===== ========================== ==========================================
 513
 514 ..
 515
 516   .. table:: AMDGPU ELF Note Record Enumeration Values
 517      :name: amdgpu-elf-note-record-enumeration-values-table
 518
 519      ============================= =====
 520      Name                          Value
 521      ============================= =====
 522      *reserved*                    0-9
 523      ``NT_AMD_AMDGPU_METADATA``    10
 524      ``NT_AMD_AMDGPU_ISA``         11
 525      ============================= =====
 526
 527 ``NT_AMD_AMDGPU_ISA``
 528   Specifies the instruction set architecture used by the machine code contained
 529   in the code object.
 530
 531   This note record is required for code objects containing machine code for
 532   processors matching the ``amdgcn`` architecture in table
 533   :ref:`amdgpu-processors`.
 534
 535   The null terminated string has the following syntax:
 536
 537     *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
 538
 539   where:
 540
 541     *architecture*
 542       The architecture from table :ref:`amdgpu-target-triples-table`.
 543
 544       This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
 545       :ref:`amdgpu-target-triples`).
 546
 547     *vendor*
 548       The vendor from table :ref:`amdgpu-target-triples-table`.
 549
 550       For the AMDGPU backend this is always ``amd``.
 551
 552     *os*
 553       The OS from table :ref:`amdgpu-target-triples-table`.
 554
 555     *environment*
 556       An environment from table :ref:`amdgpu-target-triples-table`, or blank if
 557       the environment has no affect on the execution of the code object.
 558
 559       For the AMDGPU backend this is currently always blank.
 560     *processor*
 561       The processor from table :ref:`amdgpu-processors-table`.
 562
 563   For example:
 564
 565     ``amdgcn-amd-amdhsa--gfx901``
 566
 567 ``NT_AMD_AMDGPU_METADATA``
 568   Specifies extensible metadata associated with the code object. See
 569   :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
 570   string.
 571
 572   This note record is required and must contain the minimum information
 573   necessary to support the ROCM kernel queries. For example, the segment sizes
 574   needed in a dispatch packet. In addition, a high level language runtime may
 575   require other information to be included. For example, the AMD OpenCL runtime
 576   records kernel argument information.
 577
 578   .. TODO
 579      Is the string null terminated? It probably should not if YAML allows it to
 580      contain null characters, otherwise it should be.
 581
 582 .. _amdgpu-code-object-metadata:
 583
 584 Code Object Metadata
 585 --------------------
 586
 587 The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
 588 record (see :ref:`amdgpu-note-records`).
 589
 590 The metadata is specified as a YAML formatted string (see [YAML]_ and
 591 :doc:`YamlIO`).
 592
 593 The metadata is represented as a single YAML document comprised of the mapping
 594 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 595 referenced tables.
 596
 597 For boolean values, the string values of ``false`` and ``true`` are used for
 598 false and true respectively.
 599
 600 Additional information can be added to the mappings. To avoid conflicts, any
 601 non-AMD key names should be prefixed by "*vendor-name*.".
 602
 603   .. table:: AMDHSA Code Object Metadata Mapping
 604      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 605
 606      ========== ============== ========= =======================================
 607      String Key Value Type     Required? Description
 608      ========== ============== ========= =======================================
 609      "Version"  sequence of    Required  - The first integer is the major
 610                 2 integers                 version. Currently 1.
 611                                          - The second integer is the minor
 612                                            version. Currently 0.
 613      "Printf"   sequence of              Each string is encoded information
 614                 strings                  about a printf function call. The
 615                                          encoded information is organized as
 616                                          fields separated by colon (':'):
 617
 618                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
 619
 620                                          where:
 621
 622                                          ``ID``
 623                                            A 32 bit integer as a unique id for
 624                                            each printf function call
 625
 626                                          ``N``
 627                                            A 32 bit integer equal to the number
 628                                            of arguments of printf function call
 629                                            minus 1
 630
 631                                          ``S[i]`` (where i = 0, 1, ... , N-1)
 632                                            32 bit integers for the size in bytes
 633                                            of the i-th FormatString argument of
 634                                            the printf function call
 635
 636                                          FormatString
 637                                            The format string passed to the
 638                                            printf function call.
 639      "Kernels"  sequence of    Required  Sequence of the mappings for each
 640                 mapping                  kernel in the code object. See
 641                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
 642                                          for the definition of the mapping.
 643      ========== ============== ========= =======================================
 644
 645 ..
 646
 647   .. table:: AMDHSA Code Object Kernel Metadata Mapping
 648      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
 649
 650      ================= ============== ========= ================================
 651      String Key        Value Type     Required? Description
 652      ================= ============== ========= ================================
 653      "Name"            string         Required  Source name of the kernel.
 654      "SymbolName"      string         Required  Name of the kernel
 655                                                 descriptor ELF symbol.
 656      "Language"        string                   Source language of the kernel.
 657                                                 Values include:
 658
 659                                                 - "OpenCL C"
 660                                                 - "OpenCL C++"
 661                                                 - "HCC"
 662                                                 - "OpenMP"
 663
 664      "LanguageVersion" sequence of              - The first integer is the major
 665                        2 integers                 version.
 666                                                 - The second integer is the
 667                                                   minor version.
 668      "Attrs"           mapping                  Mapping of kernel attributes.
 669                                                 See
 670                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
 671                                                 for the mapping definition.
 672      "Arguments"       sequence of              Sequence of mappings of the
 673                        mapping                  kernel arguments. See
 674                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
 675                                                 for the definition of the mapping.
 676      "CodeProps"       mapping                  Mapping of properties related to
 677                                                 the kernel code. See
 678                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
 679                                                 for the mapping definition.
 680      "DebugProps"      mapping                  Mapping of properties related to
 681                                                 the kernel debugging. See
 682                                                 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
 683                                                 for the mapping definition.
 684      ================= ============== ========= ================================
 685
 686 ..
 687
 688   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
 689      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
 690
 691      =================== ============== ========= ==============================
 692      String Key          Value Type     Required? Description
 693      =================== ============== ========= ==============================
 694      "ReqdWorkGroupSize" sequence of              The dispatch work-group size
 695                          3 integers               X, Y, Z must correspond to the
 696                                                   specified values.
 697
 698                                                   Corresponds to the OpenCL
 699                                                   ``reqd_work_group_size``
 700                                                   attribute.
 701      "WorkGroupSizeHint" sequence of              The dispatch work-group size
 702                          3 integers               X, Y, Z is likely to be the
 703                                                   specified values.
 704
 705                                                   Corresponds to the OpenCL
 706                                                   ``work_group_size_hint``
 707                                                   attribute.
 708      "VecTypeHint"       string                   The name of a scalar or vector
 709                                                   type.
 710
 711                                                   Corresponds to the OpenCL
 712                                                   ``vec_type_hint`` attribute.
 713      =================== ============== ========= ==============================
 714
 715 ..
 716
 717   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
 718      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
 719
 720      ================= ============== ========= ================================
 721      String Key        Value Type     Required? Description
 722      ================= ============== ========= ================================
 723      "Name"            string                   Kernel argument name.
 724      "TypeName"        string                   Kernel argument type name.
 725      "Size"            integer        Required  Kernel argument size in bytes.
 726      "Align"           integer        Required  Kernel argument alignment in
 727                                                 bytes. Must be a power of two.
 728      "ValueKind"       string         Required  Kernel argument kind that
 729                                                 specifies how to set up the
 730                                                 corresponding argument.
 731                                                 Values include:
 732
 733                                                 "ByValue"
 734                                                   The argument is copied
 735                                                   directly into the kernarg.
 736
 737                                                 "GlobalBuffer"
 738                                                   A global address space pointer
 739                                                   to the buffer data is passed
 740                                                   in the kernarg.
 741
 742                                                 "DynamicSharedPointer"
 743                                                   A group address space pointer
 744                                                   to dynamically allocated LDS
 745                                                   is passed in the kernarg.
 746
 747                                                 "Sampler"
 748                                                   A global address space
 749                                                   pointer to a S# is passed in
 750                                                   the kernarg.
 751
 752                                                 "Image"
 753                                                   A global address space
 754                                                   pointer to a T# is passed in
 755                                                   the kernarg.
 756
 757                                                 "Pipe"
 758                                                   A global address space pointer
 759                                                   to an OpenCL pipe is passed in
 760                                                   the kernarg.
 761
 762                                                 "Queue"
 763                                                   A global address space pointer
 764                                                   to an OpenCL device enqueue
 765                                                   queue is passed in the
 766                                                   kernarg.
 767
 768                                                 "HiddenGlobalOffsetX"
 769                                                   The OpenCL grid dispatch
 770                                                   global offset for the X
 771                                                   dimension is passed in the
 772                                                   kernarg.
 773
 774                                                 "HiddenGlobalOffsetY"
 775                                                   The OpenCL grid dispatch
 776                                                   global offset for the Y
 777                                                   dimension is passed in the
 778                                                   kernarg.
 779
 780                                                 "HiddenGlobalOffsetZ"
 781                                                   The OpenCL grid dispatch
 782                                                   global offset for the Z
 783                                                   dimension is passed in the
 784                                                   kernarg.
 785
 786                                                 "HiddenNone"
 787                                                   An argument that is not used
 788                                                   by the kernel. Space needs to
 789                                                   be left for it, but it does
 790                                                   not need to be set up.
 791
 792                                                 "HiddenPrintfBuffer"
 793                                                   A global address space pointer
 794                                                   to the runtime printf buffer
 795                                                   is passed in kernarg.
 796
 797                                                 "HiddenDefaultQueue"
 798                                                   A global address space pointer
 799                                                   to the OpenCL device enqueue
 800                                                   queue that should be used by
 801                                                   the kernel by default is
 802                                                   passed in the kernarg.
 803
 804                                                 "HiddenCompletionAction"
 805                                                   *TBD*
 806
 807                                                   .. TODO
 808                                                      Add description.
 809
 810      "ValueType"       string         Required  Kernel argument value type. Only
 811                                                 present if "ValueKind" is
 812                                                 "ByValue". For vector data
 813                                                 types, the value is for the
 814                                                 element type. Values include:
 815
 816                                                 - "Struct"
 817                                                 - "I8"
 818                                                 - "U8"
 819                                                 - "I16"
 820                                                 - "U16"
 821                                                 - "F16"
 822                                                 - "I32"
 823                                                 - "U32"
 824                                                 - "F32"
 825                                                 - "I64"
 826                                                 - "U64"
 827                                                 - "F64"
 828
 829                                                 .. TODO
 830                                                    How can it be determined if a
 831                                                    vector type, and what size
 832                                                    vector?
 833      "PointeeAlign"    integer                  Alignment in bytes of pointee
 834                                                 type for pointer type kernel
 835                                                 argument. Must be a power
 836                                                 of 2. Only present if
 837                                                 "ValueKind" is
 838                                                 "DynamicSharedPointer".
 839      "AddrSpaceQual"   string                   Kernel argument address space
 840                                                 qualifier. Only present if
 841                                                 "ValueKind" is "GlobalBuffer" or
 842                                                 "DynamicSharedPointer". Values
 843                                                 are:
 844
 845                                                 - "Private"
 846                                                 - "Global"
 847                                                 - "Constant"
 848                                                 - "Local"
 849                                                 - "Generic"
 850                                                 - "Region"
 851
 852                                                 .. TODO
 853                                                    Is GlobalBuffer only Global
 854                                                    or Constant? Is
 855                                                    DynamicSharedPointer always
 856                                                    Local? Can HCC allow Generic?
 857                                                    How can Private or Region
 858                                                    ever happen?
 859      "AccQual"         string                   Kernel argument access
 860                                                 qualifier. Only present if
 861                                                 "ValueKind" is "Image" or
 862                                                 "Pipe". Values
 863                                                 are:
 864
 865                                                 - "ReadOnly"
 866                                                 - "WriteOnly"
 867                                                 - "ReadWrite"
 868
 869                                                 .. TODO
 870                                                    Does this apply to
 871                                                    GlobalBuffer?
 872      "ActualAcc"       string                   The actual memory accesses
 873                                                 performed by the kernel on the
 874                                                 kernel argument. Only present if
 875                                                 "ValueKind" is "GlobalBuffer",
 876                                                 "Image", or "Pipe". This may be
 877                                                 more restrictive than indicated
 878                                                 by "AccQual" to reflect what the
 879                                                 kernel actual does. If not
 880                                                 present then the runtime must
 881                                                 assume what is implied by
 882                                                 "AccQual" and "IsConst". Values
 883                                                 are:
 884
 885                                                 - "ReadOnly"
 886                                                 - "WriteOnly"
 887                                                 - "ReadWrite"
 888
 889      "IsConst"         boolean                  Indicates if the kernel argument
 890                                                 is const qualified. Only present
 891                                                 if "ValueKind" is
 892                                                 "GlobalBuffer".
 893
 894      "IsRestrict"      boolean                  Indicates if the kernel argument
 895                                                 is restrict qualified. Only
 896                                                 present if "ValueKind" is
 897                                                 "GlobalBuffer".
 898
 899      "IsVolatile"      boolean                  Indicates if the kernel argument
 900                                                 is volatile qualified. Only
 901                                                 present if "ValueKind" is
 902                                                 "GlobalBuffer".
 903
 904      "IsPipe"          boolean                  Indicates if the kernel argument
 905                                                 is pipe qualified. Only present
 906                                                 if "ValueKind" is "Pipe".
 907
 908                                                 .. TODO
 909                                                    Can GlobalBuffer be pipe
 910                                                    qualified?
 911      ================= ============== ========= ================================
 912
 913 ..
 914
 915   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
 916      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
 917
 918      ============================ ============== ========= =====================
 919      String Key                   Value Type     Required? Description
 920      ============================ ============== ========= =====================
 921      "KernargSegmentSize"         integer        Required  The size in bytes of
 922                                                            the kernarg segment
 923                                                            that holds the values
 924                                                            of the arguments to
 925                                                            the kernel.
 926      "GroupSegmentFixedSize"      integer        Required  The amount of group
 927                                                            segment memory
 928                                                            required by a
 929                                                            work-group in
 930                                                            bytes. This does not
 931                                                            include any
 932                                                            dynamically allocated
 933                                                            group segment memory
 934                                                            that may be added
 935                                                            when the kernel is
 936                                                            dispatched.
 937      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
 938                                                            private address space
 939                                                            memory required for a
 940                                                            work-item in
 941                                                            bytes. If
 942                                                            IsDynamicCallstack
 943                                                            is 1 then additional
 944                                                            space must be added
 945                                                            to this value for the
 946                                                            call stack.
 947      "KernargSegmentAlign"        integer        Required  The maximum byte
 948                                                            alignment of
 949                                                            arguments in the
 950                                                            kernarg segment. Must
 951                                                            be a power of 2.
 952      "WavefrontSize"              integer        Required  Wavefront size. Must
 953                                                            be a power of 2.
 954      "NumSGPRs"                   integer                  Number of scalar
 955                                                            registers used by a
 956                                                            wavefront for
 957                                                            GFX6-GFX9. This
 958                                                            includes the special
 959                                                            SGPRs for VCC, Flat
 960                                                            Scratch (GFX7-GFX9)
 961                                                            and XNACK (for
 962                                                            GFX8-GFX9). It does
 963                                                            not include the 16
 964                                                            SGPR added if a trap
 965                                                            handler is
 966                                                            enabled. It is not
 967                                                            rounded up to the
 968                                                            allocation
 969                                                            granularity.
 970      "NumVGPRs"                   integer                  Number of vector
 971                                                            registers used by
 972                                                            each work-item for
 973                                                            GFX6-GFX9
 974      "MaxFlatWorkgroupSize"       integer                  Maximum flat
 975                                                            work-group size
 976                                                            supported by the
 977                                                            kernel in work-items.
 978      "IsDynamicCallStack"         boolean                  Indicates if the
 979                                                            generated machine
 980                                                            code is using a
 981                                                            dynamically sized
 982                                                            call stack.
 983      "IsXNACKEnabled"             boolean                  Indicates if the
 984                                                            generated machine
 985                                                            code is capable of
 986                                                            supporting XNACK.
 987      ============================ ============== ========= =====================
 988
 989 ..
 990
 991   .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
 992      :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
 993
 994      =================================== ============== ========= ==============
 995      String Key                          Value Type     Required? Description
 996      =================================== ============== ========= ==============
 997      "DebuggerABIVersion"                string
 998      "ReservedNumVGPRs"                  integer
 999      "ReservedFirstVGPR"                 integer
1000      "PrivateSegmentBufferSGPR"          integer
1001      "WavefrontPrivateSegmentOffsetSGPR" integer
1002      =================================== ============== ========= ==============
1003
1004 .. TODO
1005    Plan to remove the debug properties metadata.
1006
1007 .. _amdgpu-symbols:
1008
1009 Symbols
1010 -------
1011
1012 Symbols include the following:
1013
1014   .. table:: AMDGPU ELF Symbols
1015      :name: amdgpu-elf-symbols-table
1016
1017      ===================== ============== ============= ==================
1018      Name                  Type           Section       Description
1019      ===================== ============== ============= ==================
1020      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
1021                                           - ``.rodata``
1022                                           - ``.bss``
1023      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1024      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
1025      ===================== ============== ============= ==================
1026
1027 Global variable
1028   Global variables both used and defined by the compilation unit.
1029
1030   If the symbol is defined in the compilation unit then it is allocated in the
1031   appropriate section according to if it has initialized data or is readonly.
1032
1033   If the symbol is external then its section is ``STN_UNDEF`` and the loader
1034   will resolve relocations using the definition provided by another code object
1035   or explicitly defined by the runtime.
1036
1037   All global symbols, whether defined in the compilation unit or external, are
1038   accessed by the machine code indirectly through a GOT table entry. This
1039   allows them to be preemptable. The GOT table is only supported when the target
1040   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1041
1042   .. TODO
1043      Add description of linked shared object symbols. Seems undefined symbols
1044      are marked as STT_NOTYPE.
1045
1046 Kernel descriptor
1047   Every HSA kernel has an associated kernel descriptor. It is the address of the
1048   kernel descriptor that is used in the AQL dispatch packet used to invoke the
1049   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1050   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1051
1052 Kernel entry point
1053   Every HSA kernel also has a symbol for its machine code entry point.
1054
1055 .. _amdgpu-relocation-records:
1056
1057 Relocation Records
1058 ------------------
1059
1060 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1061 relocatable fields are:
1062
1063 ``word32``
1064   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1065   alignment. These values use the same byte order as other word values in the
1066   AMD GPU architecture.
1067
1068 ``word64``
1069   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1070   alignment. These values use the same byte order as other word values in the
1071   AMD GPU architecture.
1072
1073 Following notations are used for specifying relocation calculations:
1074
1075 **A**
1076   Represents the addend used to compute the value of the relocatable field.
1077
1078 **G**
1079   Represents the offset into the global offset table at which the relocation
1080   entry’s symbol will reside during execution.
1081
1082 **GOT**
1083   Represents the address of the global offset table.
1084
1085 **P**
1086   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
1087   of the storage unit being relocated (computed using ``r_offset``).
1088
1089 **S**
1090   Represents the value of the symbol whose index resides in the relocation
1091   entry.
1092
1093 The following relocation types are supported:
1094
1095   .. table:: AMDGPU ELF Relocation Records
1096      :name: amdgpu-elf-relocation-records-table
1097
1098      ==========================  =====  ==========  ==============================
1099      Relocation Type             Value  Field       Calculation
1100      ==========================  =====  ==========  ==============================
1101      ``R_AMDGPU_NONE``           0      *none*      *none*
1102      ``R_AMDGPU_ABS32_LO``       1      ``word32``  (S + A) & 0xFFFFFFFF
1103      ``R_AMDGPU_ABS32_HI``       2      ``word32``  (S + A) >> 32
1104      ``R_AMDGPU_ABS64``          3      ``word64``  S + A
1105      ``R_AMDGPU_REL32``          4      ``word32``  S + A - P
1106      ``R_AMDGPU_REL64``          5      ``word64``  S + A - P
1107      ``R_AMDGPU_ABS32``          6      ``word32``  S + A
1108      ``R_AMDGPU_GOTPCREL``       7      ``word32``  G + GOT + A - P
1109      ``R_AMDGPU_GOTPCREL32_LO``  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1110      ``R_AMDGPU_GOTPCREL32_HI``  9      ``word32``  (G + GOT + A - P) >> 32
1111      ``R_AMDGPU_REL32_LO``       10     ``word32``  (S + A - P) & 0xFFFFFFFF
1112      ``R_AMDGPU_REL32_HI``       11     ``word32``  (S + A - P) >> 32
1113      ==========================  =====  ==========  ==============================
1114
1115 .. _amdgpu-dwarf:
1116
1117 DWARF
1118 -----
1119
1120 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
1121 information that maps the code object executable code and data to the source
1122 language constructs. It can be used by tools such as debuggers and profilers.
1123
1124 Address Space Mapping
1125 ~~~~~~~~~~~~~~~~~~~~~
1126
1127 The following address space mapping is used:
1128
1129   .. table:: AMDGPU DWARF Address Space Mapping
1130      :name: amdgpu-dwarf-address-space-mapping-table
1131
1132      =================== =================
1133      DWARF Address Space Memory Space
1134      =================== =================
1135      1                   Private (Scratch)
1136      2                   Local (group/LDS)
1137      *omitted*           Global
1138      *omitted*           Constant
1139      *omitted*           Generic (Flat)
1140      *not supported*     Region (GDS)
1141      =================== =================
1142
1143 See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
1144 used in the table.
1145
1146 An ``address_class`` attribute is generated on pointer type DIEs to specify the
1147 DWARF address space of the value of the pointer when it is in the *private* or
1148 *local* address space. Otherwise the attribute is omitted.
1149
1150 An ``XDEREF`` operation is generated in location list expressions for variables
1151 that are allocated in the *private* and *local* address space. Otherwise no
1152 ``XDREF`` is omitted.
1153
1154 Register Mapping
1155 ~~~~~~~~~~~~~~~~
1156
1157 *This section is WIP.*
1158
1159 .. TODO
1160    Define DWARF register enumeration.
1161
1162    If want to present a wavefront state then should expose vector registers as
1163    64 wide (rather than per work-item view that LLVM uses). Either as separate
1164    registers, or a 64x4 byte single register. In either case use a new LANE op
1165    (akin to XDREF) to select the current lane usage in a location
1166    expression. This would also allow scalar register spilling to vector register
1167    lanes to be expressed (currently no debug information is being generated for
1168    spilling). If choose a wide single register approach then use LANE in
1169    conjunction with PIECE operation to select the dword part of the register for
1170    the current lane. If the separate register approach then use LANE to select
1171    the register.
1172
1173 Source Text
1174 ~~~~~~~~~~~
1175
1176 *This section is WIP.*
1177
1178 .. TODO
1179    DWARF extension to include runtime generated source text.
1180
1181 .. _amdgpu-code-conventions:
1182
1183 Code Conventions
1184 ================
1185
1186 AMDHSA
1187 ------
1188
1189 This section provides code conventions used when the target triple OS is
1190 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1191
1192 Kernel Dispatch
1193 ~~~~~~~~~~~~~~~
1194
1195 The HSA architected queuing language (AQL) defines a user space memory interface
1196 that can be used to control the dispatch of kernels, in an agent independent
1197 way. An agent can have zero or more AQL queues created for it using the ROCm
1198 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1199 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1200 mechanics and packet layouts.
1201
1202 The packet processor of a kernel agent is responsible for detecting and
1203 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1204 packet processor is implemented by the hardware command processor (CP),
1205 asynchronous dispatch controller (ADC) and shader processor input controller
1206 (SPI).
1207
1208 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1209 mode driver to initialize and register the AQL queue with CP.
1210
1211 To dispatch a kernel the following actions are performed. This can occur in the
1212 CPU host program, or from an HSA kernel executing on a GPU.
1213
1214 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1215    executed is obtained.
1216 2. A pointer to the kernel descriptor (see
1217    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1218    obtained. It must be for a kernel that is contained in a code object that that
1219    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1220    associated.
1221 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1222    for a memory region with the kernarg property for the kernel agent that will
1223    execute the kernel. It must be at least 16 byte aligned.
1224 4. Kernel argument values are assigned to the kernel argument memory
1225    allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1226    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1227    memory in the same way constant memory is accessed. (Note that the HSA
1228    specification allows an implementation to copy the kernel argument contents to
1229    another location that is accessed by the kernel.)
1230 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1231    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1232    packet. The packet must be set up, and the final write must use an atomic
1233    store release to set the packet kind to ensure the packet contents are
1234    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1235    notify the kernel agent that the AQL queue has been updated. These rules, and
1236    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1237    System Architecture Specification* [HSA]_.
1238 6. A kernel dispatch packet includes information about the actual dispatch,
1239    such as grid and work-group size, together with information from the code
1240    object about the kernel, such as segment sizes. The ROCm runtime queries on
1241    the kernel symbol can be used to obtain the code object values which are
1242    recorded in the :ref:`amdgpu-code-object-metadata`.
1243 7. CP executes micro-code and is responsible for detecting and setting up the
1244    GPU to execute the wavefronts of a kernel dispatch.
1245 8. CP ensures that when the a wavefront starts executing the kernel machine
1246    code, the scalar general purpose registers (SGPR) and vector general purpose
1247    registers (VGPR) are set up as required by the machine code. The required
1248    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1249    register state is defined in
1250    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1251 9. The prolog of the kernel machine code (see
1252    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1253    before continuing executing the machine code that corresponds to the kernel.
1254 10. When the kernel dispatch has completed execution, CP signals the completion
1255     signal specified in the kernel dispatch packet if not 0.
1256
1257 .. _amdgpu-amdhsa-memory-spaces:
1258
1259 Memory Spaces
1260 ~~~~~~~~~~~~~
1261
1262 The memory space properties are:
1263
1264   .. table:: AMDHSA Memory Spaces
1265      :name: amdgpu-amdhsa-memory-spaces-table
1266
1267      ================= =========== ======== ======= ==================
1268      Memory Space Name HSA Segment Hardware Address NULL Value
1269                        Name        Name     Size
1270      ================= =========== ======== ======= ==================
1271      Private           private     scratch  32      0x00000000
1272      Local             group       LDS      32      0xFFFFFFFF
1273      Global            global      global   64      0x0000000000000000
1274      Constant          constant    *same as 64      0x0000000000000000
1275                                    global*
1276      Generic           flat        flat     64      0x0000000000000000
1277      Region            N/A         GDS      32      *not implemented
1278                                                     for AMDHSA*
1279      ================= =========== ======== ======= ==================
1280
1281 The global and constant memory spaces both use global virtual addresses, which
1282 are the same virtual address space used by the CPU. However, some virtual
1283 addresses may only be accessible to the CPU, some only accessible by the GPU,
1284 and some by both.
1285
1286 Using the constant memory space indicates that the data will not change during
1287 the execution of the kernel. This allows scalar read instructions to be
1288 used. The vector and scalar L1 caches are invalidated of volatile data before
1289 each kernel dispatch execution to allow constant memory to change values between
1290 kernel dispatches.
1291
1292 The local memory space uses the hardware Local Data Store (LDS) which is
1293 automatically allocated when the hardware creates work-groups of wavefronts, and
1294 freed when all the wavefronts of a work-group have terminated. The data store
1295 (DS) instructions can be used to access it.
1296
1297 The private memory space uses the hardware scratch memory support. If the kernel
1298 uses scratch, then the hardware allocates memory that is accessed using
1299 wavefront lane dword (4 byte) interleaving. The mapping used from private
1300 address to physical address is:
1301
1302   ``wavefront-scratch-base +
1303   (private-address * wavefront-size * 4) +
1304   (wavefront-lane-id * 4)``
1305
1306 There are different ways that the wavefront scratch base address is determined
1307 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1308 memory can be accessed in an interleaved manner using buffer instruction with
1309 the scratch buffer descriptor and per wave scratch offset, by the scratch
1310 instructions, or by flat instructions. If each lane of a wavefront accesses the
1311 same private address, the interleaving results in adjacent dwords being accessed
1312 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1313 supported except by flat and scratch instructions in GFX9.
1314
1315 The generic address space uses the hardware flat address support available in
1316 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1317 local appertures), that are outside the range of addressible global memory, to
1318 map from a flat address to a private or local address.
1319
1320 FLAT instructions can take a flat address and access global, private (scratch)
1321 and group (LDS) memory depending in if the address is within one of the
1322 apperture ranges. Flat access to scratch requires hardware aperture setup and
1323 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1324 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1325 (see :ref:`amdgpu-amdhsa-m0`).
1326
1327 To convert between a segment address and a flat address the base address of the
1328 appertures address can be used. For GFX7-GFX8 these are available in the
1329 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1330 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1331 GFX9 the appature base addresses are directly available as inline constant
1332 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1333 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1334 which makes it easier to convert from flat to segment or segment to flat.
1335
1336 HSA Image and Samplers
1337 ~~~~~~~~~~~~~~~~~~~~~~
1338
1339 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1340 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1341 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1342 enumeration values for the queries that are not trivially deducible from the S#
1343 representation.
1344
1345 HSA Signals
1346 ~~~~~~~~~~~
1347
1348 Signal handles created by the ROCm runtime are 64 bit addresses of a structure
1349 allocated in memory accessible from both the CPU and GPU. The structure is
1350 defined by the ROCm runtime and subject to change between releases (see
1351 [AMD-ROCm-github]_).
1352
1353 .. _amdgpu-amdhsa-hsa-aql-queue:
1354
1355 HSA AQL Queue
1356 ~~~~~~~~~~~~~
1357
1358 The AQL queue structure is defined by the ROCm runtime and subject to change
1359 between releases (see [AMD-ROCm-github]_). For some processors it contains
1360 fields needed to implement certain language features such as the flat address
1361 aperture bases. It also contains fields used by CP such as managing the
1362 allocation of scratch memory.
1363
1364 .. _amdgpu-amdhsa-kernel-descriptor:
1365
1366 Kernel Descriptor
1367 ~~~~~~~~~~~~~~~~~
1368
1369 A kernel descriptor consists of the information needed by CP to initiate the
1370 execution of a kernel, including the entry point address of the machine code
1371 that implements the kernel.
1372
1373 Kernel Descriptor for GFX6-GFX9
1374 +++++++++++++++++++++++++++++++
1375
1376 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1377
1378   .. table:: Kernel Descriptor for GFX6-GFX9
1379      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1380
1381      ======= ======= =============================== ===========================
1382      Bits    Size    Field Name                      Description
1383      ======= ======= =============================== ===========================
1384      31:0    4 bytes group_segment_fixed_size        The amount of fixed local
1385                                                      address space memory
1386                                                      required for a work-group
1387                                                      in bytes. This does not
1388                                                      include any dynamically
1389                                                      allocated local address
1390                                                      space memory that may be
1391                                                      added when the kernel is
1392                                                      dispatched.
1393      63:32   4 bytes private_segment_fixed_size      The amount of fixed
1394                                                      private address space
1395                                                      memory required for a
1396                                                      work-item in bytes. If
1397                                                      is_dynamic_callstack is 1
1398                                                      then additional space must
1399                                                      be added to this value for
1400                                                      the call stack.
1401      95:64   4 bytes max_flat_workgroup_size         Maximum flat work-group
1402                                                      size supported by the
1403                                                      kernel in work-items.
1404      96      1 bit   is_dynamic_call_stack           Indicates if the generated
1405                                                      machine code is using a
1406                                                      dynamically sized call
1407                                                      stack.
1408      97      1 bit   is_xnack_enabled                Indicates if the generated
1409                                                      machine code is capable of
1410                                                      suppoting XNACK.
1411      127:98  30 bits                                 Reserved. Must be 0.
1412      191:128 8 bytes kernel_code_entry_byte_offset   Byte offset (possibly
1413                                                      negative) from base
1414                                                      address of kernel
1415                                                      descriptor to kernel's
1416                                                      entry point instruction
1417                                                      which must be 256 byte
1418                                                      aligned.
1419      383:192 24                                      Reserved. Must be 0.
1420              bytes
1421      415:384 4 bytes compute_pgm_rsrc1               Compute Shader (CS)
1422                                                      program settings used by
1423                                                      CP to set up
1424                                                      ``COMPUTE_PGM_RSRC1``
1425                                                      configuration
1426                                                      register. See
1427                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1428      447:416 4 bytes compute_pgm_rsrc2               Compute Shader (CS)
1429                                                      program settings used by
1430                                                      CP to set up
1431                                                      ``COMPUTE_PGM_RSRC2``
1432                                                      configuration
1433                                                      register. See
1434                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1435      448     1 bit   enable_sgpr_private_segment     Enable the setup of the
1436                      _buffer                         SGPR user data registers
1437                                                      (see
1438                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1439
1440                                                      The total number of SGPR
1441                                                      user data registers
1442                                                      requested must not exceed
1443                                                      16 and match value in
1444                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1445                                                      Any requests beyond 16
1446                                                      will be ignored.
1447      449     1 bit   enable_sgpr_dispatch_ptr        *see above*
1448      450     1 bit   enable_sgpr_queue_ptr           *see above*
1449      451     1 bit   enable_sgpr_kernarg_segment_ptr *see above*
1450      452     1 bit   enable_sgpr_dispatch_id         *see above*
1451      453     1 bit   enable_sgpr_flat_scratch_init   *see above*
1452      454     1 bit   enable_sgpr_private_segment     *see above*
1453                      _size
1454      455     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1455                      _count_X                        should always be 0.
1456      456     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1457                      _count_Y                        should always be 0.
1458      457     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1459                      _count_Z                        should always be 0.
1460      463:458 6 bits                                  Reserved. Must be 0.
1461      511:464 4                                       Reserved. Must be 0.
1462              bytes
1463      512     **Total size 64 bytes.**
1464      ======= ===================================================================
1465
1466 ..
1467
1468   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1469      :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1470
1471      ======= ======= =============================== ===========================================================================
1472      Bits    Size    Field Name                      Description
1473      ======= ======= =============================== ===========================================================================
1474      5:0     6 bits  granulated_workitem_vgpr_count  Number of vector registers
1475                                                      used by each work-item,
1476                                                      granularity is device
1477                                                      specific:
1478
1479                                                      GFX6-9
1480                                                        roundup((max-vgpg + 1)
1481                                                        / 4) - 1
1482
1483                                                      Used by CP to set up
1484                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1485      9:6     4 bits  granulated_wavefront_sgpr_count Number of scalar registers
1486                                                      used by a wavefront,
1487                                                      granularity is device
1488                                                      specific:
1489
1490                                                      GFX6-8
1491                                                        roundup((max-sgpg + 1)
1492                                                        / 8) - 1
1493                                                      GFX9
1494                                                        roundup((max-sgpg + 1)
1495                                                        / 16) - 1
1496
1497                                                      Includes the special SGPRs
1498                                                      for VCC, Flat Scratch (for
1499                                                      GFX7 onwards) and XNACK
1500                                                      (for GFX8 onwards). It does
1501                                                      not include the 16 SGPR
1502                                                      added if a trap handler is
1503                                                      enabled.
1504
1505                                                      Used by CP to set up
1506                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1507      11:10   2 bits  priority                        Must be 0.
1508
1509                                                      Start executing wavefront
1510                                                      at the specified priority.
1511
1512                                                      CP is responsible for
1513                                                      filling in
1514                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1515      13:12   2 bits  float_mode_round_32             Wavefront starts execution
1516                                                      with specified rounding
1517                                                      mode for single (32
1518                                                      bit) floating point
1519                                                      precision floating point
1520                                                      operations.
1521
1522                                                      Floating point rounding
1523                                                      mode values are defined in
1524                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1525
1526                                                      Used by CP to set up
1527                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1528      15:14   2 bits  float_mode_round_16_64          Wavefront starts execution
1529                                                      with specified rounding
1530                                                      denorm mode for half/double (16
1531                                                      and 64 bit) floating point
1532                                                      precision floating point
1533                                                      operations.
1534
1535                                                      Floating point rounding
1536                                                      mode values are defined in
1537                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1538
1539                                                      Used by CP to set up
1540                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1541      17:16   2 bits  float_mode_denorm_32            Wavefront starts execution
1542                                                      with specified denorm mode
1543                                                      for single (32
1544                                                      bit)  floating point
1545                                                      precision floating point
1546                                                      operations.
1547
1548                                                      Floating point denorm mode
1549                                                      values are defined in
1550                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1551
1552                                                      Used by CP to set up
1553                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1554      19:18   2 bits  float_mode_denorm_16_64         Wavefront starts execution
1555                                                      with specified denorm mode
1556                                                      for half/double (16
1557                                                      and 64 bit) floating point
1558                                                      precision floating point
1559                                                      operations.
1560
1561                                                      Floating point denorm mode
1562                                                      values are defined in
1563                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1564
1565                                                      Used by CP to set up
1566                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1567      20      1 bit   priv                            Must be 0.
1568
1569                                                      Start executing wavefront
1570                                                      in privilege trap handler
1571                                                      mode.
1572
1573                                                      CP is responsible for
1574                                                      filling in
1575                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1576      21      1 bit   enable_dx10_clamp               Wavefront starts execution
1577                                                      with DX10 clamp mode
1578                                                      enabled. Used by the vector
1579                                                      ALU to force DX-10 style
1580                                                      treatment of NaN's (when
1581                                                      set, clamp NaN to zero,
1582                                                      otherwise pass NaN
1583                                                      through).
1584
1585                                                      Used by CP to set up
1586                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1587      22      1 bit   debug_mode                      Must be 0.
1588
1589                                                      Start executing wavefront
1590                                                      in single step mode.
1591
1592                                                      CP is responsible for
1593                                                      filling in
1594                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1595      23      1 bit   enable_ieee_mode                Wavefront starts execution
1596                                                      with IEEE mode
1597                                                      enabled. Floating point
1598                                                      opcodes that support
1599                                                      exception flag gathering
1600                                                      will quiet and propagate
1601                                                      signaling-NaN inputs per
1602                                                      IEEE 754-2008. Min_dx10 and
1603                                                      max_dx10 become IEEE
1604                                                      754-2008 compliant due to
1605                                                      signaling-NaN propagation
1606                                                      and quieting.
1607
1608                                                      Used by CP to set up
1609                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1610      24      1 bit   bulky                           Must be 0.
1611
1612                                                      Only one work-group allowed
1613                                                      to execute on a compute
1614                                                      unit.
1615
1616                                                      CP is responsible for
1617                                                      filling in
1618                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1619      25      1 bit   cdbg_user                       Must be 0.
1620
1621                                                      Flag that can be used to
1622                                                      control debugging code.
1623
1624                                                      CP is responsible for
1625                                                      filling in
1626                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1627      31:26   6 bits                                  Reserved. Must be 0.
1628      32      **Total size 4 bytes**
1629      ======= ===================================================================================================================
1630
1631 ..
1632
1633   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1634      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1635
1636      ======= ======= =============================== ===========================================================================
1637      Bits    Size    Field Name                      Description
1638      ======= ======= =============================== ===========================================================================
1639      0       1 bit   enable_sgpr_private_segment     Enable the setup of the
1640                      _wave_offset                    SGPR wave scratch offset
1641                                                      system register (see
1642                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1643
1644                                                      Used by CP to set up
1645                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1646      5:1     5 bits  user_sgpr_count                 The total number of SGPR
1647                                                      user data registers
1648                                                      requested. This number must
1649                                                      match the number of user
1650                                                      data registers enabled.
1651
1652                                                      Used by CP to set up
1653                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1654      6       1 bit   enable_trap_handler             Set to 1 if code contains a
1655                                                      TRAP instruction which
1656                                                      requires a trap handler to
1657                                                      be enabled.
1658
1659                                                      CP sets
1660                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1661                                                      if the runtime has
1662                                                      installed a trap handler
1663                                                      regardless of the setting
1664                                                      of this field.
1665      7       1 bit   enable_sgpr_workgroup_id_x      Enable the setup of the
1666                                                      system SGPR register for
1667                                                      the work-group id in the X
1668                                                      dimension (see
1669                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1670
1671                                                      Used by CP to set up
1672                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1673      8       1 bit   enable_sgpr_workgroup_id_y      Enable the setup of the
1674                                                      system SGPR register for
1675                                                      the work-group id in the Y
1676                                                      dimension (see
1677                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1678
1679                                                      Used by CP to set up
1680                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1681      9       1 bit   enable_sgpr_workgroup_id_z      Enable the setup of the
1682                                                      system SGPR register for
1683                                                      the work-group id in the Z
1684                                                      dimension (see
1685                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1686
1687                                                      Used by CP to set up
1688                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1689      10      1 bit   enable_sgpr_workgroup_info      Enable the setup of the
1690                                                      system SGPR register for
1691                                                      work-group information (see
1692                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1693
1694                                                      Used by CP to set up
1695                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1696      12:11   2 bits  enable_vgpr_workitem_id         Enable the setup of the
1697                                                      VGPR system registers used
1698                                                      for the work-item ID.
1699                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1700                                                      defines the values.
1701
1702                                                      Used by CP to set up
1703                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1704      13      1 bit   enable_exception_address_watch  Must be 0.
1705
1706                                                      Wavefront starts execution
1707                                                      with address watch
1708                                                      exceptions enabled which
1709                                                      are generated when L1 has
1710                                                      witnessed a thread access
1711                                                      an *address of
1712                                                      interest*.
1713
1714                                                      CP is responsible for
1715                                                      filling in the address
1716                                                      watch bit in
1717                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1718                                                      according to what the
1719                                                      runtime requests.
1720      14      1 bit   enable_exception_memory         Must be 0.
1721
1722                                                      Wavefront starts execution
1723                                                      with memory violation
1724                                                      exceptions exceptions
1725                                                      enabled which are generated
1726                                                      when a memory violation has
1727                                                      occurred for this wave from
1728                                                      L1 or LDS
1729                                                      (write-to-read-only-memory,
1730                                                      mis-aligned atomic, LDS
1731                                                      address out of range,
1732                                                      illegal address, etc.).
1733
1734                                                      CP sets the memory
1735                                                      violation bit in
1736                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1737                                                      according to what the
1738                                                      runtime requests.
1739      23:15   9 bits  granulated_lds_size             Must be 0.
1740
1741                                                      CP uses the rounded value
1742                                                      from the dispatch packet,
1743                                                      not this value, as the
1744                                                      dispatch may contain
1745                                                      dynamically allocated group
1746                                                      segment memory. CP writes
1747                                                      directly to
1748                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1749
1750                                                      Amount of group segment
1751                                                      (LDS) to allocate for each
1752                                                      work-group. Granularity is
1753                                                      device specific:
1754
1755                                                      GFX6:
1756                                                        roundup(lds-size / (64 * 4))
1757                                                      GFX7-GFX9:
1758                                                        roundup(lds-size / (128 * 4))
1759
1760      24      1 bit   enable_exception_ieee_754_fp    Wavefront starts execution
1761                      _invalid_operation              with specified exceptions
1762                                                      enabled.
1763
1764                                                      Used by CP to set up
1765                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1766                                                      (set from bits 0..6).
1767
1768                                                      IEEE 754 FP Invalid
1769                                                      Operation
1770      25      1 bit   enable_exception_fp_denormal    FP Denormal one or more
1771                      _source                         input operands is a
1772                                                      denormal number
1773      26      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Division by
1774                      _division_by_zero               Zero
1775      27      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP FP Overflow
1776                      _overflow
1777      28      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Underflow
1778                      _underflow
1779      29      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Inexact
1780                      _inexact
1781      30      1 bit   enable_exception_int_divide_by  Integer Division by Zero
1782                      _zero                           (rcp_iflag_f32 instruction
1783                                                      only)
1784      31      1 bit                                   Reserved. Must be 0.
1785      32      **Total size 4 bytes.**
1786      ======= ===================================================================================================================
1787
1788 ..
1789
1790   .. table:: Floating Point Rounding Mode Enumeration Values
1791      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1792
1793      ===================================== ===== ===============================
1794      Enumeration Name                      Value Description
1795      ===================================== ===== ===============================
1796      AMD_FLOAT_ROUND_MODE_NEAR_EVEN        0     Round Ties To Even
1797      AMD_FLOAT_ROUND_MODE_PLUS_INFINITY    1     Round Toward +infinity
1798      AMD_FLOAT_ROUND_MODE_MINUS_INFINITY   2     Round Toward -infinity
1799      AMD_FLOAT_ROUND_MODE_ZERO             3     Round Toward 0
1800      ===================================== ===== ===============================
1801
1802 ..
1803
1804   .. table:: Floating Point Denorm Mode Enumeration Values
1805      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1806
1807      ===================================== ===== ===============================
1808      Enumeration Name                      Value Description
1809      ===================================== ===== ===============================
1810      AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST   0     Flush Source and Destination
1811                                                  Denorms
1812      AMD_FLOAT_DENORM_MODE_FLUSH_DST       1     Flush Output Denorms
1813      AMD_FLOAT_DENORM_MODE_FLUSH_SRC       2     Flush Source Denorms
1814      AMD_FLOAT_DENORM_MODE_FLUSH_NONE      3     No Flush
1815      ===================================== ===== ===============================
1816
1817 ..
1818
1819   .. table:: System VGPR Work-Item ID Enumeration Values
1820      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1821
1822      ===================================== ===== ===============================
1823      Enumeration Name                      Value Description
1824      ===================================== ===== ===============================
1825      AMD_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension ID.
1826      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1827                                                  dimensions ID.
1828      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1829                                                  dimensions ID.
1830      AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1831      ===================================== ===== ===============================
1832
1833 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1834
1835 Initial Kernel Execution State
1836 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1837
1838 This section defines the register state that will be set up by the packet
1839 processor prior to the start of execution of every wavefront. This is limited by
1840 the constraints of the hardware controllers of CP/ADC/SPI.
1841
1842 The order of the SGPR registers is defined, but the compiler can specify which
1843 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1844 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1845 for enabled registers are dense starting at SGPR0: the first enabled register is
1846 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1847 an SGPR number.
1848
1849 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1850 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1851 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1852 initialized. These are then immediately followed by the System SGPRs that are
1853 set up by ADC/SPI and can have different values for each wave of the grid
1854 dispatch.
1855
1856 SGPR register initial state is defined in
1857 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1858
1859   .. table:: SGPR Register Set Up Order
1860      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1861
1862      ========== ========================== ====== ==============================
1863      SGPR Order Name                       Number Description
1864                 (kernel descriptor enable  of
1865                 field)                     SGPRs
1866      ========== ========================== ====== ==============================
1867      First      Private Segment Buffer     4      V# that can be used, together
1868                 (enable_sgpr_private              with Scratch Wave Offset as an
1869                 _segment_buffer)                  offset, to access the private
1870                                                   memory space using a segment
1871                                                   address.
1872
1873                                                   CP uses the value provided by
1874                                                   the runtime.
1875      then       Dispatch Ptr               2      64 bit address of AQL dispatch
1876                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
1877                                                   actually executing.
1878      then       Queue Ptr                  2      64 bit address of amd_queue_t
1879                 (enable_sgpr_queue_ptr)           object for AQL queue on which
1880                                                   the dispatch packet was
1881                                                   queued.
1882      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
1883                 (enable_sgpr_kernarg              segment. This is directly
1884                 _segment_ptr)                     copied from the
1885                                                   kernarg_address in the kernel
1886                                                   dispatch packet.
1887
1888                                                   Having CP load it once avoids
1889                                                   loading it at the beginning of
1890                                                   every wavefront.
1891      then       Dispatch Id                2      64 bit Dispatch ID of the
1892                 (enable_sgpr_dispatch_id)         dispatch packet being
1893                                                   executed.
1894      then       Flat Scratch Init          2      This is 2 SGPRs:
1895                 (enable_sgpr_flat_scratch
1896                 _init)                            GFX6
1897                                                     Not supported.
1898                                                   GFX7-GFX8
1899                                                     The first SGPR is a 32 bit
1900                                                     byte offset from
1901                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1902                                                     to per SPI base of memory
1903                                                     for scratch for the queue
1904                                                     executing the kernel
1905                                                     dispatch. CP obtains this
1906                                                     from the runtime.
1907
1908                                                     This is the same offset used
1909                                                     in computing the Scratch
1910                                                     Segment Buffer base
1911                                                     address. The value of
1912                                                     Scratch Wave Offset must be
1913                                                     added by the kernel machine
1914                                                     code and moved to SGPRn-4
1915                                                     for use as the FLAT SCRATCH
1916                                                     BASE in flat memory
1917                                                     instructions.
1918
1919                                                     The second SGPR is 32 bit
1920                                                     byte size of a single
1921                                                     work-item’s scratch memory
1922                                                     usage. This is directly
1923                                                     loaded from the kernel
1924                                                     dispatch packet Private
1925                                                     Segment Byte Size and
1926                                                     rounded up to a multiple of
1927                                                     DWORD.
1928
1929                                                     The kernel code must move to
1930                                                     SGPRn-3 for use as the FLAT
1931                                                     SCRATCH SIZE in flat memory
1932                                                     instructions. Having CP load
1933                                                     it once avoids loading it at
1934                                                     the beginning of every
1935                                                     wavefront.
1936                                                   GFX9
1937                                                     This is the 64 bit base
1938                                                     address of the per SPI
1939                                                     scratch backing memory
1940                                                     managed by SPI for the queue
1941                                                     executing the kernel
1942                                                     dispatch. CP obtains this
1943                                                     from the runtime (and
1944                                                     divides it if there are
1945                                                     multiple Shader Arrays each
1946                                                     with its own SPI). The value
1947                                                     of Scratch Wave Offset must
1948                                                     be added by the kernel
1949                                                     machine code and moved to
1950                                                     SGPRn-4 and SGPRn-3 for use
1951                                                     as the FLAT SCRATCH BASE in
1952                                                     flat memory instructions.
1953      then       Private Segment Size       1      The 32 bit byte size of a
1954                 (enable_sgpr_private              single work-item’s scratch
1955                 _segment_size)                    memory allocation. This is the
1956                                                   value from the kernel dispatch
1957                                                   packet Private Segment Byte
1958                                                   Size rounded up by CP to a
1959                                                   multiple of DWORD.
1960
1961                                                   Having CP load it once avoids
1962                                                   loading it at the beginning of
1963                                                   every wavefront.
1964
1965                                                   This is not used for
1966                                                   GFX7-GFX8 since it is the same
1967                                                   value as the second SGPR of
1968                                                   Flat Scratch Init. However, it
1969                                                   may be needed for GFX9 which
1970                                                   changes the meaning of the
1971                                                   Flat Scratch Init value.
1972      then       Grid Work-Group Count X    1      32 bit count of the number of
1973                 (enable_sgpr_grid                 work-groups in the X dimension
1974                 _workgroup_count_X)               for the grid being
1975                                                   executed. Computed from the
1976                                                   fields in the kernel dispatch
1977                                                   packet as ((grid_size.x +
1978                                                   workgroup_size.x - 1) /
1979                                                   workgroup_size.x).
1980      then       Grid Work-Group Count Y    1      32 bit count of the number of
1981                 (enable_sgpr_grid                 work-groups in the Y dimension
1982                 _workgroup_count_Y &&             for the grid being
1983                 less than 16 previous             executed. Computed from the
1984                 SGPRs)                            fields in the kernel dispatch
1985                                                   packet as ((grid_size.y +
1986                                                   workgroup_size.y - 1) /
1987                                                   workgroupSize.y).
1988
1989                                                   Only initialized if <16
1990                                                   previous SGPRs initialized.
1991      then       Grid Work-Group Count Z    1      32 bit count of the number of
1992                 (enable_sgpr_grid                 work-groups in the Z dimension
1993                 _workgroup_count_Z &&             for the grid being
1994                 less than 16 previous             executed. Computed from the
1995                 SGPRs)                            fields in the kernel dispatch
1996                                                   packet as ((grid_size.z +
1997                                                   workgroup_size.z - 1) /
1998                                                   workgroupSize.z).
1999
2000                                                   Only initialized if <16
2001                                                   previous SGPRs initialized.
2002      then       Work-Group Id X            1      32 bit work-group id in X
2003                 (enable_sgpr_workgroup_id         dimension of grid for
2004                 _X)                               wavefront.
2005      then       Work-Group Id Y            1      32 bit work-group id in Y
2006                 (enable_sgpr_workgroup_id         dimension of grid for
2007                 _Y)                               wavefront.
2008      then       Work-Group Id Z            1      32 bit work-group id in Z
2009                 (enable_sgpr_workgroup_id         dimension of grid for
2010                 _Z)                               wavefront.
2011      then       Work-Group Info            1      {first_wave, 14’b0000,
2012                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2013                 _info)                            threadgroup_size_in_waves[5:0]}
2014      then       Scratch Wave Offset        1      32 bit byte offset from base
2015                 (enable_sgpr_private              of scratch base of queue
2016                 _segment_wave_offset)             executing the kernel
2017                                                   dispatch. Must be used as an
2018                                                   offset with Private
2019                                                   segment address when using
2020                                                   Scratch Segment Buffer. It
2021                                                   must be used to set up FLAT
2022                                                   SCRATCH for flat addressing
2023                                                   (see
2024                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2025      ========== ========================== ====== ==============================
2026
2027 The order of the VGPR registers is defined, but the compiler can specify which
2028 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2029 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2030 for enabled registers are dense starting at VGPR0: the first enabled register is
2031 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2032 VGPR number.
2033
2034 VGPR register initial state is defined in
2035 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2036
2037   .. table:: VGPR Register Set Up Order
2038      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2039
2040      ========== ========================== ====== ==============================
2041      VGPR Order Name                       Number Description
2042                 (kernel descriptor enable  of
2043                 field)                     VGPRs
2044      ========== ========================== ====== ==============================
2045      First      Work-Item Id X             1      32 bit work item id in X
2046                 (Always initialized)              dimension of work-group for
2047                                                   wavefront lane.
2048      then       Work-Item Id Y             1      32 bit work item id in Y
2049                 (enable_vgpr_workitem_id          dimension of work-group for
2050                 > 0)                              wavefront lane.
2051      then       Work-Item Id Z             1      32 bit work item id in Z
2052                 (enable_vgpr_workitem_id          dimension of work-group for
2053                 > 1)                              wavefront lane.
2054      ========== ========================== ====== ==============================
2055
2056 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2057
2058 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2059    registers.
2060 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2061    combination including none.
2062 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2063    cannot included with the flat scratch init value which is per queue.
2064 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2065    or (X, Y, Z).
2066
2067 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2068 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2069
2070 The global segment can be accessed either using buffer instructions (GFX6 which
2071 has V# 64 bit address support), flat instructions (GFX7-9), or global
2072 instructions (GFX9).
2073
2074 If buffer operations are used then the compiler can generate a V# with the
2075 following properties:
2076
2077 * base address of 0
2078 * no swizzle
2079 * ATC: 1 if IOMMU present (such as APU)
2080 * ptr64: 1
2081 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2082   APU and NC for dGPU).
2083
2084 .. _amdgpu-amdhsa-kernel-prolog:
2085
2086 Kernel Prolog
2087 ~~~~~~~~~~~~~
2088
2089 .. _amdgpu-amdhsa-m0:
2090
2091 M0
2092 ++
2093
2094 GFX6-GFX8
2095   The M0 register must be initialized with a value at least the total LDS size
2096   if the kernel may access LDS via DS or flat operations. Total LDS size is
2097   available in dispatch packet. For M0, it is also possible to use maximum
2098   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2099   GFX7-GFX8).
2100 GFX9
2101   The M0 register is not used for range checking LDS accesses and so does not
2102   need to be initialized in the prolog.
2103
2104 .. _amdgpu-amdhsa-flat-scratch:
2105
2106 Flat Scratch
2107 ++++++++++++
2108
2109 If the kernel may use flat operations to access scratch memory, the prolog code
2110 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2111 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2112 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2113
2114 GFX6
2115   Flat scratch is not supported.
2116
2117 GFX7-8
2118   1. The low word of Flat Scratch Init is 32 bit byte offset from
2119      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2120      being managed by SPI for the queue executing the kernel dispatch. This is
2121      the same value used in the Scratch Segment Buffer V# base address. The
2122      prolog must add the value of Scratch Wave Offset to get the wave's byte
2123      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2124      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2125      by 8 before moving into FLAT_SCRATCH_LO.
2126   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2127      work-items scratch memory usage. This is directly loaded from the kernel
2128      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2129      DWORD. Having CP load it once avoids loading it at the beginning of every
2130      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2131      SIZE.
2132 GFX9
2133   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2134   memory being managed by SPI for the queue executing the kernel dispatch. The
2135   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2136   pair for use as the flat scratch base in flat memory instructions.
2137
2138 .. _amdgpu-amdhsa-memory-model:
2139
2140 Memory Model
2141 ~~~~~~~~~~~~
2142
2143 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2144 (see :ref:`memmodel`). *The implementation is WIP.*
2145
2146 .. TODO
2147    Update when implementation complete.
2148
2149    Support more relaxed OpenCL memory model to be controlled by environment
2150    component of target triple.
2151
2152 The AMDGPU backend supports the memory synchronization scopes specified in
2153 :ref:`amdgpu-memory-scopes`.
2154
2155 The code sequences used to implement the memory model are defined in table
2156 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2157
2158 The sequences specify the order of instructions that a single thread must
2159 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2160 to other memory instructions executed by the same thread. This allows them to be
2161 moved earlier or later which can allow them to be combined with other instances
2162 of the same instruction, or hoisted/sunk out of loops to improve
2163 performance. Only the instructions related to the memory model are given;
2164 additional ``s_waitcnt`` instructions are required to ensure registers are
2165 defined before being used. These may be able to be combined with the memory
2166 model ``s_waitcnt`` instructions as described above.
2167
2168 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2169 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2170 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2171 memory model which has separate happens-before relations for the global and
2172 local address spaces, and only a fence specifying both global and local address
2173 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2174 allow an address space to be specified the OpenCL fence has to convervatively
2175 assume both local and global address space was specified. However, optimizations
2176 can often be done to eliminate the additional ``s_waitcnt``instructions when
2177 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2178 instructions. The code sequences in the table indicate what can be omitted for
2179 the OpenCL memory. The target triple environment is used to determine if the
2180 source language is OpenCL (see :ref:`amdgpu-opencl`).
2181
2182 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2183 operations.
2184
2185 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2186 termed vector memory operations.
2187
2188 For GFX6-GFX9:
2189
2190 * Each agent has multiple compute units (CU).
2191 * Each CU has multiple SIMDs that execute wavefronts.
2192 * The wavefronts for a single work-group are executed in the same CU but may be
2193   executed by different SIMDs.
2194 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2195   executing on it.
2196 * All LDS operations of a CU are performed as wavefront wide operations in a
2197   global order and involve no caching. Completion is reported to a wavefront in
2198   execution order.
2199 * The LDS memory has multiple request queues shared by the SIMDs of a
2200   CU. Therefore, the LDS operations performed by different waves of a work-group
2201   can be reordered relative to each other, which can result in reordering the
2202   visibility of vector memory operations with respect to LDS operations of other
2203   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2204   ensure synchronization between LDS operations and vector memory operations
2205   between waves of a work-group, but not between operations performed by the
2206   same wavefront.
2207 * The vector memory operations are performed as wavefront wide operations and
2208   completion is reported to a wavefront in execution order. The exception is
2209   that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2210   vector memory order if they access LDS memory, and out of LDS operation order
2211   if they access global memory.
2212 * The vector memory operations access a vector L1 cache shared by all wavefronts
2213   on a CU. Therefore, no special action is required for coherence between
2214   wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2215   coherence between waves executing in different work-groups as they may be
2216   executing on different CUs.
2217 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2218   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2219   scalar operations are used in a restricted way so do not impact the memory
2220   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2221 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2222   the same agent.
2223 * The L2 cache has independent channels to service disjoint ranges of virtual
2224   addresses.
2225 * Each CU has a separate request queue per channel. Therefore, the vector and
2226   scalar memory operations performed by waves executing in different work-groups
2227   (which may be executing on different CUs) of an agent can be reordered
2228   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2229   synchronization between vector memory operations of different CUs. It ensures a
2230   previous vector memory operation has completed before executing a subsequent
2231   vector memory or LDS operation and so can be used to meet the requirements of
2232   acquire and release.
2233 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2234   of virtual addresses can be set up to bypass it to ensure system coherence.
2235
2236 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2237 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2238 memory, atomic memory orderings are not meaningful and all accesses are treated
2239 as non-atomic.
2240
2241 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2242 scalar memory instructions). Since the constant address space contents do not
2243 change during the execution of a kernel dispatch it is not legal to perform
2244 stores, and atomic memory orderings are not meaningful and all access are
2245 treated as non-atomic.
2246
2247 A memory synchronization scope wider than work-group is not meaningful for the
2248 group (LDS) address space and is treated as work-group.
2249
2250 The memory model does not support the region address space which is treated as
2251 non-atomic.
2252
2253 Acquire memory ordering is not meaningful on store atomic instructions and is
2254 treated as non-atomic.
2255
2256 Release memory ordering is not meaningful on load atomic instructions and is
2257 treated a non-atomic.
2258
2259 Acquire-release memory ordering is not meaningful on load or store atomic
2260 instructions and is treated as acquire and release respectively.
2261
2262 AMDGPU backend only uses scalar memory operations to access memory that is
2263 proven to not change during the execution of the kernel dispatch. This includes
2264 constant address space and global address space for program scope const
2265 variables. Therefore the kernel machine code does not have to maintain the
2266 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2267 and vector L1 caches are invalidated between kernel dispatches by CP since
2268 constant address space data may change between kernel dispatch executions. See
2269 :ref:`amdgpu-amdhsa-memory-spaces`.
2270
2271 The one execption is if scalar writes are used to spill SGPR registers. In this
2272 case the AMDGPU backend ensures the memory location used to spill is never
2273 accessed by vector memory operations at the same time. If scalar writes are used
2274 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2275 return since the locations may be used for vector memory instructions by a
2276 future wave that uses the same scratch area, or a function call that creates a
2277 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2278 as all scalar writes are write-before-read in the same thread.
2279
2280 Scratch backing memory (which is used for the private address space) is accessed
2281 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2282 is only accessed by a single thread, and is always write-before-read,
2283 there is never a need to invalidate these entries from the L1 cache. Hence all
2284 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2285 lines.
2286
2287 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2288 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2289 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2290 coherent) and so the L2 cache will coherent with the CPU and other agents.
2291
2292   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2293      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2294
2295      ============ ============ ============== ========== =======================
2296      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2297                   Ordering     Sync Scope     Address
2298                                               Space
2299      ============ ============ ============== ========== =======================
2300      **Non-Atomic**
2301      ---------------------------------------------------------------------------
2302      load         *none*       *none*         - global   non-volatile
2303                                               - generic    1. buffer/global/flat_load
2304                                                          volatile
2305                                                            1. buffer/global/flat_load
2306                                                               glc=1
2307      load         *none*       *none*         - local    1. ds_load
2308      store        *none*       *none*         - global   1. buffer/global/flat_store
2309                                               - generic
2310      store        *none*       *none*         - local    1. ds_store
2311      **Unordered Atomic**
2312      ---------------------------------------------------------------------------
2313      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2314      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2315      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2316                                                          atomic*.
2317      **Monotonic Atomic**
2318      ---------------------------------------------------------------------------
2319      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2320                                - wavefront    - generic
2321                                - workgroup
2322      load atomic  monotonic    - singlethread - local    1. ds_load
2323                                - wavefront
2324                                - workgroup
2325      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2326                                - system       - generic     glc=1
2327      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2328                                - wavefront    - generic
2329                                - workgroup
2330                                - agent
2331                                - system
2332      store atomic monotonic    - singlethread - local    1. ds_store
2333                                - wavefront
2334                                - workgroup
2335      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2336                                - wavefront    - generic
2337                                - workgroup
2338                                - agent
2339                                - system
2340      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2341                                - wavefront
2342                                - workgroup
2343      **Acquire Atomic**
2344      ---------------------------------------------------------------------------
2345      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2346                                - wavefront    - local
2347                                               - generic
2348      load atomic  acquire      - workgroup    - global   1. buffer/global_load
2349      load atomic  acquire      - workgroup    - local    1. ds/flat_load
2350                                               - generic  2. s_waitcnt lgkmcnt(0)
2351
2352                                                            - If OpenCL, omit
2353                                                              waitcnt.
2354                                                            - Must happen before
2355                                                              any following
2356                                                              global/generic
2357                                                              load/load
2358                                                              atomic/store/store
2359                                                              atomic/atomicrmw.
2360                                                            - Ensures any
2361                                                              following global
2362                                                              data read is no
2363                                                              older than the load
2364                                                              atomic value being
2365                                                              acquired.
2366
2367      load atomic  acquire      - agent        - global   1. buffer/global_load
2368                                - system                     glc=1
2369                                                          2. s_waitcnt vmcnt(0)
2370
2371                                                            - Must happen before
2372                                                              following
2373                                                              buffer_wbinvl1_vol.
2374                                                            - Ensures the load
2375                                                              has completed
2376                                                              before invalidating
2377                                                              the cache.
2378
2379                                                          3. buffer_wbinvl1_vol
2380
2381                                                            - Must happen before
2382                                                              any following
2383                                                              global/generic
2384                                                              load/load
2385                                                              atomic/atomicrmw.
2386                                                            - Ensures that
2387                                                              following
2388                                                              loads will not see
2389                                                              stale global data.
2390
2391      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2392                                - system                  2. s_waitcnt vmcnt(0) &
2393                                                             lgkmcnt(0)
2394
2395                                                            - If OpenCL omit
2396                                                              lgkmcnt(0).
2397                                                            - Must happen before
2398                                                              following
2399                                                              buffer_wbinvl1_vol.
2400                                                            - Ensures the flat_load
2401                                                              has completed
2402                                                              before invalidating
2403                                                              the cache.
2404
2405                                                          3. buffer_wbinvl1_vol
2406
2407                                                            - Must happen before
2408                                                              any following
2409                                                              global/generic
2410                                                              load/load
2411                                                              atomic/atomicrmw.
2412                                                            - Ensures that
2413                                                              following loads
2414                                                              will not see stale
2415                                                              global data.
2416
2417      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2418                                - wavefront    - local
2419                                               - generic
2420      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
2421      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
2422                                               - generic  2. waitcnt lgkmcnt(0)
2423
2424                                                            - If OpenCL, omit
2425                                                              waitcnt.
2426                                                            - Must happen before
2427                                                              any following
2428                                                              global/generic
2429                                                              load/load
2430                                                              atomic/store/store
2431                                                              atomic/atomicrmw.
2432                                                            - Ensures any
2433                                                              following global
2434                                                              data read is no
2435                                                              older than the
2436                                                              atomicrmw value
2437                                                              being acquired.
2438
2439      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
2440                                - system                  2. s_waitcnt vmcnt(0)
2441
2442                                                            - Must happen before
2443                                                              following
2444                                                              buffer_wbinvl1_vol.
2445                                                            - Ensures the
2446                                                              atomicrmw has
2447                                                              completed before
2448                                                              invalidating the
2449                                                              cache.
2450
2451                                                          3. buffer_wbinvl1_vol
2452
2453                                                            - Must happen before
2454                                                              any following
2455                                                              global/generic
2456                                                              load/load
2457                                                              atomic/atomicrmw.
2458                                                            - Ensures that
2459                                                              following loads
2460                                                              will not see stale
2461                                                              global data.
2462
2463      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2464                                - system                  2. s_waitcnt vmcnt(0) &
2465                                                             lgkmcnt(0)
2466
2467                                                            - If OpenCL, omit
2468                                                              lgkmcnt(0).
2469                                                            - Must happen before
2470                                                              following
2471                                                              buffer_wbinvl1_vol.
2472                                                            - Ensures the
2473                                                              atomicrmw has
2474                                                              completed before
2475                                                              invalidating the
2476                                                              cache.
2477
2478                                                          3. buffer_wbinvl1_vol
2479
2480                                                            - Must happen before
2481                                                              any following
2482                                                              global/generic
2483                                                              load/load
2484                                                              atomic/atomicrmw.
2485                                                            - Ensures that
2486                                                              following loads
2487                                                              will not see stale
2488                                                              global data.
2489
2490      fence        acquire      - singlethread *none*     *none*
2491                                - wavefront
2492      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2493
2494                                                            - If OpenCL and
2495                                                              address space is
2496                                                              not generic, omit
2497                                                              waitcnt. However,
2498                                                              since LLVM
2499                                                              currently has no
2500                                                              address space on
2501                                                              the fence need to
2502                                                              conservatively
2503                                                              always generate. If
2504                                                              fence had an
2505                                                              address space then
2506                                                              set to address
2507                                                              space of OpenCL
2508                                                              fence flag, or to
2509                                                              generic if both
2510                                                              local and global
2511                                                              flags are
2512                                                              specified.
2513                                                            - Must happen after
2514                                                              any preceding
2515                                                              local/generic load
2516                                                              atomic/atomicrmw
2517                                                              with an equal or
2518                                                              wider sync scope
2519                                                              and memory ordering
2520                                                              stronger than
2521                                                              unordered (this is
2522                                                              termed the
2523                                                              fence-paired-atomic).
2524                                                            - Must happen before
2525                                                              any following
2526                                                              global/generic
2527                                                              load/load
2528                                                              atomic/store/store
2529                                                              atomic/atomicrmw.
2530                                                            - Ensures any
2531                                                              following global
2532                                                              data read is no
2533                                                              older than the
2534                                                              value read by the
2535                                                              fence-paired-atomic.
2536
2537      fence        acquire      - agent        *none*     1. s_waitcnt vmcnt(0) &
2538                                - system                     lgkmcnt(0)
2539
2540                                                            - If OpenCL and
2541                                                              address space is
2542                                                              not generic, omit
2543                                                              lgkmcnt(0).
2544                                                              However, since LLVM
2545                                                              currently has no
2546                                                              address space on
2547                                                              the fence need to
2548                                                              conservatively
2549                                                              always generate
2550                                                              (see comment for
2551                                                              previous fence).
2552                                                            - Could be split into
2553                                                              separate s_waitcnt
2554                                                              vmcnt(0) and
2555                                                              s_waitcnt
2556                                                              lgkmcnt(0) to allow
2557                                                              them to be
2558                                                              independently moved
2559                                                              according to the
2560                                                              following rules.
2561                                                            - s_waitcnt vmcnt(0)
2562                                                              must happen after
2563                                                              any preceding
2564                                                              global/generic load
2565                                                              atomic/atomicrmw
2566                                                              with an equal or
2567                                                              wider sync scope
2568                                                              and memory ordering
2569                                                              stronger than
2570                                                              unordered (this is
2571                                                              termed the
2572                                                              fence-paired-atomic).
2573                                                            - s_waitcnt lgkmcnt(0)
2574                                                              must happen after
2575                                                              any preceding
2576                                                              group/generic load
2577                                                              atomic/atomicrmw
2578                                                              with an equal or
2579                                                              wider sync scope
2580                                                              and memory ordering
2581                                                              stronger than
2582                                                              unordered (this is
2583                                                              termed the
2584                                                              fence-paired-atomic).
2585                                                            - Must happen before
2586                                                              the following
2587                                                              buffer_wbinvl1_vol.
2588                                                            - Ensures that the
2589                                                              fence-paired atomic
2590                                                              has completed
2591                                                              before invalidating
2592                                                              the
2593                                                              cache. Therefore
2594                                                              any following
2595                                                              locations read must
2596                                                              be no older than
2597                                                              the value read by
2598                                                              the
2599                                                              fence-paired-atomic.
2600
2601                                                          2. buffer_wbinvl1_vol
2602
2603                                                            - Must happen before
2604                                                              any following global/generic
2605                                                              load/load
2606                                                              atomic/store/store
2607                                                              atomic/atomicrmw.
2608                                                            - Ensures that
2609                                                              following loads
2610                                                              will not see stale
2611                                                              global data.
2612
2613      **Release Atomic**
2614      ---------------------------------------------------------------------------
2615      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2616                                - wavefront    - local
2617                                               - generic
2618      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2619                                               - generic
2620                                                            - If OpenCL, omit
2621                                                              waitcnt.
2622                                                            - Must happen after
2623                                                              any preceding
2624                                                              local/generic
2625                                                              load/store/load
2626                                                              atomic/store
2627                                                              atomic/atomicrmw.
2628                                                            - Must happen before
2629                                                              the following
2630                                                              store.
2631                                                            - Ensures that all
2632                                                              memory operations
2633                                                              to local have
2634                                                              completed before
2635                                                              performing the
2636                                                              store that is being
2637                                                              released.
2638
2639                                                          2. buffer/global/flat_store
2640      store atomic release      - workgroup    - local    1. ds_store
2641      store atomic release      - agent        - global   1. s_waitcnt vmcnt(0) &
2642                                - system       - generic     lgkmcnt(0)
2643
2644                                                            - If OpenCL, omit
2645                                                              lgkmcnt(0).
2646                                                            - Could be split into
2647                                                              separate s_waitcnt
2648                                                              vmcnt(0) and
2649                                                              s_waitcnt
2650                                                              lgkmcnt(0) to allow
2651                                                              them to be
2652                                                              independently moved
2653                                                              according to the
2654                                                              following rules.
2655                                                            - s_waitcnt vmcnt(0)
2656                                                              must happen after
2657                                                              any preceding
2658                                                              global/generic
2659                                                              load/store/load
2660                                                              atomic/store
2661                                                              atomic/atomicrmw.
2662                                                            - s_waitcnt lgkmcnt(0)
2663                                                              must happen after
2664                                                              any preceding
2665                                                              local/generic
2666                                                              load/store/load
2667                                                              atomic/store
2668                                                              atomic/atomicrmw.
2669                                                            - Must happen before
2670                                                              the following
2671                                                              store.
2672                                                            - Ensures that all
2673                                                              memory operations
2674                                                              to global have
2675                                                              completed before
2676                                                              performing the
2677                                                              store that is being
2678                                                              released.
2679
2680                                                          2. buffer/global/ds/flat_store
2681      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2682                                - wavefront    - local
2683                                               - generic
2684      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2685                                               - generic
2686                                                            - If OpenCL, omit
2687                                                              waitcnt.
2688                                                            - Must happen after
2689                                                              any preceding
2690                                                              local/generic
2691                                                              load/store/load
2692                                                              atomic/store
2693                                                              atomic/atomicrmw.
2694                                                            - Must happen before
2695                                                              the following
2696                                                              atomicrmw.
2697                                                            - Ensures that all
2698                                                              memory operations
2699                                                              to local have
2700                                                              completed before
2701                                                              performing the
2702                                                              atomicrmw that is
2703                                                              being released.
2704
2705                                                          2. buffer/global/flat_atomic
2706      atomicrmw    release      - workgroup    - local    1. ds_atomic
2707      atomicrmw    release      - agent        - global   1. s_waitcnt vmcnt(0) &
2708                                - system       - generic     lgkmcnt(0)
2709
2710                                                            - If OpenCL, omit
2711                                                              lgkmcnt(0).
2712                                                            - Could be split into
2713                                                              separate s_waitcnt
2714                                                              vmcnt(0) and
2715                                                              s_waitcnt
2716                                                              lgkmcnt(0) to allow
2717                                                              them to be
2718                                                              independently moved
2719                                                              according to the
2720                                                              following rules.
2721                                                            - s_waitcnt vmcnt(0)
2722                                                              must happen after
2723                                                              any preceding
2724                                                              global/generic
2725                                                              load/store/load
2726                                                              atomic/store
2727                                                              atomic/atomicrmw.
2728                                                            - s_waitcnt lgkmcnt(0)
2729                                                              must happen after
2730                                                              any preceding
2731                                                              local/generic
2732                                                              load/store/load
2733                                                              atomic/store
2734                                                              atomic/atomicrmw.
2735                                                            - Must happen before
2736                                                              the following
2737                                                              atomicrmw.
2738                                                            - Ensures that all
2739                                                              memory operations
2740                                                              to global and local
2741                                                              have completed
2742                                                              before performing
2743                                                              the atomicrmw that
2744                                                              is being released.
2745
2746                                                          2. buffer/global/ds/flat_atomic*
2747      fence        release      - singlethread *none*     *none*
2748                                - wavefront
2749      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2750
2751                                                            - If OpenCL and
2752                                                              address space is
2753                                                              not generic, omit
2754                                                              waitcnt. However,
2755                                                              since LLVM
2756                                                              currently has no
2757                                                              address space on
2758                                                              the fence need to
2759                                                              conservatively
2760                                                              always generate
2761                                                              (see comment for
2762                                                              previous fence).
2763                                                            - Must happen after
2764                                                              any preceding
2765                                                              local/generic
2766                                                              load/load
2767                                                              atomic/store/store
2768                                                              atomic/atomicrmw.
2769                                                            - Must happen before
2770                                                              any following store
2771                                                              atomic/atomicrmw
2772                                                              with an equal or
2773                                                              wider sync scope
2774                                                              and memory ordering
2775                                                              stronger than
2776                                                              unordered (this is
2777                                                              termed the
2778                                                              fence-paired-atomic).
2779                                                            - Ensures that all
2780                                                              memory operations
2781                                                              to local have
2782                                                              completed before
2783                                                              performing the
2784                                                              following
2785                                                              fence-paired-atomic.
2786
2787      fence        release      - agent        *none*     1. s_waitcnt vmcnt(0) &
2788                                - system                     lgkmcnt(0)
2789
2790                                                            - If OpenCL and
2791                                                              address space is
2792                                                              not generic, omit
2793                                                              lgkmcnt(0).
2794                                                              However, since LLVM
2795                                                              currently has no
2796                                                              address space on
2797                                                              the fence need to
2798                                                              conservatively
2799                                                              always generate
2800                                                              (see comment for
2801                                                              previous fence).
2802                                                            - Could be split into
2803                                                              separate s_waitcnt
2804                                                              vmcnt(0) and
2805                                                              s_waitcnt
2806                                                              lgkmcnt(0) to allow
2807                                                              them to be
2808                                                              independently moved
2809                                                              according to the
2810                                                              following rules.
2811                                                            - s_waitcnt vmcnt(0)
2812                                                              must happen after
2813                                                              any preceding
2814                                                              global/generic
2815                                                              load/store/load
2816                                                              atomic/store
2817                                                              atomic/atomicrmw.
2818                                                            - s_waitcnt lgkmcnt(0)
2819                                                              must happen after
2820                                                              any preceding
2821                                                              local/generic
2822                                                              load/store/load
2823                                                              atomic/store
2824                                                              atomic/atomicrmw.
2825                                                            - Must happen before
2826                                                              any following store
2827                                                              atomic/atomicrmw
2828                                                              with an equal or
2829                                                              wider sync scope
2830                                                              and memory ordering
2831                                                              stronger than
2832                                                              unordered (this is
2833                                                              termed the
2834                                                              fence-paired-atomic).
2835                                                            - Ensures that all
2836                                                              memory operations
2837                                                              to global have
2838                                                              completed before
2839                                                              performing the
2840                                                              following
2841                                                              fence-paired-atomic.
2842
2843      **Acquire-Release Atomic**
2844      ---------------------------------------------------------------------------
2845      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
2846                                - wavefront    - local
2847                                               - generic
2848      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2849
2850                                                            - If OpenCL, omit
2851                                                              waitcnt.
2852                                                            - Must happen after
2853                                                              any preceding
2854                                                              local/generic
2855                                                              load/store/load
2856                                                              atomic/store
2857                                                              atomic/atomicrmw.
2858                                                            - Must happen before
2859                                                              the following
2860                                                              atomicrmw.
2861                                                            - Ensures that all
2862                                                              memory operations
2863                                                              to local have
2864                                                              completed before
2865                                                              performing the
2866                                                              atomicrmw that is
2867                                                              being released.
2868
2869                                                          2. buffer/global_atomic
2870      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
2871                                                          2. s_waitcnt lgkmcnt(0)
2872
2873                                                            - If OpenCL, omit
2874                                                              waitcnt.
2875                                                            - Must happen before
2876                                                              any following
2877                                                              global/generic
2878                                                              load/load
2879                                                              atomic/store/store
2880                                                              atomic/atomicrmw.
2881                                                            - Ensures any
2882                                                              following global
2883                                                              data read is no
2884                                                              older than the load
2885                                                              atomic value being
2886                                                              acquired.
2887
2888      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2889
2890                                                            - If OpenCL, omit
2891                                                              waitcnt.
2892                                                            - Must happen after
2893                                                              any preceding
2894                                                              local/generic
2895                                                              load/store/load
2896                                                              atomic/store
2897                                                              atomic/atomicrmw.
2898                                                            - Must happen before
2899                                                              the following
2900                                                              atomicrmw.
2901                                                            - Ensures that all
2902                                                              memory operations
2903                                                              to local have
2904                                                              completed before
2905                                                              performing the
2906                                                              atomicrmw that is
2907                                                              being released.
2908
2909                                                          2. flat_atomic
2910                                                          3. s_waitcnt lgkmcnt(0)
2911
2912                                                            - If OpenCL, omit
2913                                                              waitcnt.
2914                                                            - Must happen before
2915                                                              any following
2916                                                              global/generic
2917                                                              load/load
2918                                                              atomic/store/store
2919                                                              atomic/atomicrmw.
2920                                                            - Ensures any
2921                                                              following global
2922                                                              data read is no
2923                                                              older than the load
2924                                                              atomic value being
2925                                                              acquired.
2926      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt vmcnt(0) &
2927                                - system                     lgkmcnt(0)
2928
2929                                                            - If OpenCL, omit
2930                                                              lgkmcnt(0).
2931                                                            - Could be split into
2932                                                              separate s_waitcnt
2933                                                              vmcnt(0) and
2934                                                              s_waitcnt
2935                                                              lgkmcnt(0) to allow
2936                                                              them to be
2937                                                              independently moved
2938                                                              according to the
2939                                                              following rules.
2940                                                            - s_waitcnt vmcnt(0)
2941                                                              must happen after
2942                                                              any preceding
2943                                                              global/generic
2944                                                              load/store/load
2945                                                              atomic/store
2946                                                              atomic/atomicrmw.
2947                                                            - s_waitcnt lgkmcnt(0)
2948                                                              must happen after
2949                                                              any preceding
2950                                                              local/generic
2951                                                              load/store/load
2952                                                              atomic/store
2953                                                              atomic/atomicrmw.
2954                                                            - Must happen before
2955                                                              the following
2956                                                              atomicrmw.
2957                                                            - Ensures that all
2958                                                              memory operations
2959                                                              to global have
2960                                                              completed before
2961                                                              performing the
2962                                                              atomicrmw that is
2963                                                              being released.
2964
2965                                                          2. buffer/global_atomic
2966                                                          3. s_waitcnt vmcnt(0)
2967
2968                                                            - Must happen before
2969                                                              following
2970                                                              buffer_wbinvl1_vol.
2971                                                            - Ensures the
2972                                                              atomicrmw has
2973                                                              completed before
2974                                                              invalidating the
2975                                                              cache.
2976
2977                                                          4. buffer_wbinvl1_vol
2978
2979                                                            - Must happen before
2980                                                              any following
2981                                                              global/generic
2982                                                              load/load
2983                                                              atomic/atomicrmw.
2984                                                            - Ensures that
2985                                                              following loads
2986                                                              will not see stale
2987                                                              global data.
2988
2989      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt vmcnt(0) &
2990                                - system                     lgkmcnt(0)
2991
2992                                                            - If OpenCL, omit
2993                                                              lgkmcnt(0).
2994                                                            - Could be split into
2995                                                              separate s_waitcnt
2996                                                              vmcnt(0) and
2997                                                              s_waitcnt
2998                                                              lgkmcnt(0) to allow
2999                                                              them to be
3000                                                              independently moved
3001                                                              according to the
3002                                                              following rules.
3003                                                            - s_waitcnt vmcnt(0)
3004                                                              must happen after
3005                                                              any preceding
3006                                                              global/generic
3007                                                              load/store/load
3008                                                              atomic/store
3009                                                              atomic/atomicrmw.
3010                                                            - s_waitcnt lgkmcnt(0)
3011                                                              must happen after
3012                                                              any preceding
3013                                                              local/generic
3014                                                              load/store/load
3015                                                              atomic/store
3016                                                              atomic/atomicrmw.
3017                                                            - Must happen before
3018                                                              the following
3019                                                              atomicrmw.
3020                                                            - Ensures that all
3021                                                              memory operations
3022                                                              to global have
3023                                                              completed before
3024                                                              performing the
3025                                                              atomicrmw that is
3026                                                              being released.
3027
3028                                                          2. flat_atomic
3029                                                          3. s_waitcnt vmcnt(0) &
3030                                                             lgkmcnt(0)
3031
3032                                                            - If OpenCL, omit
3033                                                              lgkmcnt(0).
3034                                                            - Must happen before
3035                                                              following
3036                                                              buffer_wbinvl1_vol.
3037                                                            - Ensures the
3038                                                              atomicrmw has
3039                                                              completed before
3040                                                              invalidating the
3041                                                              cache.
3042
3043                                                          4. buffer_wbinvl1_vol
3044
3045                                                            - Must happen before
3046                                                              any following
3047                                                              global/generic
3048                                                              load/load
3049                                                              atomic/atomicrmw.
3050                                                            - Ensures that
3051                                                              following loads
3052                                                              will not see stale
3053                                                              global data.
3054
3055      fence        acq_rel      - singlethread *none*     *none*
3056                                - wavefront
3057      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3058
3059                                                            - If OpenCL and
3060                                                              address space is
3061                                                              not generic, omit
3062                                                              waitcnt. However,
3063                                                              since LLVM
3064                                                              currently has no
3065                                                              address space on
3066                                                              the fence need to
3067                                                              conservatively
3068                                                              always generate
3069                                                              (see comment for
3070                                                              previous fence).
3071                                                            - Must happen after
3072                                                              any preceding
3073                                                              local/generic
3074                                                              load/load
3075                                                              atomic/store/store
3076                                                              atomic/atomicrmw.
3077                                                            - Must happen before
3078                                                              any following
3079                                                              global/generic
3080                                                              load/load
3081                                                              atomic/store/store
3082                                                              atomic/atomicrmw.
3083                                                            - Ensures that all
3084                                                              memory operations
3085                                                              to local have
3086                                                              completed before
3087                                                              performing any
3088                                                              following global
3089                                                              memory operations.
3090                                                            - Ensures that the
3091                                                              preceding
3092                                                              local/generic load
3093                                                              atomic/atomicrmw
3094                                                              with an equal or
3095                                                              wider sync scope
3096                                                              and memory ordering
3097                                                              stronger than
3098                                                              unordered (this is
3099                                                              termed the
3100                                                              fence-paired-atomic)
3101                                                              has completed
3102                                                              before following
3103                                                              global memory
3104                                                              operations. This
3105                                                              satisfies the
3106                                                              requirements of
3107                                                              acquire.
3108                                                            - Ensures that all
3109                                                              previous memory
3110                                                              operations have
3111                                                              completed before a
3112                                                              following
3113                                                              local/generic store
3114                                                              atomic/atomicrmw
3115                                                              with an equal or
3116                                                              wider sync scope
3117                                                              and memory ordering
3118                                                              stronger than
3119                                                              unordered (this is
3120                                                              termed the
3121                                                              fence-paired-atomic).
3122                                                              This satisfies the
3123                                                              requirements of
3124                                                              release.
3125
3126      fence        acq_rel      - agent        *none*     1. s_waitcnt vmcnt(0) &
3127                                - system                     lgkmcnt(0)
3128
3129                                                            - If OpenCL and
3130                                                              address space is
3131                                                              not generic, omit
3132                                                              lgkmcnt(0).
3133                                                              However, since LLVM
3134                                                              currently has no
3135                                                              address space on
3136                                                              the fence need to
3137                                                              conservatively
3138                                                              always generate
3139                                                              (see comment for
3140                                                              previous fence).
3141                                                            - Could be split into
3142                                                              separate s_waitcnt
3143                                                              vmcnt(0) and
3144                                                              s_waitcnt
3145                                                              lgkmcnt(0) to allow
3146                                                              them to be
3147                                                              independently moved
3148                                                              according to the
3149                                                              following rules.
3150                                                            - s_waitcnt vmcnt(0)
3151                                                              must happen after
3152                                                              any preceding
3153                                                              global/generic
3154                                                              load/store/load
3155                                                              atomic/store
3156                                                              atomic/atomicrmw.
3157                                                            - s_waitcnt lgkmcnt(0)
3158                                                              must happen after
3159                                                              any preceding
3160                                                              local/generic
3161                                                              load/store/load
3162                                                              atomic/store
3163                                                              atomic/atomicrmw.
3164                                                            - Must happen before
3165                                                              the following
3166                                                              buffer_wbinvl1_vol.
3167                                                            - Ensures that the
3168                                                              preceding
3169                                                              global/local/generic
3170                                                              load
3171                                                              atomic/atomicrmw
3172                                                              with an equal or
3173                                                              wider sync scope
3174                                                              and memory ordering
3175                                                              stronger than
3176                                                              unordered (this is
3177                                                              termed the
3178                                                              fence-paired-atomic)
3179                                                              has completed
3180                                                              before invalidating
3181                                                              the cache. This
3182                                                              satisfies the
3183                                                              requirements of
3184                                                              acquire.
3185                                                            - Ensures that all
3186                                                              previous memory
3187                                                              operations have
3188                                                              completed before a
3189                                                              following
3190                                                              global/local/generic
3191                                                              store
3192                                                              atomic/atomicrmw
3193                                                              with an equal or
3194                                                              wider sync scope
3195                                                              and memory ordering
3196                                                              stronger than
3197                                                              unordered (this is
3198                                                              termed the
3199                                                              fence-paired-atomic).
3200                                                              This satisfies the
3201                                                              requirements of
3202                                                              release.
3203
3204                                                          2. buffer_wbinvl1_vol
3205
3206                                                            - Must happen before
3207                                                              any following
3208                                                              global/generic
3209                                                              load/load
3210                                                              atomic/store/store
3211                                                              atomic/atomicrmw.
3212                                                            - Ensures that
3213                                                              following loads
3214                                                              will not see stale
3215                                                              global data. This
3216                                                              satisfies the
3217                                                              requirements of
3218                                                              acquire.
3219
3220      **Sequential Consistent Atomic**
3221      ---------------------------------------------------------------------------
3222      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3223                                - wavefront    - local    load atomic acquire*.
3224                                - workgroup    - generic
3225      load atomic  seq_cst      - agent        - global   1. s_waitcnt vmcnt(0)
3226                                - system       - local
3227                                               - generic    - Must happen after
3228                                                              preceding
3229                                                              global/generic load
3230                                                              atomic/store
3231                                                              atomic/atomicrmw
3232                                                              with memory
3233                                                              ordering of seq_cst
3234                                                              and with equal or
3235                                                              wider sync scope.
3236                                                              (Note that seq_cst
3237                                                              fences have their
3238                                                              own s_waitcnt
3239                                                              vmcnt(0) and so do
3240                                                              not need to be
3241                                                              considered.)
3242                                                            - Ensures any
3243                                                              preceding
3244                                                              sequential
3245                                                              consistent global
3246                                                              memory instructions
3247                                                              have completed
3248                                                              before executing
3249                                                              this sequentially
3250                                                              consistent
3251                                                              instruction. This
3252                                                              prevents reordering
3253                                                              a seq_cst store
3254                                                              followed by a
3255                                                              seq_cst load (Note
3256                                                              that seq_cst is
3257                                                              stronger than
3258                                                              acquire/release as
3259                                                              the reordering of
3260                                                              load acquire
3261                                                              followed by a store
3262                                                              release is
3263                                                              prevented by the
3264                                                              waitcnt vmcnt(0) of
3265                                                              the release, but
3266                                                              there is nothing
3267                                                              preventing a store
3268                                                              release followed by
3269                                                              load acquire from
3270                                                              competing out of
3271                                                              order.)
3272
3273                                                          2. *Following
3274                                                             instructions same as
3275                                                             corresponding load
3276                                                             atomic acquire*.
3277
3278      store atomic seq_cst      - singlethread - global   *Same as corresponding
3279                                - wavefront    - local    store atomic release*.
3280                                - workgroup    - generic
3281      store atomic seq_cst      - agent        - global   *Same as corresponding
3282                                - system       - generic  store atomic release*.
3283      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3284                                - wavefront    - local    atomicrmw acq_rel*.
3285                                - workgroup    - generic
3286      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3287                                - system       - generic  atomicrmw acq_rel*.
3288      fence        seq_cst      - singlethread *none*     *Same as corresponding
3289                                - wavefront               fence acq_rel*.
3290                                - workgroup
3291                                - agent
3292                                - system
3293      ============ ============ ============== ========== =======================
3294
3295 The memory order also adds the single thread optimization constrains defined in
3296 table
3297 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3298
3299   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3300      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3301
3302      ============ ==============================================================
3303      LLVM Memory  Optimization Constraints
3304      Ordering
3305      ============ ==============================================================
3306      unordered    *none*
3307      monotonic    *none*
3308      acquire      - If a load atomic/atomicrmw then no following load/load
3309                     atomic/store/ store atomic/atomicrmw/fence instruction can
3310                     be moved before the acquire.
3311                   - If a fence then same as load atomic, plus no preceding
3312                     associated fence-paired-atomic can be moved after the fence.
3313      release      - If a store atomic/atomicrmw then no preceding load/load
3314                     atomic/store/ store atomic/atomicrmw/fence instruction can
3315                     be moved after the release.
3316                   - If a fence then same as store atomic, plus no following
3317                     associated fence-paired-atomic can be moved before the
3318                     fence.
3319      acq_rel      Same constraints as both acquire and release.
3320      seq_cst      - If a load atomic then same constraints as acquire, plus no
3321                     preceding sequentially consistent load atomic/store
3322                     atomic/atomicrmw/fence instruction can be moved after the
3323                     seq_cst.
3324                   - If a store atomic then the same constraints as release, plus
3325                     no following sequentially consistent load atomic/store
3326                     atomic/atomicrmw/fence instruction can be moved before the
3327                     seq_cst.
3328                   - If an atomicrmw/fence then same constraints as acq_rel.
3329      ============ ==============================================================
3330
3331 Trap Handler ABI
3332 ~~~~~~~~~~~~~~~~
3333
3334 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3335 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3336 the ``s_trap`` instruction with the following usage:
3337
3338   .. table:: AMDGPU Trap Handler for AMDHSA OS
3339      :name: amdgpu-trap-handler-for-amdhsa-os-table
3340
3341      =================== =============== =============== =======================
3342      Usage               Code Sequence   Trap Handler    Description
3343                                          Inputs
3344      =================== =============== =============== =======================
3345      reserved            ``s_trap 0x00``                 Reserved by hardware.
3346      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3347                                            ``queue_ptr`` ``debugtrap``
3348                                          ``VGPR0``:      intrinsic (not
3349                                            ``arg``       implemented).
3350      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3351                                            ``queue_ptr`` terminated and its
3352                                                          associated queue put
3353                                                          into the error state.
3354      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3355                                            ``queue_ptr`` installed handled
3356                                                          same as ``llvm.trap``.
3357      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3358                                                          breakpoints.
3359      debugger            ``s_trap 0x08``                 Reserved for debugger.
3360      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3361      debugger            ``s_trap 0xff``                 Reserved for debugger.
3362      =================== =============== =============== =======================
3363
3364 Non-AMDHSA
3365 ----------
3366
3367 Trap Handler ABI
3368 ~~~~~~~~~~~~~~~~
3369
3370 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3371 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3372 instructions are handled as follows:
3373
3374   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3375      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3376
3377      =============== =============== ===========================================
3378      Usage           Code Sequence   Description
3379      =============== =============== ===========================================
3380      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3381      llvm.debugtrap  *none*          Compiler warning given that there is no
3382                                      trap handler installed.
3383      =============== =============== ===========================================
3384
3385 Source Languages
3386 ================
3387
3388 .. _amdgpu-opencl:
3389
3390 OpenCL
3391 ------
3392
3393 When generating code for the OpenCL language the target triple environment
3394 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3395
3396 When the language is OpenCL the following differences occur:
3397
3398 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3399 2. The AMDGPU backend adds additional arguments to the kernel.
3400 3. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
3401
3402 .. TODO
3403    Specify what affect this has. Hidden arguments added. Additional metadata
3404    generated.
3405
3406 .. _amdgpu-hcc:
3407
3408 HCC
3409 ---
3410
3411 When generating code for the OpenCL language the target triple environment
3412 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3413
3414 When the language is OpenCL the following differences occur:
3415
3416 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3417
3418 .. TODO
3419    Specify what affect this has.
3420
3421 Assembler
3422 ---------
3423
3424 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3425 It supports AMDGCN GFX6-GFX8.
3426
3427 This section describes general syntax for instructions and operands. For more
3428 information about instructions, their semantics and supported combinations of
3429 operands, refer to one of instruction set architecture manuals
3430 [AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
3431
3432 An instruction has the following syntax (register operands are normally
3433 comma-separated while extra operands are space-separated):
3434
3435 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3436
3437 Operands
3438 ~~~~~~~~
3439
3440 The following syntax for register operands is supported:
3441
3442 * SGPR registers: s0, ... or s[0], ...
3443 * VGPR registers: v0, ... or v[0], ...
3444 * TTMP registers: ttmp0, ... or ttmp[0], ...
3445 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3446 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3447 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3448 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3449 * Register index expressions: v[2*2], s[1-1:2-1]
3450 * 'off' indicates that an operand is not enabled
3451
3452 The following extra operands are supported:
3453
3454 * offset, offset0, offset1
3455 * idxen, offen bits
3456 * glc, slc, tfe bits
3457 * waitcnt: integer or combination of counter values
3458 * VOP3 modifiers:
3459
3460   - abs (\| \|), neg (\-)
3461
3462 * DPP modifiers:
3463
3464   - row_shl, row_shr, row_ror, row_rol
3465   - row_mirror, row_half_mirror, row_bcast
3466   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3467   - row_mask, bank_mask, bound_ctrl
3468
3469 * SDWA modifiers:
3470
3471   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3472   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3473   - abs, neg, sext
3474
3475 Instruction Examples
3476 ~~~~~~~~~~~~~~~~~~~~
3477
3478 DS
3479 ~~
3480
3481 .. code-block:: nasm
3482
3483   ds_add_u32 v2, v4 offset:16
3484   ds_write_src2_b64 v2 offset0:4 offset1:8
3485   ds_cmpst_f32 v2, v4, v6
3486   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3487
3488
3489 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3490
3491 FLAT
3492 ++++
3493
3494 .. code-block:: nasm
3495
3496   flat_load_dword v1, v[3:4]
3497   flat_store_dwordx3 v[3:4], v[5:7]
3498   flat_atomic_swap v1, v[3:4], v5 glc
3499   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3500   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3501
3502 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3503
3504 MUBUF
3505 +++++
3506
3507 .. code-block:: nasm
3508
3509   buffer_load_dword v1, off, s[4:7], s1
3510   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3511   buffer_store_format_xy v[1:2], off, s[4:7], s1
3512   buffer_wbinvl1
3513   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3514
3515 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3516
3517 SMRD/SMEM
3518 +++++++++
3519
3520 .. code-block:: nasm
3521
3522   s_load_dword s1, s[2:3], 0xfc
3523   s_load_dwordx8 s[8:15], s[2:3], s4
3524   s_load_dwordx16 s[88:103], s[2:3], s4
3525   s_dcache_inv_vol
3526   s_memtime s[4:5]
3527
3528 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3529
3530 SOP1
3531 ++++
3532
3533 .. code-block:: nasm
3534
3535   s_mov_b32 s1, s2
3536   s_mov_b64 s[0:1], 0x80000000
3537   s_cmov_b32 s1, 200
3538   s_wqm_b64 s[2:3], s[4:5]
3539   s_bcnt0_i32_b64 s1, s[2:3]
3540   s_swappc_b64 s[2:3], s[4:5]
3541   s_cbranch_join s[4:5]
3542
3543 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3544
3545 SOP2
3546 ++++
3547
3548 .. code-block:: nasm
3549
3550   s_add_u32 s1, s2, s3
3551   s_and_b64 s[2:3], s[4:5], s[6:7]
3552   s_cselect_b32 s1, s2, s3
3553   s_andn2_b32 s2, s4, s6
3554   s_lshr_b64 s[2:3], s[4:5], s6
3555   s_ashr_i32 s2, s4, s6
3556   s_bfm_b64 s[2:3], s4, s6
3557   s_bfe_i64 s[2:3], s[4:5], s6
3558   s_cbranch_g_fork s[4:5], s[6:7]
3559
3560 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3561
3562 SOPC
3563 ++++
3564
3565 .. code-block:: nasm
3566
3567   s_cmp_eq_i32 s1, s2
3568   s_bitcmp1_b32 s1, s2
3569   s_bitcmp0_b64 s[2:3], s4
3570   s_setvskip s3, s5
3571
3572 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3573
3574 SOPP
3575 ++++
3576
3577 .. code-block:: nasm
3578
3579   s_barrier
3580   s_nop 2
3581   s_endpgm
3582   s_waitcnt 0 ; Wait for all counters to be 0
3583   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3584   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3585   s_sethalt 9
3586   s_sleep 10
3587   s_sendmsg 0x1
3588   s_sendmsg sendmsg(MSG_INTERRUPT)
3589   s_trap 1
3590
3591 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3592
3593 Unless otherwise mentioned, little verification is performed on the operands
3594 of SOPP Instructions, so it is up to the programmer to be familiar with the
3595 range or acceptable values.
3596
3597 VALU
3598 ++++
3599
3600 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3601 the assembler will automatically use optimal encoding based on its operands.
3602 To force specific encoding, one can add a suffix to the opcode of the instruction:
3603
3604 * _e32 for 32-bit VOP1/VOP2/VOPC
3605 * _e64 for 64-bit VOP3
3606 * _dpp for VOP_DPP
3607 * _sdwa for VOP_SDWA
3608
3609 VOP1/VOP2/VOP3/VOPC examples:
3610
3611 .. code-block:: nasm
3612
3613   v_mov_b32 v1, v2
3614   v_mov_b32_e32 v1, v2
3615   v_nop
3616   v_cvt_f64_i32_e32 v[1:2], v2
3617   v_floor_f32_e32 v1, v2
3618   v_bfrev_b32_e32 v1, v2
3619   v_add_f32_e32 v1, v2, v3
3620   v_mul_i32_i24_e64 v1, v2, 3
3621   v_mul_i32_i24_e32 v1, -3, v3
3622   v_mul_i32_i24_e32 v1, -100, v3
3623   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3624   v_max_f16_e32 v1, v2, v3
3625
3626 VOP_DPP examples:
3627
3628 .. code-block:: nasm
3629
3630   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3631   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3632   v_mov_b32 v0, v0 wave_shl:1
3633   v_mov_b32 v0, v0 row_mirror
3634   v_mov_b32 v0, v0 row_bcast:31
3635   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3636   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3637   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3638
3639 VOP_SDWA examples:
3640
3641 .. code-block:: nasm
3642
3643   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3644   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3645   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3646   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3647   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3648
3649 For full list of supported instructions, refer to "Vector ALU instructions".
3650
3651 HSA Code Object Directives
3652 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3653
3654 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3655 one can specify them with assembler directives.
3656
3657 .hsa_code_object_version major, minor
3658 +++++++++++++++++++++++++++++++++++++
3659
3660 *major* and *minor* are integers that specify the version of the HSA code
3661 object that will be generated by the assembler.
3662
3663 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3664 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3665
3666
3667 *major*, *minor*, and *stepping* are all integers that describe the instruction
3668 set architecture (ISA) version of the assembly program.
3669
3670 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
3671 "AMD" and *arch* should always be equal to "AMDGPU".
3672
3673 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3674 from the value of the -mcpu option that is passed to the assembler.
3675
3676 .amdgpu_hsa_kernel (name)
3677 +++++++++++++++++++++++++
3678
3679 This directives specifies that the symbol with given name is a kernel entry point
3680 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3681
3682 .amd_kernel_code_t
3683 ++++++++++++++++++
3684
3685 This directive marks the beginning of a list of key / value pairs that are used
3686 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3687 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
3688 any amd_kernel_code_t values that are unspecified a default value will be
3689 used.  The default value for all keys is 0, with the following exceptions:
3690
3691 - *kernel_code_version_major* defaults to 1.
3692 - *machine_kind* defaults to 1.
3693 - *machine_version_major*, *machine_version_minor*, and
3694   *machine_version_stepping* are derived from the value of the -mcpu option
3695   that is passed to the assembler.
3696 - *kernel_code_entry_byte_offset* defaults to 256.
3697 - *wavefront_size* defaults to 6.
3698 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3699   *private_segment_alignment* default to 4.  Note that alignments are specified
3700   as a power of two, so a value of **n** means an alignment of 2^ **n**.
3701
3702 The *.amd_kernel_code_t* directive must be placed immediately after the
3703 function label and before any instructions.
3704
3705 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3706 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3707
3708 Here is an example of a minimal amd_kernel_code_t specification:
3709
3710 .. code-block:: none
3711
3712    .hsa_code_object_version 1,0
3713    .hsa_code_object_isa
3714
3715    .hsatext
3716    .globl  hello_world
3717    .p2align 8
3718    .amdgpu_hsa_kernel hello_world
3719
3720    hello_world:
3721
3722       .amd_kernel_code_t
3723          enable_sgpr_kernarg_segment_ptr = 1
3724          is_ptr64 = 1
3725          compute_pgm_rsrc1_vgprs = 0
3726          compute_pgm_rsrc1_sgprs = 0
3727          compute_pgm_rsrc2_user_sgpr = 2
3728          kernarg_segment_byte_size = 8
3729          wavefront_sgpr_count = 2
3730          workitem_vgpr_count = 3
3731      .end_amd_kernel_code_t
3732
3733      s_load_dwordx2 s[0:1], s[0:1] 0x0
3734      v_mov_b32 v0, 3.14159
3735      s_waitcnt lgkmcnt(0)
3736      v_mov_b32 v1, s0
3737      v_mov_b32 v2, s1
3738      flat_store_dword v[1:2], v0
3739      s_endpgm
3740    .Lfunc_end0:
3741         .size   hello_world, .Lfunc_end0-hello_world
3742
3743 Additional Documentation
3744 ========================
3745
3746 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3747 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3748 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3749 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3750 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3751 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3752 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3753 .. [AMD-OpenCL_Programming-Guide]  `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3754 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3755 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3756 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3757 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3758 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3759 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3760 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3761 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3762 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3763 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__