docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Target Triples
  27      :name: amdgpu-target-triples-table
  28
  29      ============ ======== ========= ===========
  30      Architecture Vendor   OS        Environment
  31      ============ ======== ========= ===========
  32      r600         amd      <empty>   <empty>
  33      amdgcn       amd      <empty>   <empty>
  34      amdgcn       amd      amdhsa    <empty>
  35      amdgcn       amd      amdhsa    opencl
  36      amdgcn       amd      amdhsa    amdgizcl
  37      amdgcn       amd      amdhsa    amdgiz
  38      amdgcn       amd      amdhsa    hcc
  39      ============ ======== ========= ===========
  40
  41 ``r600-amd--``
  42   Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
  43   the MESA runtime.
  44
  45 ``amdgcn-amd--``
  46   Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
  47   the MESA runtime.
  48
  49 ``amdgcn-amd-amdhsa-``
  50   Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
  51   compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
  52
  53 ``amdgcn-amd-amdhsa-opencl``
  54   Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
  55   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  56   :ref:`amdgpu-opencl`.
  57
  58 ``amdgcn-amd-amdhsa-amdgizcl``
  59   Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
  60   is used (see :ref:`amdgpu-address-spaces`).
  61
  62 ``amdgcn-amd-amdhsa-amdgiz``
  63   Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
  64   used (see :ref:`amdgpu-address-spaces`).
  65
  66 ``amdgcn-amd-amdhsa-hcc``
  67   Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
  68   executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  69   :ref:`amdgpu-hcc`.
  70
  71 .. _amdgpu-processors:
  72
  73 Processors
  74 ----------
  75
  76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  77 names from both the *Processor* and *Alternative Processor* can be used.
  78
  79   .. table:: AMDGPU Processors
  80      :name: amdgpu-processors-table
  81
  82      ========== =========== ============ ===== ======= ==================
  83      Processor  Alternative Target       dGPU/ Runtime Example
  84                 Processor   Triple       APU   Support Products
  85                             Architecture
  86      ========== =========== ============ ===== ======= ==================
  87      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
  88      --------------------------------------------------------------------
  89      r600                   r600         dGPU
  90      r630                   r600         dGPU
  91      rs880                  r600         dGPU
  92      rv670                  r600         dGPU
  93      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
  94      --------------------------------------------------------------------
  95      rv710                  r600         dGPU
  96      rv730                  r600         dGPU
  97      rv770                  r600         dGPU
  98      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
  99      --------------------------------------------------------------------
 100      cedar                  r600         dGPU
 101      redwood                r600         dGPU
 102      sumo                   r600         dGPU
 103      juniper                r600         dGPU
 104      cypress                r600         dGPU
 105      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 106      --------------------------------------------------------------------
 107      barts                  r600         dGPU
 108      turks                  r600         dGPU
 109      caicos                 r600         dGPU
 110      cayman                 r600         dGPU
 111      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 112      --------------------------------------------------------------------
 113      gfx600     - tahiti    amdgcn       dGPU
 114      gfx601     - pitcairn  amdgcn       dGPU
 115                 - verde
 116                 - oland
 117                 - hainan
 118      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 119      --------------------------------------------------------------------
 120      gfx700     - bonaire   amdgcn       dGPU          - Radeon HD 7790
 121                                                        - Radeon HD 8770
 122                                                        - R7 260
 123                                                        - R7 260X
 124      \          - kaveri    amdgcn       APU           - A6-7000
 125                                                        - A6 Pro-7050B
 126                                                        - A8-7100
 127                                                        - A8 Pro-7150B
 128                                                        - A10-7300
 129                                                        - A10 Pro-7350B
 130                                                        - FX-7500
 131                                                        - A8-7200P
 132                                                        - A10-7400P
 133                                                        - FX-7600P
 134      gfx701     - hawaii    amdgcn       dGPU  ROCm    - FirePro W8100
 135                                                        - FirePro W9100
 136                                                        - FirePro S9150
 137                                                        - FirePro S9170
 138      gfx702                              dGPU  ROCm    - Radeon R9 290
 139                                                        - Radeon R9 290x
 140                                                        - Radeon R390
 141                                                        - Radeon R390x
 142      gfx703     - kabini    amdgcn       APU           - E1-2100
 143                 - mullins                              - E1-2200
 144                                                        - E1-2500
 145                                                        - E2-3000
 146                                                        - E2-3800
 147                                                        - A4-5000
 148                                                        - A4-5100
 149                                                        - A6-5200
 150                                                        - A4 Pro-3340B
 151      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 152      --------------------------------------------------------------------
 153      gfx800     - iceland   amdgcn       dGPU          - FirePro S7150
 154                                                        - FirePro S7100
 155                                                        - FirePro W7100
 156                                                        - Radeon R285
 157                                                        - Radeon R9 380
 158                                                        - Radeon R9 385
 159                                                        - Mobile FirePro
 160                                                          M7170
 161      gfx801     - carrizo   amdgcn       APU           - A6-8500P
 162                                                        - Pro A6-8500B
 163                                                        - A8-8600P
 164                                                        - Pro A8-8600B
 165                                                        - FX-8800P
 166                                                        - Pro A12-8800B
 167      \                      amdgcn       APU   ROCm    - A10-8700P
 168                                                        - Pro A10-8700B
 169                                                        - A10-8780P
 170      \                      amdgcn       APU           - A10-9600P
 171                                                        - A10-9630P
 172                                                        - A12-9700P
 173                                                        - A12-9730P
 174                                                        - FX-9800P
 175                                                        - FX-9830P
 176      \                      amdgcn       APU           - E2-9010
 177                                                        - A6-9210
 178                                                        - A9-9410
 179      gfx802     - tonga     amdgcn       dGPU  ROCm    Same as gfx800
 180      gfx803     - fiji      amdgcn       dGPU  ROCm    - Radeon R9 Nano
 181                                                        - Radeon R9 Fury
 182                                                        - Radeon R9 FuryX
 183                                                        - Radeon Pro Duo
 184                                                        - FirePro S9300x2
 185                                                        - Radeon Instinct MI8
 186      \          - polaris10 amdgcn       dGPU  ROCm    - Radeon RX 470
 187                                                        - Radeon RX 480
 188                                                        - Radeon Instinct MI6
 189      \          - polaris11 amdgcn       dGPU  ROCm    - Radeon RX 460
 190      gfx804                 amdgcn       dGPU          Same as gfx803
 191      gfx810     - stoney    amdgcn       APU
 192      **GCN GFX9** [AMD-GCN-GFX9]_
 193      --------------------------------------------------------------------
 194      gfx900                 amdgcn       dGPU          - Radeon Vega
 195                                                          Frontier Edition
 196                                                        - Radeon RX Vega 56
 197                                                        - Radeon RX Vega 64
 198                                                        - Radeon RX Vega 64
 199                                                          Liquid
 200                                                        - Radeon Instinct MI25
 201      gfx901                 amdgcn       dGPU  ROCm    Same as gfx900
 202                                                        except XNACK is
 203                                                        enabled
 204      gfx902                 amdgcn       APU           *TBA*
 205
 206                                                        .. TODO
 207                                                           Add product
 208                                                           names.
 209      gfx903                 amdgcn       APU           Same as gfx902
 210                                                        except XNACK is
 211                                                        enabled
 212      ========== =========== ============ ===== ======= ==================
 213
 214 .. _amdgpu-address-spaces:
 215
 216 Address Spaces
 217 --------------
 218
 219 The AMDGPU backend uses the following address space mappings.
 220
 221 The memory space names used in the table, aside from the region memory space, is
 222 from the OpenCL standard.
 223
 224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 225
 226   .. table:: Address Space Mapping
 227      :name: amdgpu-address-space-mapping-table
 228
 229      ================== ================= ================= ================= =================
 230      LLVM Address Space Memory Space
 231      ------------------ -----------------------------------------------------------------------
 232      \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
 233      ================== ================= ================= ================= =================
 234      0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
 235      1                  Global            Global            Global            Global
 236      2                  Constant          Constant          Constant          Region (GDS)
 237      3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
 238      4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
 239      5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
 240      ================== ================= ================= ================= =================
 241
 242 Current Default
 243   This is the current default address space mapping used for all languages
 244   except hcc. This will shortly be deprecated.
 245
 246 amdgiz/amdgizcl
 247   This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
 248   is specified as the target triple environment value.
 249
 250 hcc
 251   This is the current address space mapping used when ``hcc`` is specified as
 252   the target triple environment value.This will shortly be deprecated.
 253
 254 Future Default
 255   This will shortly be the only address space mapping for all languages using
 256   AMDGPU backend.
 257
 258 .. _amdgpu-memory-scopes:
 259
 260 Memory Scopes
 261 -------------
 262
 263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 264 backend memory model when the target triple OS is ``amdhsa`` (see
 265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 266
 267 The memory model supported is based on the HSA memory model [HSA]_ which is
 268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 269 relation is transitive over the synchonizes-with relation independent of scope,
 270 and synchonizes-with allows the memory scope instances to be inclusive (see
 271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
 272
 273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 274 inclusion and requires the memory scopes to exactly match. However, this
 275 is conservatively correct for OpenCL.
 276
 277   .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
 278      :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
 279
 280      ================ ==========================================================
 281      LLVM Sync Scope  Description
 282      ================ ==========================================================
 283      *none*           The default: ``system``.
 284
 285                       Synchronizes with, and participates in modification and
 286                       seq_cst total orderings with, other operations (except
 287                       image operations) for all address spaces (except private,
 288                       or generic that accesses private) provided the other
 289                       operation's sync scope is:
 290
 291                       - ``system``.
 292                       - ``agent`` and executed by a thread on the same agent.
 293                       - ``workgroup`` and executed by a thread in the same
 294                         workgroup.
 295                       - ``wavefront`` and executed by a thread in the same
 296                         wavefront.
 297
 298      ``agent``        Synchronizes with, and participates in modification and
 299                       seq_cst total orderings with, other operations (except
 300                       image operations) for all address spaces (except private,
 301                       or generic that accesses private) provided the other
 302                       operation's sync scope is:
 303
 304                       - ``system`` or ``agent`` and executed by a thread on the
 305                         same agent.
 306                       - ``workgroup`` and executed by a thread in the same
 307                         workgroup.
 308                       - ``wavefront`` and executed by a thread in the same
 309                         wavefront.
 310
 311      ``workgroup``    Synchronizes with, and participates in modification and
 312                       seq_cst total orderings with, other operations (except
 313                       image operations) for all address spaces (except private,
 314                       or generic that accesses private) provided the other
 315                       operation's sync scope is:
 316
 317                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 318                         thread in the same workgroup.
 319                       - ``wavefront`` and executed by a thread in the same
 320                         wavefront.
 321
 322      ``wavefront``    Synchronizes with, and participates in modification and
 323                       seq_cst total orderings with, other operations (except
 324                       image operations) for all address spaces (except private,
 325                       or generic that accesses private) provided the other
 326                       operation's sync scope is:
 327
 328                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 329                         and executed by a thread in the same wavefront.
 330
 331      ``singlethread`` Only synchronizes with, and participates in modification
 332                       and seq_cst total orderings with, other operations (except
 333                       image operations) running in the same thread for all
 334                       address spaces (for example, in signal handlers).
 335      ================ ==========================================================
 336
 337 AMDGPU Intrinsics
 338 -----------------
 339
 340 The AMDGPU backend implements the following intrinsics.
 341
 342 *This section is WIP.*
 343
 344 .. TODO
 345    List AMDGPU intrinsics
 346
 347 Code Object
 348 ===========
 349
 350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 351 can be linked by ``lld`` to produce a standard ELF shared code object which can
 352 be loaded and executed on an AMDGPU target.
 353
 354 Header
 355 ------
 356
 357 The AMDGPU backend uses the following ELF header:
 358
 359   .. table:: AMDGPU ELF Header
 360      :name: amdgpu-elf-header-table
 361
 362      ========================== ===============================
 363      Field                      Value
 364      ========================== ===============================
 365      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 366      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 367      ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA``,
 368                                 ``ELFOSABI_AMDGPU_PAL`` or
 369                                 ``ELFOSABI_AMDGPU_MESA3D``
 370      ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
 371                                 ``ELFABIVERSION_AMDGPU_PAL`` or
 372                                 ``ELFABIVERSION_AMDGPU_MESA3D``
 373      ``e_type``                 ``ET_REL`` or ``ET_DYN``
 374      ``e_machine``              ``EM_AMDGPU``
 375      ``e_entry``                0
 376      ``e_flags``                0
 377      ========================== ===============================
 378
 379 ..
 380
 381   .. table:: AMDGPU ELF Header Enumeration Values
 382      :name: amdgpu-elf-header-enumeration-values-table
 383
 384      =============================== =====
 385      Name                            Value
 386      =============================== =====
 387      ``EM_AMDGPU``                   224
 388      ``ELFOSABI_AMDGPU_HSA``         64
 389      ``ELFOSABI_AMDGPU_PAL``         65
 390      ``ELFOSABI_AMDGPU_MESA3D``      66
 391      ``ELFABIVERSION_AMDGPU_HSA``    1
 392      ``ELFABIVERSION_AMDGPU_PAL``    0
 393      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 394      =============================== =====
 395
 396 ``e_ident[EI_CLASS]``
 397   The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
 398   bit applications.
 399
 400 ``e_ident[EI_DATA]``
 401   All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
 402
 403 ``e_ident[EI_OSABI]``
 404   One of the following AMD GPU architecture specific OS ABIs:
 405
 406   * ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
 407     the AMD HSA runtime ABI [HSA]_.
 408
 409   * ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
 410     the AMD PAL runtime ABI.
 411
 412   * ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
 413     to the AMD MESA runtime ABI.
 414
 415 ``e_ident[EI_ABIVERSION]``
 416   The ABI version of the AMD GPU architecture specific OS ABI to which the code
 417   object conforms:
 418
 419   * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
 420     runtime ABI.
 421
 422   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
 423     runtime ABI.
 424
 425   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
 426     runtime ABI.
 427
 428 ``e_type``
 429   Can be one of the following values:
 430
 431
 432   ``ET_REL``
 433     The type produced by the AMD GPU backend compiler as it is relocatable code
 434     object.
 435
 436   ``ET_DYN``
 437     The type produced by the linker as it is a shared code object.
 438
 439   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 440
 441 ``e_machine``
 442   The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
 443   architecture family. The specific member is specified in the
 444   ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
 445   :ref:`amdgpu-note-records`).
 446
 447 ``e_entry``
 448   The entry point is 0 as the entry points for individual kernels must be
 449   selected in order to invoke them through AQL packets.
 450
 451 ``e_flags``
 452   The value is 0 as no flags are used.
 453
 454 Sections
 455 --------
 456
 457 An AMDGPU target ELF code object has the standard ELF sections which include:
 458
 459   .. table:: AMDGPU ELF Sections
 460      :name: amdgpu-elf-sections-table
 461
 462      ================== ================ =================================
 463      Name               Type             Attributes
 464      ================== ================ =================================
 465      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 466      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 467      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 468      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 469      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 470      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 471      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 472      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 473      ``.note``          ``SHT_NOTE``     *none*
 474      ``.rela``\ *name*  ``SHT_RELA``     *none*
 475      ``.rela.dyn``      ``SHT_RELA``     *none*
 476      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 477      ``.shstrtab``      ``SHT_STRTAB``   *none*
 478      ``.strtab``        ``SHT_STRTAB``   *none*
 479      ``.symtab``        ``SHT_SYMTAB``   *none*
 480      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 481      ================== ================ =================================
 482
 483 These sections have their standard meanings (see [ELF]_) and are only generated
 484 if needed.
 485
 486 ``.debug``\ *\**
 487   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 488   DWARF produced by the AMDGPU backend.
 489
 490 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 491   The standard sections used by a dynamic loader.
 492
 493 ``.note``
 494   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 495   backend.
 496
 497 ``.rela``\ *name*, ``.rela.dyn``
 498   For relocatable code objects, *name* is the name of the section that the
 499   relocation records apply. For example, ``.rela.text`` is the section name for
 500   relocation records associated with the ``.text`` section.
 501
 502   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 503   records from each of the relocatable code object's ``.rela``\ *name* sections.
 504
 505   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 506   the AMDGPU backend.
 507
 508 ``.text``
 509   The executable machine code for the kernels and functions they call. Generated
 510   as position independent code. See :ref:`amdgpu-code-conventions` for
 511   information on conventions used in the isa generation.
 512
 513 .. _amdgpu-note-records:
 514
 515 Note Records
 516 ------------
 517
 518 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
 519 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
 520 minimal zero byte padding must be generated to ensure the ``desc`` field size is
 521 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
 522 be at least 4 to indicate at least 8 byte alignment.
 523
 524 The AMDGPU backend code object uses the following ELF note records in the
 525 ``.note`` section. The *Description* column specifies the layout of the note
 526 record's ``desc`` field. All fields are consecutive bytes. Note records with
 527 variable size strings have a corresponding ``*_size`` field that specifies the
 528 number of bytes, including the terminating null character, in the string. The
 529 string(s) come immediately after the preceding fields.
 530
 531 Additional note records can be present.
 532
 533   .. table:: AMDGPU ELF Note Records
 534      :name: amdgpu-elf-note-records-table
 535
 536      ===== ============================== ======================================
 537      Name  Type                           Description
 538      ===== ============================== ======================================
 539      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 540      "AMD" ``NT_AMD_AMDGPU_ISA``          <isa name null terminated string>
 541      ===== ============================== ======================================
 542
 543 ..
 544
 545   .. table:: AMDGPU ELF Note Record Enumeration Values
 546      :name: amdgpu-elf-note-record-enumeration-values-table
 547
 548      ============================== =====
 549      Name                           Value
 550      ============================== =====
 551      *reserved*                       0-9
 552      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 553      ``NT_AMD_AMDGPU_ISA``             11
 554      ============================== =====
 555
 556 ``NT_AMD_AMDGPU_ISA``
 557   Specifies the instruction set architecture used by the machine code contained
 558   in the code object.
 559
 560   This note record is required for code objects containing machine code for
 561   processors matching the ``amdgcn`` architecture in table
 562   :ref:`amdgpu-processors`.
 563
 564   The null terminated string has the following syntax:
 565
 566     *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
 567
 568   where:
 569
 570     *architecture*
 571       The architecture from table :ref:`amdgpu-target-triples-table`.
 572
 573       This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
 574       :ref:`amdgpu-target-triples`).
 575
 576     *vendor*
 577       The vendor from table :ref:`amdgpu-target-triples-table`.
 578
 579       For the AMDGPU backend this is always ``amd``.
 580
 581     *os*
 582       The OS from table :ref:`amdgpu-target-triples-table`.
 583
 584     *environment*
 585       An environment from table :ref:`amdgpu-target-triples-table`, or blank if
 586       the environment has no affect on the execution of the code object.
 587
 588       For the AMDGPU backend this is currently always blank.
 589     *processor*
 590       The processor from table :ref:`amdgpu-processors-table`.
 591
 592   For example:
 593
 594     ``amdgcn-amd-amdhsa--gfx901``
 595
 596 ``NT_AMD_AMDGPU_HSA_METADATA``
 597   Specifies extensible metadata associated with the code objects executed on HSA
 598   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 599   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 600   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
 601   object metadata string.
 602
 603 .. _amdgpu-symbols:
 604
 605 Symbols
 606 -------
 607
 608 Symbols include the following:
 609
 610   .. table:: AMDGPU ELF Symbols
 611      :name: amdgpu-elf-symbols-table
 612
 613      ===================== ============== ============= ==================
 614      Name                  Type           Section       Description
 615      ===================== ============== ============= ==================
 616      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
 617                                           - ``.rodata``
 618                                           - ``.bss``
 619      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
 620      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
 621      ===================== ============== ============= ==================
 622
 623 Global variable
 624   Global variables both used and defined by the compilation unit.
 625
 626   If the symbol is defined in the compilation unit then it is allocated in the
 627   appropriate section according to if it has initialized data or is readonly.
 628
 629   If the symbol is external then its section is ``STN_UNDEF`` and the loader
 630   will resolve relocations using the definition provided by another code object
 631   or explicitly defined by the runtime.
 632
 633   All global symbols, whether defined in the compilation unit or external, are
 634   accessed by the machine code indirectly through a GOT table entry. This
 635   allows them to be preemptable. The GOT table is only supported when the target
 636   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 637
 638   .. TODO
 639      Add description of linked shared object symbols. Seems undefined symbols
 640      are marked as STT_NOTYPE.
 641
 642 Kernel descriptor
 643   Every HSA kernel has an associated kernel descriptor. It is the address of the
 644   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 645   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 646   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 647
 648 Kernel entry point
 649   Every HSA kernel also has a symbol for its machine code entry point.
 650
 651 .. _amdgpu-relocation-records:
 652
 653 Relocation Records
 654 ------------------
 655
 656 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 657 relocatable fields are:
 658
 659 ``word32``
 660   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 661   alignment. These values use the same byte order as other word values in the
 662   AMD GPU architecture.
 663
 664 ``word64``
 665   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 666   alignment. These values use the same byte order as other word values in the
 667   AMD GPU architecture.
 668
 669 Following notations are used for specifying relocation calculations:
 670
 671 **A**
 672   Represents the addend used to compute the value of the relocatable field.
 673
 674 **G**
 675   Represents the offset into the global offset table at which the relocation
 676   entry's symbol will reside during execution.
 677
 678 **GOT**
 679   Represents the address of the global offset table.
 680
 681 **P**
 682   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
 683   of the storage unit being relocated (computed using ``r_offset``).
 684
 685 **S**
 686   Represents the value of the symbol whose index resides in the relocation
 687   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
 688
 689 **B**
 690   Represents the base address of a loaded executable or shared object which is
 691   the difference between the ELF address and the actual load address. Relocations
 692   using this are only valid in executable or shared objects.
 693
 694 The following relocation types are supported:
 695
 696   .. table:: AMDGPU ELF Relocation Records
 697      :name: amdgpu-elf-relocation-records-table
 698
 699      ==========================  =====  ==========  ==============================
 700      Relocation Type             Value  Field       Calculation
 701      ==========================  =====  ==========  ==============================
 702      ``R_AMDGPU_NONE``           0      *none*      *none*
 703      ``R_AMDGPU_ABS32_LO``       1      ``word32``  (S + A) & 0xFFFFFFFF
 704      ``R_AMDGPU_ABS32_HI``       2      ``word32``  (S + A) >> 32
 705      ``R_AMDGPU_ABS64``          3      ``word64``  S + A
 706      ``R_AMDGPU_REL32``          4      ``word32``  S + A - P
 707      ``R_AMDGPU_REL64``          5      ``word64``  S + A - P
 708      ``R_AMDGPU_ABS32``          6      ``word32``  S + A
 709      ``R_AMDGPU_GOTPCREL``       7      ``word32``  G + GOT + A - P
 710      ``R_AMDGPU_GOTPCREL32_LO``  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 711      ``R_AMDGPU_GOTPCREL32_HI``  9      ``word32``  (G + GOT + A - P) >> 32
 712      ``R_AMDGPU_REL32_LO``       10     ``word32``  (S + A - P) & 0xFFFFFFFF
 713      ``R_AMDGPU_REL32_HI``       11     ``word32``  (S + A - P) >> 32
 714      *reserved*                  12
 715      ``R_AMDGPU_RELATIVE64``     13     ``word64``  B + A
 716      ==========================  =====  ==========  ==============================
 717
 718 .. _amdgpu-dwarf:
 719
 720 DWARF
 721 -----
 722
 723 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
 724 information that maps the code object executable code and data to the source
 725 language constructs. It can be used by tools such as debuggers and profilers.
 726
 727 Address Space Mapping
 728 ~~~~~~~~~~~~~~~~~~~~~
 729
 730 The following address space mapping is used:
 731
 732   .. table:: AMDGPU DWARF Address Space Mapping
 733      :name: amdgpu-dwarf-address-space-mapping-table
 734
 735      =================== =================
 736      DWARF Address Space Memory Space
 737      =================== =================
 738      1                   Private (Scratch)
 739      2                   Local (group/LDS)
 740      *omitted*           Global
 741      *omitted*           Constant
 742      *omitted*           Generic (Flat)
 743      *not supported*     Region (GDS)
 744      =================== =================
 745
 746 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
 747 used in the table.
 748
 749 An ``address_class`` attribute is generated on pointer type DIEs to specify the
 750 DWARF address space of the value of the pointer when it is in the *private* or
 751 *local* address space. Otherwise the attribute is omitted.
 752
 753 An ``XDEREF`` operation is generated in location list expressions for variables
 754 that are allocated in the *private* and *local* address space. Otherwise no
 755 ``XDREF`` is omitted.
 756
 757 Register Mapping
 758 ~~~~~~~~~~~~~~~~
 759
 760 *This section is WIP.*
 761
 762 .. TODO
 763    Define DWARF register enumeration.
 764
 765    If want to present a wavefront state then should expose vector registers as
 766    64 wide (rather than per work-item view that LLVM uses). Either as separate
 767    registers, or a 64x4 byte single register. In either case use a new LANE op
 768    (akin to XDREF) to select the current lane usage in a location
 769    expression. This would also allow scalar register spilling to vector register
 770    lanes to be expressed (currently no debug information is being generated for
 771    spilling). If choose a wide single register approach then use LANE in
 772    conjunction with PIECE operation to select the dword part of the register for
 773    the current lane. If the separate register approach then use LANE to select
 774    the register.
 775
 776 Source Text
 777 ~~~~~~~~~~~
 778
 779 *This section is WIP.*
 780
 781 .. TODO
 782    DWARF extension to include runtime generated source text.
 783
 784 .. _amdgpu-code-conventions:
 785
 786 Code Conventions
 787 ================
 788
 789 This section provides code conventions used for each supported target triple OS
 790 (see :ref:`amdgpu-target-triples`).
 791
 792 AMDHSA
 793 ------
 794
 795 This section provides code conventions used when the target triple OS is
 796 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 797
 798 .. _amdgpu-amdhsa-hsa-code-object-metadata:
 799
 800 Code Object Metadata
 801 ~~~~~~~~~~~~~~~~~~~~
 802
 803 The code object metadata specifies extensible metadata associated with the code
 804 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
 805 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
 806 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
 807 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
 808 information necessary to support the ROCM kernel queries. For example, the
 809 segment sizes needed in a dispatch packet. In addition, a high level language
 810 runtime may require other information to be included. For example, the AMD
 811 OpenCL runtime records kernel argument information.
 812
 813 The metadata is specified as a YAML formatted string (see [YAML]_ and
 814 :doc:`YamlIO`).
 815
 816 .. TODO
 817    Is the string null terminated? It probably should not if YAML allows it to
 818    contain null characters, otherwise it should be.
 819
 820 The metadata is represented as a single YAML document comprised of the mapping
 821 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 822 referenced tables.
 823
 824 For boolean values, the string values of ``false`` and ``true`` are used for
 825 false and true respectively.
 826
 827 Additional information can be added to the mappings. To avoid conflicts, any
 828 non-AMD key names should be prefixed by "*vendor-name*.".
 829
 830   .. table:: AMDHSA Code Object Metadata Mapping
 831      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 832
 833      ========== ============== ========= =======================================
 834      String Key Value Type     Required? Description
 835      ========== ============== ========= =======================================
 836      "Version"  sequence of    Required  - The first integer is the major
 837                 2 integers                 version. Currently 1.
 838                                          - The second integer is the minor
 839                                            version. Currently 0.
 840      "Printf"   sequence of              Each string is encoded information
 841                 strings                  about a printf function call. The
 842                                          encoded information is organized as
 843                                          fields separated by colon (':'):
 844
 845                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
 846
 847                                          where:
 848
 849                                          ``ID``
 850                                            A 32 bit integer as a unique id for
 851                                            each printf function call
 852
 853                                          ``N``
 854                                            A 32 bit integer equal to the number
 855                                            of arguments of printf function call
 856                                            minus 1
 857
 858                                          ``S[i]`` (where i = 0, 1, ... , N-1)
 859                                            32 bit integers for the size in bytes
 860                                            of the i-th FormatString argument of
 861                                            the printf function call
 862
 863                                          FormatString
 864                                            The format string passed to the
 865                                            printf function call.
 866      "Kernels"  sequence of    Required  Sequence of the mappings for each
 867                 mapping                  kernel in the code object. See
 868                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
 869                                          for the definition of the mapping.
 870      ========== ============== ========= =======================================
 871
 872 ..
 873
 874   .. table:: AMDHSA Code Object Kernel Metadata Mapping
 875      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
 876
 877      ================= ============== ========= ================================
 878      String Key        Value Type     Required? Description
 879      ================= ============== ========= ================================
 880      "Name"            string         Required  Source name of the kernel.
 881      "SymbolName"      string         Required  Name of the kernel
 882                                                 descriptor ELF symbol.
 883      "Language"        string                   Source language of the kernel.
 884                                                 Values include:
 885
 886                                                 - "OpenCL C"
 887                                                 - "OpenCL C++"
 888                                                 - "HCC"
 889                                                 - "OpenMP"
 890
 891      "LanguageVersion" sequence of              - The first integer is the major
 892                        2 integers                 version.
 893                                                 - The second integer is the
 894                                                   minor version.
 895      "Attrs"           mapping                  Mapping of kernel attributes.
 896                                                 See
 897                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
 898                                                 for the mapping definition.
 899      "Args"            sequence of              Sequence of mappings of the
 900                        mapping                  kernel arguments. See
 901                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
 902                                                 for the definition of the mapping.
 903      "CodeProps"       mapping                  Mapping of properties related to
 904                                                 the kernel code. See
 905                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
 906                                                 for the mapping definition.
 907      "DebugProps"      mapping                  Mapping of properties related to
 908                                                 the kernel debugging. See
 909                                                 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
 910                                                 for the mapping definition.
 911      ================= ============== ========= ================================
 912
 913 ..
 914
 915   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
 916      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
 917
 918      =================== ============== ========= ==============================
 919      String Key          Value Type     Required? Description
 920      =================== ============== ========= ==============================
 921      "ReqdWorkGroupSize" sequence of              The dispatch work-group size
 922                          3 integers               X, Y, Z must correspond to the
 923                                                   specified values.
 924
 925                                                   Corresponds to the OpenCL
 926                                                   ``reqd_work_group_size``
 927                                                   attribute.
 928      "WorkGroupSizeHint" sequence of              The dispatch work-group size
 929                          3 integers               X, Y, Z is likely to be the
 930                                                   specified values.
 931
 932                                                   Corresponds to the OpenCL
 933                                                   ``work_group_size_hint``
 934                                                   attribute.
 935      "VecTypeHint"       string                   The name of a scalar or vector
 936                                                   type.
 937
 938                                                   Corresponds to the OpenCL
 939                                                   ``vec_type_hint`` attribute.
 940
 941      "RuntimeHandle"     string                   The external symbol name
 942                                                   associated with a kernel.
 943                                                   OpenCL runtime allocates a
 944                                                   global buffer for the symbol
 945                                                   and saves the kernel's address
 946                                                   to it, which is used for
 947                                                   device side enqueueing. Only
 948                                                   available for device side
 949                                                   enqueued kernels.
 950      =================== ============== ========= ==============================
 951
 952 ..
 953
 954   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
 955      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
 956
 957      ================= ============== ========= ================================
 958      String Key        Value Type     Required? Description
 959      ================= ============== ========= ================================
 960      "Name"            string                   Kernel argument name.
 961      "TypeName"        string                   Kernel argument type name.
 962      "Size"            integer        Required  Kernel argument size in bytes.
 963      "Align"           integer        Required  Kernel argument alignment in
 964                                                 bytes. Must be a power of two.
 965      "ValueKind"       string         Required  Kernel argument kind that
 966                                                 specifies how to set up the
 967                                                 corresponding argument.
 968                                                 Values include:
 969
 970                                                 "ByValue"
 971                                                   The argument is copied
 972                                                   directly into the kernarg.
 973
 974                                                 "GlobalBuffer"
 975                                                   A global address space pointer
 976                                                   to the buffer data is passed
 977                                                   in the kernarg.
 978
 979                                                 "DynamicSharedPointer"
 980                                                   A group address space pointer
 981                                                   to dynamically allocated LDS
 982                                                   is passed in the kernarg.
 983
 984                                                 "Sampler"
 985                                                   A global address space
 986                                                   pointer to a S# is passed in
 987                                                   the kernarg.
 988
 989                                                 "Image"
 990                                                   A global address space
 991                                                   pointer to a T# is passed in
 992                                                   the kernarg.
 993
 994                                                 "Pipe"
 995                                                   A global address space pointer
 996                                                   to an OpenCL pipe is passed in
 997                                                   the kernarg.
 998
 999                                                 "Queue"
1000                                                   A global address space pointer
1001                                                   to an OpenCL device enqueue
1002                                                   queue is passed in the
1003                                                   kernarg.
1004
1005                                                 "HiddenGlobalOffsetX"
1006                                                   The OpenCL grid dispatch
1007                                                   global offset for the X
1008                                                   dimension is passed in the
1009                                                   kernarg.
1010
1011                                                 "HiddenGlobalOffsetY"
1012                                                   The OpenCL grid dispatch
1013                                                   global offset for the Y
1014                                                   dimension is passed in the
1015                                                   kernarg.
1016
1017                                                 "HiddenGlobalOffsetZ"
1018                                                   The OpenCL grid dispatch
1019                                                   global offset for the Z
1020                                                   dimension is passed in the
1021                                                   kernarg.
1022
1023                                                 "HiddenNone"
1024                                                   An argument that is not used
1025                                                   by the kernel. Space needs to
1026                                                   be left for it, but it does
1027                                                   not need to be set up.
1028
1029                                                 "HiddenPrintfBuffer"
1030                                                   A global address space pointer
1031                                                   to the runtime printf buffer
1032                                                   is passed in kernarg.
1033
1034                                                 "HiddenDefaultQueue"
1035                                                   A global address space pointer
1036                                                   to the OpenCL device enqueue
1037                                                   queue that should be used by
1038                                                   the kernel by default is
1039                                                   passed in the kernarg.
1040
1041                                                 "HiddenCompletionAction"
1042                                                   A global address space pointer
1043                                                   to help link enqueued kernels into
1044                                                   the ancestor tree for determining
1045                                                   when the parent kernel has finished.
1046
1047      "ValueType"       string         Required  Kernel argument value type. Only
1048                                                 present if "ValueKind" is
1049                                                 "ByValue". For vector data
1050                                                 types, the value is for the
1051                                                 element type. Values include:
1052
1053                                                 - "Struct"
1054                                                 - "I8"
1055                                                 - "U8"
1056                                                 - "I16"
1057                                                 - "U16"
1058                                                 - "F16"
1059                                                 - "I32"
1060                                                 - "U32"
1061                                                 - "F32"
1062                                                 - "I64"
1063                                                 - "U64"
1064                                                 - "F64"
1065
1066                                                 .. TODO
1067                                                    How can it be determined if a
1068                                                    vector type, and what size
1069                                                    vector?
1070      "PointeeAlign"    integer                  Alignment in bytes of pointee
1071                                                 type for pointer type kernel
1072                                                 argument. Must be a power
1073                                                 of 2. Only present if
1074                                                 "ValueKind" is
1075                                                 "DynamicSharedPointer".
1076      "AddrSpaceQual"   string                   Kernel argument address space
1077                                                 qualifier. Only present if
1078                                                 "ValueKind" is "GlobalBuffer" or
1079                                                 "DynamicSharedPointer". Values
1080                                                 are:
1081
1082                                                 - "Private"
1083                                                 - "Global"
1084                                                 - "Constant"
1085                                                 - "Local"
1086                                                 - "Generic"
1087                                                 - "Region"
1088
1089                                                 .. TODO
1090                                                    Is GlobalBuffer only Global
1091                                                    or Constant? Is
1092                                                    DynamicSharedPointer always
1093                                                    Local? Can HCC allow Generic?
1094                                                    How can Private or Region
1095                                                    ever happen?
1096      "AccQual"         string                   Kernel argument access
1097                                                 qualifier. Only present if
1098                                                 "ValueKind" is "Image" or
1099                                                 "Pipe". Values
1100                                                 are:
1101
1102                                                 - "ReadOnly"
1103                                                 - "WriteOnly"
1104                                                 - "ReadWrite"
1105
1106                                                 .. TODO
1107                                                    Does this apply to
1108                                                    GlobalBuffer?
1109      "ActualAccQual"   string                   The actual memory accesses
1110                                                 performed by the kernel on the
1111                                                 kernel argument. Only present if
1112                                                 "ValueKind" is "GlobalBuffer",
1113                                                 "Image", or "Pipe". This may be
1114                                                 more restrictive than indicated
1115                                                 by "AccQual" to reflect what the
1116                                                 kernel actual does. If not
1117                                                 present then the runtime must
1118                                                 assume what is implied by
1119                                                 "AccQual" and "IsConst". Values
1120                                                 are:
1121
1122                                                 - "ReadOnly"
1123                                                 - "WriteOnly"
1124                                                 - "ReadWrite"
1125
1126      "IsConst"         boolean                  Indicates if the kernel argument
1127                                                 is const qualified. Only present
1128                                                 if "ValueKind" is
1129                                                 "GlobalBuffer".
1130
1131      "IsRestrict"      boolean                  Indicates if the kernel argument
1132                                                 is restrict qualified. Only
1133                                                 present if "ValueKind" is
1134                                                 "GlobalBuffer".
1135
1136      "IsVolatile"      boolean                  Indicates if the kernel argument
1137                                                 is volatile qualified. Only
1138                                                 present if "ValueKind" is
1139                                                 "GlobalBuffer".
1140
1141      "IsPipe"          boolean                  Indicates if the kernel argument
1142                                                 is pipe qualified. Only present
1143                                                 if "ValueKind" is "Pipe".
1144
1145                                                 .. TODO
1146                                                    Can GlobalBuffer be pipe
1147                                                    qualified?
1148      ================= ============== ========= ================================
1149
1150 ..
1151
1152   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1153      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1154
1155      ============================ ============== ========= =====================
1156      String Key                   Value Type     Required? Description
1157      ============================ ============== ========= =====================
1158      "KernargSegmentSize"         integer        Required  The size in bytes of
1159                                                            the kernarg segment
1160                                                            that holds the values
1161                                                            of the arguments to
1162                                                            the kernel.
1163      "GroupSegmentFixedSize"      integer        Required  The amount of group
1164                                                            segment memory
1165                                                            required by a
1166                                                            work-group in
1167                                                            bytes. This does not
1168                                                            include any
1169                                                            dynamically allocated
1170                                                            group segment memory
1171                                                            that may be added
1172                                                            when the kernel is
1173                                                            dispatched.
1174      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1175                                                            private address space
1176                                                            memory required for a
1177                                                            work-item in
1178                                                            bytes. If
1179                                                            IsDynamicCallstack
1180                                                            is 1 then additional
1181                                                            space must be added
1182                                                            to this value for the
1183                                                            call stack.
1184      "KernargSegmentAlign"        integer        Required  The maximum byte
1185                                                            alignment of
1186                                                            arguments in the
1187                                                            kernarg segment. Must
1188                                                            be a power of 2.
1189      "WavefrontSize"              integer        Required  Wavefront size. Must
1190                                                            be a power of 2.
1191      "NumSGPRs"                   integer                  Number of scalar
1192                                                            registers used by a
1193                                                            wavefront for
1194                                                            GFX6-GFX9. This
1195                                                            includes the special
1196                                                            SGPRs for VCC, Flat
1197                                                            Scratch (GFX7-GFX9)
1198                                                            and XNACK (for
1199                                                            GFX8-GFX9). It does
1200                                                            not include the 16
1201                                                            SGPR added if a trap
1202                                                            handler is
1203                                                            enabled. It is not
1204                                                            rounded up to the
1205                                                            allocation
1206                                                            granularity.
1207      "NumVGPRs"                   integer                  Number of vector
1208                                                            registers used by
1209                                                            each work-item for
1210                                                            GFX6-GFX9
1211      "MaxFlatWorkGroupSize"       integer                  Maximum flat
1212                                                            work-group size
1213                                                            supported by the
1214                                                            kernel in work-items.
1215      "IsDynamicCallStack"         boolean                  Indicates if the
1216                                                            generated machine
1217                                                            code is using a
1218                                                            dynamically sized
1219                                                            call stack.
1220      "IsXNACKEnabled"             boolean                  Indicates if the
1221                                                            generated machine
1222                                                            code is capable of
1223                                                            supporting XNACK.
1224      ============================ ============== ========= =====================
1225
1226 ..
1227
1228   .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1229      :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1230
1231      =================================== ============== ========= ==============
1232      String Key                          Value Type     Required? Description
1233      =================================== ============== ========= ==============
1234      "DebuggerABIVersion"                sequence of
1235                                          2 integers
1236      "ReservedNumVGPRs"                  integer
1237      "ReservedFirstVGPR"                 integer
1238      "PrivateSegmentBufferSGPR"          integer
1239      "WavefrontPrivateSegmentOffsetSGPR" integer
1240      =================================== ============== ========= ==============
1241
1242 .. TODO
1243    Plan to remove the debug properties metadata.
1244
1245 Kernel Dispatch
1246 ~~~~~~~~~~~~~~~
1247
1248 The HSA architected queuing language (AQL) defines a user space memory interface
1249 that can be used to control the dispatch of kernels, in an agent independent
1250 way. An agent can have zero or more AQL queues created for it using the ROCm
1251 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1252 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1253 mechanics and packet layouts.
1254
1255 The packet processor of a kernel agent is responsible for detecting and
1256 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1257 packet processor is implemented by the hardware command processor (CP),
1258 asynchronous dispatch controller (ADC) and shader processor input controller
1259 (SPI).
1260
1261 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1262 mode driver to initialize and register the AQL queue with CP.
1263
1264 To dispatch a kernel the following actions are performed. This can occur in the
1265 CPU host program, or from an HSA kernel executing on a GPU.
1266
1267 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1268    executed is obtained.
1269 2. A pointer to the kernel descriptor (see
1270    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1271    obtained. It must be for a kernel that is contained in a code object that that
1272    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1273    associated.
1274 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1275    for a memory region with the kernarg property for the kernel agent that will
1276    execute the kernel. It must be at least 16 byte aligned.
1277 4. Kernel argument values are assigned to the kernel argument memory
1278    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1279    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1280    memory in the same way constant memory is accessed. (Note that the HSA
1281    specification allows an implementation to copy the kernel argument contents to
1282    another location that is accessed by the kernel.)
1283 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1284    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1285    packet. The packet must be set up, and the final write must use an atomic
1286    store release to set the packet kind to ensure the packet contents are
1287    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1288    notify the kernel agent that the AQL queue has been updated. These rules, and
1289    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1290    System Architecture Specification* [HSA]_.
1291 6. A kernel dispatch packet includes information about the actual dispatch,
1292    such as grid and work-group size, together with information from the code
1293    object about the kernel, such as segment sizes. The ROCm runtime queries on
1294    the kernel symbol can be used to obtain the code object values which are
1295    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1296 7. CP executes micro-code and is responsible for detecting and setting up the
1297    GPU to execute the wavefronts of a kernel dispatch.
1298 8. CP ensures that when the a wavefront starts executing the kernel machine
1299    code, the scalar general purpose registers (SGPR) and vector general purpose
1300    registers (VGPR) are set up as required by the machine code. The required
1301    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1302    register state is defined in
1303    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1304 9. The prolog of the kernel machine code (see
1305    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1306    before continuing executing the machine code that corresponds to the kernel.
1307 10. When the kernel dispatch has completed execution, CP signals the completion
1308     signal specified in the kernel dispatch packet if not 0.
1309
1310 .. _amdgpu-amdhsa-memory-spaces:
1311
1312 Memory Spaces
1313 ~~~~~~~~~~~~~
1314
1315 The memory space properties are:
1316
1317   .. table:: AMDHSA Memory Spaces
1318      :name: amdgpu-amdhsa-memory-spaces-table
1319
1320      ================= =========== ======== ======= ==================
1321      Memory Space Name HSA Segment Hardware Address NULL Value
1322                        Name        Name     Size
1323      ================= =========== ======== ======= ==================
1324      Private           private     scratch  32      0x00000000
1325      Local             group       LDS      32      0xFFFFFFFF
1326      Global            global      global   64      0x0000000000000000
1327      Constant          constant    *same as 64      0x0000000000000000
1328                                    global*
1329      Generic           flat        flat     64      0x0000000000000000
1330      Region            N/A         GDS      32      *not implemented
1331                                                     for AMDHSA*
1332      ================= =========== ======== ======= ==================
1333
1334 The global and constant memory spaces both use global virtual addresses, which
1335 are the same virtual address space used by the CPU. However, some virtual
1336 addresses may only be accessible to the CPU, some only accessible by the GPU,
1337 and some by both.
1338
1339 Using the constant memory space indicates that the data will not change during
1340 the execution of the kernel. This allows scalar read instructions to be
1341 used. The vector and scalar L1 caches are invalidated of volatile data before
1342 each kernel dispatch execution to allow constant memory to change values between
1343 kernel dispatches.
1344
1345 The local memory space uses the hardware Local Data Store (LDS) which is
1346 automatically allocated when the hardware creates work-groups of wavefronts, and
1347 freed when all the wavefronts of a work-group have terminated. The data store
1348 (DS) instructions can be used to access it.
1349
1350 The private memory space uses the hardware scratch memory support. If the kernel
1351 uses scratch, then the hardware allocates memory that is accessed using
1352 wavefront lane dword (4 byte) interleaving. The mapping used from private
1353 address to physical address is:
1354
1355   ``wavefront-scratch-base +
1356   (private-address * wavefront-size * 4) +
1357   (wavefront-lane-id * 4)``
1358
1359 There are different ways that the wavefront scratch base address is determined
1360 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1361 memory can be accessed in an interleaved manner using buffer instruction with
1362 the scratch buffer descriptor and per wave scratch offset, by the scratch
1363 instructions, or by flat instructions. If each lane of a wavefront accesses the
1364 same private address, the interleaving results in adjacent dwords being accessed
1365 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1366 supported except by flat and scratch instructions in GFX9.
1367
1368 The generic address space uses the hardware flat address support available in
1369 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1370 local appertures), that are outside the range of addressible global memory, to
1371 map from a flat address to a private or local address.
1372
1373 FLAT instructions can take a flat address and access global, private (scratch)
1374 and group (LDS) memory depending in if the address is within one of the
1375 apperture ranges. Flat access to scratch requires hardware aperture setup and
1376 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1377 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1378 (see :ref:`amdgpu-amdhsa-m0`).
1379
1380 To convert between a segment address and a flat address the base address of the
1381 appertures address can be used. For GFX7-GFX8 these are available in the
1382 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1383 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1384 GFX9 the appature base addresses are directly available as inline constant
1385 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1386 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1387 which makes it easier to convert from flat to segment or segment to flat.
1388
1389 Image and Samplers
1390 ~~~~~~~~~~~~~~~~~~
1391
1392 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1393 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1394 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1395 enumeration values for the queries that are not trivially deducible from the S#
1396 representation.
1397
1398 HSA Signals
1399 ~~~~~~~~~~~
1400
1401 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1402 structure allocated in memory accessible from both the CPU and GPU. The
1403 structure is defined by the ROCm runtime and subject to change between releases
1404 (see [AMD-ROCm-github]_).
1405
1406 .. _amdgpu-amdhsa-hsa-aql-queue:
1407
1408 HSA AQL Queue
1409 ~~~~~~~~~~~~~
1410
1411 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1412 between releases (see [AMD-ROCm-github]_). For some processors it contains
1413 fields needed to implement certain language features such as the flat address
1414 aperture bases. It also contains fields used by CP such as managing the
1415 allocation of scratch memory.
1416
1417 .. _amdgpu-amdhsa-kernel-descriptor:
1418
1419 Kernel Descriptor
1420 ~~~~~~~~~~~~~~~~~
1421
1422 A kernel descriptor consists of the information needed by CP to initiate the
1423 execution of a kernel, including the entry point address of the machine code
1424 that implements the kernel.
1425
1426 Kernel Descriptor for GFX6-GFX9
1427 +++++++++++++++++++++++++++++++
1428
1429 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1430
1431   .. table:: Kernel Descriptor for GFX6-GFX9
1432      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1433
1434      ======= ======= =============================== ============================
1435      Bits    Size    Field Name                      Description
1436      ======= ======= =============================== ============================
1437      31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
1438                                                      address space memory
1439                                                      required for a work-group
1440                                                      in bytes. This does not
1441                                                      include any dynamically
1442                                                      allocated local address
1443                                                      space memory that may be
1444                                                      added when the kernel is
1445                                                      dispatched.
1446      63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
1447                                                      private address space
1448                                                      memory required for a
1449                                                      work-item in bytes. If
1450                                                      is_dynamic_callstack is 1
1451                                                      then additional space must
1452                                                      be added to this value for
1453                                                      the call stack.
1454      95:64   4 bytes MaxFlatWorkGroupSize            Maximum flat work-group
1455                                                      size supported by the
1456                                                      kernel in work-items.
1457      96      1 bit   IsDynamicCallStack              Indicates if the generated
1458                                                      machine code is using a
1459                                                      dynamically sized call
1460                                                      stack.
1461      97      1 bit   IsXNACKEnabled                  Indicates if the generated
1462                                                      machine code is capable of
1463                                                      suppoting XNACK.
1464      127:98  30 bits                                 Reserved, must be 0.
1465      191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
1466                                                      negative) from base
1467                                                      address of kernel
1468                                                      descriptor to kernel's
1469                                                      entry point instruction
1470                                                      which must be 256 byte
1471                                                      aligned.
1472      383:192 24                                      Reserved, must be 0.
1473              bytes
1474      415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
1475                                                      program settings used by
1476                                                      CP to set up
1477                                                      ``COMPUTE_PGM_RSRC1``
1478                                                      configuration
1479                                                      register. See
1480                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1481      447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
1482                                                      program settings used by
1483                                                      CP to set up
1484                                                      ``COMPUTE_PGM_RSRC2``
1485                                                      configuration
1486                                                      register. See
1487                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1488      448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
1489                                                      SGPR user data registers
1490                                                      (see
1491                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1492
1493                                                      The total number of SGPR
1494                                                      user data registers
1495                                                      requested must not exceed
1496                                                      16 and match value in
1497                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1498                                                      Any requests beyond 16
1499                                                      will be ignored.
1500      449     1 bit   EnableSGPRDispatchPtr           *see above*
1501      450     1 bit   EnableSGPRQueuePtr              *see above*
1502      451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
1503      452     1 bit   EnableSGPRDispatchID            *see above*
1504      453     1 bit   EnableSGPRFlatScratchInit       *see above*
1505      454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
1506      455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
1507                                                      should always be 0.
1508      456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
1509                                                      should always be 0.
1510      457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
1511                                                      should always be 0.
1512      463:458 6 bits                                  Reserved, must be 0.
1513      511:464 6                                       Reserved, must be 0.
1514              bytes
1515      512     **Total size 64 bytes.**
1516      ======= ====================================================================
1517
1518 ..
1519
1520   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1521      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1522
1523      ======= ======= =============================== ===========================================================================
1524      Bits    Size    Field Name                      Description
1525      ======= ======= =============================== ===========================================================================
1526      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
1527                                                      used by each work-item,
1528                                                      granularity is device
1529                                                      specific:
1530
1531                                                      GFX6-9
1532                                                        - max_vgpr 1..256
1533                                                        - roundup((max_vgpg + 1)
1534                                                          / 4) - 1
1535
1536                                                      Used by CP to set up
1537                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1538      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
1539                                                      used by a wavefront,
1540                                                      granularity is device
1541                                                      specific:
1542
1543                                                      GFX6-8
1544                                                        - max_sgpr 1..112
1545                                                        - roundup((max_sgpg + 1)
1546                                                          / 8) - 1
1547                                                      GFX9
1548                                                        - max_sgpr 1..112
1549                                                        - roundup((max_sgpg + 1)
1550                                                          / 16) - 1
1551
1552                                                      Includes the special SGPRs
1553                                                      for VCC, Flat Scratch (for
1554                                                      GFX7 onwards) and XNACK
1555                                                      (for GFX8 onwards). It does
1556                                                      not include the 16 SGPR
1557                                                      added if a trap handler is
1558                                                      enabled.
1559
1560                                                      Used by CP to set up
1561                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1562      11:10   2 bits  PRIORITY                        Must be 0.
1563
1564                                                      Start executing wavefront
1565                                                      at the specified priority.
1566
1567                                                      CP is responsible for
1568                                                      filling in
1569                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1570      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1571                                                      with specified rounding
1572                                                      mode for single (32
1573                                                      bit) floating point
1574                                                      precision floating point
1575                                                      operations.
1576
1577                                                      Floating point rounding
1578                                                      mode values are defined in
1579                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1580
1581                                                      Used by CP to set up
1582                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1583      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1584                                                      with specified rounding
1585                                                      denorm mode for half/double (16
1586                                                      and 64 bit) floating point
1587                                                      precision floating point
1588                                                      operations.
1589
1590                                                      Floating point rounding
1591                                                      mode values are defined in
1592                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1593
1594                                                      Used by CP to set up
1595                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1596      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1597                                                      with specified denorm mode
1598                                                      for single (32
1599                                                      bit)  floating point
1600                                                      precision floating point
1601                                                      operations.
1602
1603                                                      Floating point denorm mode
1604                                                      values are defined in
1605                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1606
1607                                                      Used by CP to set up
1608                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1609      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1610                                                      with specified denorm mode
1611                                                      for half/double (16
1612                                                      and 64 bit) floating point
1613                                                      precision floating point
1614                                                      operations.
1615
1616                                                      Floating point denorm mode
1617                                                      values are defined in
1618                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1619
1620                                                      Used by CP to set up
1621                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1622      20      1 bit   PRIV                            Must be 0.
1623
1624                                                      Start executing wavefront
1625                                                      in privilege trap handler
1626                                                      mode.
1627
1628                                                      CP is responsible for
1629                                                      filling in
1630                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1631      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1632                                                      with DX10 clamp mode
1633                                                      enabled. Used by the vector
1634                                                      ALU to force DX10 style
1635                                                      treatment of NaN's (when
1636                                                      set, clamp NaN to zero,
1637                                                      otherwise pass NaN
1638                                                      through).
1639
1640                                                      Used by CP to set up
1641                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1642      22      1 bit   DEBUG_MODE                      Must be 0.
1643
1644                                                      Start executing wavefront
1645                                                      in single step mode.
1646
1647                                                      CP is responsible for
1648                                                      filling in
1649                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1650      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1651                                                      with IEEE mode
1652                                                      enabled. Floating point
1653                                                      opcodes that support
1654                                                      exception flag gathering
1655                                                      will quiet and propagate
1656                                                      signaling-NaN inputs per
1657                                                      IEEE 754-2008. Min_dx10 and
1658                                                      max_dx10 become IEEE
1659                                                      754-2008 compliant due to
1660                                                      signaling-NaN propagation
1661                                                      and quieting.
1662
1663                                                      Used by CP to set up
1664                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1665      24      1 bit   BULKY                           Must be 0.
1666
1667                                                      Only one work-group allowed
1668                                                      to execute on a compute
1669                                                      unit.
1670
1671                                                      CP is responsible for
1672                                                      filling in
1673                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1674      25      1 bit   CDBG_USER                       Must be 0.
1675
1676                                                      Flag that can be used to
1677                                                      control debugging code.
1678
1679                                                      CP is responsible for
1680                                                      filling in
1681                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1682      26      1 bit   FP16_OVFL                       GFX6-8
1683                                                        Reserved, must be 0.
1684                                                      GFX9
1685                                                        Wavefront starts execution
1686                                                        with specified fp16 overflow
1687                                                        mode.
1688
1689                                                        - If 0, fp16 overflow generates
1690                                                          +/-INF values.
1691                                                        - If 1, fp16 overflow that is the
1692                                                          result of an +/-INF input value
1693                                                          or divide by 0 produces a +/-INF,
1694                                                          otherwise clamps computed
1695                                                          overflow to +/-MAX_FP16 as
1696                                                          appropriate.
1697
1698                                                        Used by CP to set up
1699                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1700      31:27   5 bits                                  Reserved, must be 0.
1701      32      **Total size 4 bytes**
1702      ======= ===================================================================================================================
1703
1704 ..
1705
1706   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1707      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1708
1709      ======= ======= =============================== ===========================================================================
1710      Bits    Size    Field Name                      Description
1711      ======= ======= =============================== ===========================================================================
1712      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1713                      _WAVE_OFFSET                    SGPR wave scratch offset
1714                                                      system register (see
1715                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1716
1717                                                      Used by CP to set up
1718                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1719      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1720                                                      user data registers
1721                                                      requested. This number must
1722                                                      match the number of user
1723                                                      data registers enabled.
1724
1725                                                      Used by CP to set up
1726                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1727      6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
1728                                                      TRAP instruction which
1729                                                      requires a trap handler to
1730                                                      be enabled.
1731
1732                                                      CP sets
1733                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1734                                                      if the runtime has
1735                                                      installed a trap handler
1736                                                      regardless of the setting
1737                                                      of this field.
1738      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1739                                                      system SGPR register for
1740                                                      the work-group id in the X
1741                                                      dimension (see
1742                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1743
1744                                                      Used by CP to set up
1745                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1746      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1747                                                      system SGPR register for
1748                                                      the work-group id in the Y
1749                                                      dimension (see
1750                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1751
1752                                                      Used by CP to set up
1753                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1754      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1755                                                      system SGPR register for
1756                                                      the work-group id in the Z
1757                                                      dimension (see
1758                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1759
1760                                                      Used by CP to set up
1761                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1762      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1763                                                      system SGPR register for
1764                                                      work-group information (see
1765                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1766
1767                                                      Used by CP to set up
1768                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1769      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1770                                                      VGPR system registers used
1771                                                      for the work-item ID.
1772                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1773                                                      defines the values.
1774
1775                                                      Used by CP to set up
1776                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1777      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1778
1779                                                      Wavefront starts execution
1780                                                      with address watch
1781                                                      exceptions enabled which
1782                                                      are generated when L1 has
1783                                                      witnessed a thread access
1784                                                      an *address of
1785                                                      interest*.
1786
1787                                                      CP is responsible for
1788                                                      filling in the address
1789                                                      watch bit in
1790                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1791                                                      according to what the
1792                                                      runtime requests.
1793      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1794
1795                                                      Wavefront starts execution
1796                                                      with memory violation
1797                                                      exceptions exceptions
1798                                                      enabled which are generated
1799                                                      when a memory violation has
1800                                                      occurred for this wave from
1801                                                      L1 or LDS
1802                                                      (write-to-read-only-memory,
1803                                                      mis-aligned atomic, LDS
1804                                                      address out of range,
1805                                                      illegal address, etc.).
1806
1807                                                      CP sets the memory
1808                                                      violation bit in
1809                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1810                                                      according to what the
1811                                                      runtime requests.
1812      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
1813
1814                                                      CP uses the rounded value
1815                                                      from the dispatch packet,
1816                                                      not this value, as the
1817                                                      dispatch may contain
1818                                                      dynamically allocated group
1819                                                      segment memory. CP writes
1820                                                      directly to
1821                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1822
1823                                                      Amount of group segment
1824                                                      (LDS) to allocate for each
1825                                                      work-group. Granularity is
1826                                                      device specific:
1827
1828                                                      GFX6:
1829                                                        roundup(lds-size / (64 * 4))
1830                                                      GFX7-GFX9:
1831                                                        roundup(lds-size / (128 * 4))
1832
1833      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
1834                      _INVALID_OPERATION              with specified exceptions
1835                                                      enabled.
1836
1837                                                      Used by CP to set up
1838                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1839                                                      (set from bits 0..6).
1840
1841                                                      IEEE 754 FP Invalid
1842                                                      Operation
1843      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal one or more
1844                      _SOURCE                         input operands is a
1845                                                      denormal number
1846      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
1847                      _DIVISION_BY_ZERO               Zero
1848      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP FP Overflow
1849                      _OVERFLOW
1850      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
1851                      _UNDERFLOW
1852      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
1853                      _INEXACT
1854      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
1855                      _ZERO                           (rcp_iflag_f32 instruction
1856                                                      only)
1857      31      1 bit                                   Reserved, must be 0.
1858      32      **Total size 4 bytes.**
1859      ======= ===================================================================================================================
1860
1861 ..
1862
1863   .. table:: Floating Point Rounding Mode Enumeration Values
1864      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1865
1866      ====================================== ===== ==============================
1867      Enumeration Name                       Value Description
1868      ====================================== ===== ==============================
1869      AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
1870      AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
1871      AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
1872      AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
1873      ====================================== ===== ==============================
1874
1875 ..
1876
1877   .. table:: Floating Point Denorm Mode Enumeration Values
1878      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1879
1880      ====================================== ===== ==============================
1881      Enumeration Name                       Value Description
1882      ====================================== ===== ==============================
1883      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
1884                                                   Denorms
1885      AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
1886      AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
1887      AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
1888      ====================================== ===== ==============================
1889
1890 ..
1891
1892   .. table:: System VGPR Work-Item ID Enumeration Values
1893      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1894
1895      ======================================== ===== ============================
1896      Enumeration Name                         Value Description
1897      ======================================== ===== ============================
1898      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
1899                                                     ID.
1900      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1901                                                     dimensions ID.
1902      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1903                                                     dimensions ID.
1904      AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1905      ======================================== ===== ============================
1906
1907 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1908
1909 Initial Kernel Execution State
1910 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1911
1912 This section defines the register state that will be set up by the packet
1913 processor prior to the start of execution of every wavefront. This is limited by
1914 the constraints of the hardware controllers of CP/ADC/SPI.
1915
1916 The order of the SGPR registers is defined, but the compiler can specify which
1917 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1918 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1919 for enabled registers are dense starting at SGPR0: the first enabled register is
1920 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1921 an SGPR number.
1922
1923 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1924 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1925 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1926 initialized. These are then immediately followed by the System SGPRs that are
1927 set up by ADC/SPI and can have different values for each wave of the grid
1928 dispatch.
1929
1930 SGPR register initial state is defined in
1931 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1932
1933   .. table:: SGPR Register Set Up Order
1934      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1935
1936      ========== ========================== ====== ==============================
1937      SGPR Order Name                       Number Description
1938                 (kernel descriptor enable  of
1939                 field)                     SGPRs
1940      ========== ========================== ====== ==============================
1941      First      Private Segment Buffer     4      V# that can be used, together
1942                 (enable_sgpr_private              with Scratch Wave Offset as an
1943                 _segment_buffer)                  offset, to access the private
1944                                                   memory space using a segment
1945                                                   address.
1946
1947                                                   CP uses the value provided by
1948                                                   the runtime.
1949      then       Dispatch Ptr               2      64 bit address of AQL dispatch
1950                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
1951                                                   actually executing.
1952      then       Queue Ptr                  2      64 bit address of amd_queue_t
1953                 (enable_sgpr_queue_ptr)           object for AQL queue on which
1954                                                   the dispatch packet was
1955                                                   queued.
1956      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
1957                 (enable_sgpr_kernarg              segment. This is directly
1958                 _segment_ptr)                     copied from the
1959                                                   kernarg_address in the kernel
1960                                                   dispatch packet.
1961
1962                                                   Having CP load it once avoids
1963                                                   loading it at the beginning of
1964                                                   every wavefront.
1965      then       Dispatch Id                2      64 bit Dispatch ID of the
1966                 (enable_sgpr_dispatch_id)         dispatch packet being
1967                                                   executed.
1968      then       Flat Scratch Init          2      This is 2 SGPRs:
1969                 (enable_sgpr_flat_scratch
1970                 _init)                            GFX6
1971                                                     Not supported.
1972                                                   GFX7-GFX8
1973                                                     The first SGPR is a 32 bit
1974                                                     byte offset from
1975                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1976                                                     to per SPI base of memory
1977                                                     for scratch for the queue
1978                                                     executing the kernel
1979                                                     dispatch. CP obtains this
1980                                                     from the runtime. (The
1981                                                     Scratch Segment Buffer base
1982                                                     address is
1983                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1984                                                     plus this offset.) The value
1985                                                     of Scratch Wave Offset must
1986                                                     be added to this offset by
1987                                                     the kernel machine code,
1988                                                     right shifted by 8, and
1989                                                     moved to the FLAT_SCRATCH_HI
1990                                                     SGPR register.
1991                                                     FLAT_SCRATCH_HI corresponds
1992                                                     to SGPRn-4 on GFX7, and
1993                                                     SGPRn-6 on GFX8 (where SGPRn
1994                                                     is the highest numbered SGPR
1995                                                     allocated to the wave).
1996                                                     FLAT_SCRATCH_HI is
1997                                                     multiplied by 256 (as it is
1998                                                     in units of 256 bytes) and
1999                                                     added to
2000                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2001                                                     to calculate the per wave
2002                                                     FLAT SCRATCH BASE in flat
2003                                                     memory instructions that
2004                                                     access the scratch
2005                                                     apperture.
2006
2007                                                     The second SGPR is 32 bit
2008                                                     byte size of a single
2009                                                     work-item's scratch memory
2010                                                     usage. CP obtains this from
2011                                                     the runtime, and it is
2012                                                     always a multiple of DWORD.
2013                                                     CP checks that the value in
2014                                                     the kernel dispatch packet
2015                                                     Private Segment Byte Size is
2016                                                     not larger, and requests the
2017                                                     runtime to increase the
2018                                                     queue's scratch size if
2019                                                     necessary. The kernel code
2020                                                     must move it to
2021                                                     FLAT_SCRATCH_LO which is
2022                                                     SGPRn-3 on GFX7 and SGPRn-5
2023                                                     on GFX8. FLAT_SCRATCH_LO is
2024                                                     used as the FLAT SCRATCH
2025                                                     SIZE in flat memory
2026                                                     instructions. Having CP load
2027                                                     it once avoids loading it at
2028                                                     the beginning of every
2029                                                     wavefront. GFX9 This is the
2030                                                     64 bit base address of the
2031                                                     per SPI scratch backing
2032                                                     memory managed by SPI for
2033                                                     the queue executing the
2034                                                     kernel dispatch. CP obtains
2035                                                     this from the runtime (and
2036                                                     divides it if there are
2037                                                     multiple Shader Arrays each
2038                                                     with its own SPI). The value
2039                                                     of Scratch Wave Offset must
2040                                                     be added by the kernel
2041                                                     machine code and the result
2042                                                     moved to the FLAT_SCRATCH
2043                                                     SGPR which is SGPRn-6 and
2044                                                     SGPRn-5. It is used as the
2045                                                     FLAT SCRATCH BASE in flat
2046                                                     memory instructions. then
2047                                                     Private Segment Size 1 The
2048                                                     32 bit byte size of a
2049                                                     (enable_sgpr_private single
2050                                                     work-item's
2051                                                     scratch_segment_size) memory
2052                                                     allocation. This is the
2053                                                     value from the kernel
2054                                                     dispatch packet Private
2055                                                     Segment Byte Size rounded up
2056                                                     by CP to a multiple of
2057                                                     DWORD.
2058
2059                                                   Having CP load it once avoids
2060                                                   loading it at the beginning of
2061                                                   every wavefront.
2062
2063                                                   This is not used for
2064                                                   GFX7-GFX8 since it is the same
2065                                                   value as the second SGPR of
2066                                                   Flat Scratch Init. However, it
2067                                                   may be needed for GFX9 which
2068                                                   changes the meaning of the
2069                                                   Flat Scratch Init value.
2070      then       Grid Work-Group Count X    1      32 bit count of the number of
2071                 (enable_sgpr_grid                 work-groups in the X dimension
2072                 _workgroup_count_X)               for the grid being
2073                                                   executed. Computed from the
2074                                                   fields in the kernel dispatch
2075                                                   packet as ((grid_size.x +
2076                                                   workgroup_size.x - 1) /
2077                                                   workgroup_size.x).
2078      then       Grid Work-Group Count Y    1      32 bit count of the number of
2079                 (enable_sgpr_grid                 work-groups in the Y dimension
2080                 _workgroup_count_Y &&             for the grid being
2081                 less than 16 previous             executed. Computed from the
2082                 SGPRs)                            fields in the kernel dispatch
2083                                                   packet as ((grid_size.y +
2084                                                   workgroup_size.y - 1) /
2085                                                   workgroupSize.y).
2086
2087                                                   Only initialized if <16
2088                                                   previous SGPRs initialized.
2089      then       Grid Work-Group Count Z    1      32 bit count of the number of
2090                 (enable_sgpr_grid                 work-groups in the Z dimension
2091                 _workgroup_count_Z &&             for the grid being
2092                 less than 16 previous             executed. Computed from the
2093                 SGPRs)                            fields in the kernel dispatch
2094                                                   packet as ((grid_size.z +
2095                                                   workgroup_size.z - 1) /
2096                                                   workgroupSize.z).
2097
2098                                                   Only initialized if <16
2099                                                   previous SGPRs initialized.
2100      then       Work-Group Id X            1      32 bit work-group id in X
2101                 (enable_sgpr_workgroup_id         dimension of grid for
2102                 _X)                               wavefront.
2103      then       Work-Group Id Y            1      32 bit work-group id in Y
2104                 (enable_sgpr_workgroup_id         dimension of grid for
2105                 _Y)                               wavefront.
2106      then       Work-Group Id Z            1      32 bit work-group id in Z
2107                 (enable_sgpr_workgroup_id         dimension of grid for
2108                 _Z)                               wavefront.
2109      then       Work-Group Info            1      {first_wave, 14'b0000,
2110                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2111                 _info)                            threadgroup_size_in_waves[5:0]}
2112      then       Scratch Wave Offset        1      32 bit byte offset from base
2113                 (enable_sgpr_private              of scratch base of queue
2114                 _segment_wave_offset)             executing the kernel
2115                                                   dispatch. Must be used as an
2116                                                   offset with Private
2117                                                   segment address when using
2118                                                   Scratch Segment Buffer. It
2119                                                   must be used to set up FLAT
2120                                                   SCRATCH for flat addressing
2121                                                   (see
2122                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2123      ========== ========================== ====== ==============================
2124
2125 The order of the VGPR registers is defined, but the compiler can specify which
2126 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2127 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2128 for enabled registers are dense starting at VGPR0: the first enabled register is
2129 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2130 VGPR number.
2131
2132 VGPR register initial state is defined in
2133 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2134
2135   .. table:: VGPR Register Set Up Order
2136      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2137
2138      ========== ========================== ====== ==============================
2139      VGPR Order Name                       Number Description
2140                 (kernel descriptor enable  of
2141                 field)                     VGPRs
2142      ========== ========================== ====== ==============================
2143      First      Work-Item Id X             1      32 bit work item id in X
2144                 (Always initialized)              dimension of work-group for
2145                                                   wavefront lane.
2146      then       Work-Item Id Y             1      32 bit work item id in Y
2147                 (enable_vgpr_workitem_id          dimension of work-group for
2148                 > 0)                              wavefront lane.
2149      then       Work-Item Id Z             1      32 bit work item id in Z
2150                 (enable_vgpr_workitem_id          dimension of work-group for
2151                 > 1)                              wavefront lane.
2152      ========== ========================== ====== ==============================
2153
2154 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2155
2156 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2157    registers.
2158 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2159    combination including none.
2160 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2161    cannot included with the flat scratch init value which is per queue.
2162 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2163    or (X, Y, Z).
2164
2165 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2166 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2167
2168 The global segment can be accessed either using buffer instructions (GFX6 which
2169 has V# 64 bit address support), flat instructions (GFX7-9), or global
2170 instructions (GFX9).
2171
2172 If buffer operations are used then the compiler can generate a V# with the
2173 following properties:
2174
2175 * base address of 0
2176 * no swizzle
2177 * ATC: 1 if IOMMU present (such as APU)
2178 * ptr64: 1
2179 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2180   APU and NC for dGPU).
2181
2182 .. _amdgpu-amdhsa-kernel-prolog:
2183
2184 Kernel Prolog
2185 ~~~~~~~~~~~~~
2186
2187 .. _amdgpu-amdhsa-m0:
2188
2189 M0
2190 ++
2191
2192 GFX6-GFX8
2193   The M0 register must be initialized with a value at least the total LDS size
2194   if the kernel may access LDS via DS or flat operations. Total LDS size is
2195   available in dispatch packet. For M0, it is also possible to use maximum
2196   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2197   GFX7-GFX8).
2198 GFX9
2199   The M0 register is not used for range checking LDS accesses and so does not
2200   need to be initialized in the prolog.
2201
2202 .. _amdgpu-amdhsa-flat-scratch:
2203
2204 Flat Scratch
2205 ++++++++++++
2206
2207 If the kernel may use flat operations to access scratch memory, the prolog code
2208 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2209 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2210 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2211
2212 GFX6
2213   Flat scratch is not supported.
2214
2215 GFX7-8
2216   1. The low word of Flat Scratch Init is 32 bit byte offset from
2217      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2218      being managed by SPI for the queue executing the kernel dispatch. This is
2219      the same value used in the Scratch Segment Buffer V# base address. The
2220      prolog must add the value of Scratch Wave Offset to get the wave's byte
2221      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2222      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2223      by 8 before moving into FLAT_SCRATCH_LO.
2224   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2225      work-items scratch memory usage. This is directly loaded from the kernel
2226      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2227      DWORD. Having CP load it once avoids loading it at the beginning of every
2228      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2229      SIZE.
2230 GFX9
2231   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2232   memory being managed by SPI for the queue executing the kernel dispatch. The
2233   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2234   pair for use as the flat scratch base in flat memory instructions.
2235
2236 .. _amdgpu-amdhsa-memory-model:
2237
2238 Memory Model
2239 ~~~~~~~~~~~~
2240
2241 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2242 (see :ref:`memmodel`). *The implementation is WIP.*
2243
2244 .. TODO
2245    Update when implementation complete.
2246
2247 The AMDGPU backend supports the memory synchronization scopes specified in
2248 :ref:`amdgpu-memory-scopes`.
2249
2250 The code sequences used to implement the memory model are defined in table
2251 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2252
2253 The sequences specify the order of instructions that a single thread must
2254 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2255 to other memory instructions executed by the same thread. This allows them to be
2256 moved earlier or later which can allow them to be combined with other instances
2257 of the same instruction, or hoisted/sunk out of loops to improve
2258 performance. Only the instructions related to the memory model are given;
2259 additional ``s_waitcnt`` instructions are required to ensure registers are
2260 defined before being used. These may be able to be combined with the memory
2261 model ``s_waitcnt`` instructions as described above.
2262
2263 The AMDGPU backend supports the following memory models:
2264
2265   HSA Memory Model [HSA]_
2266     The HSA memory model uses a single happens-before relation for all address
2267     spaces (see :ref:`amdgpu-address-spaces`).
2268   OpenCL Memory Model [OpenCL]_
2269     The OpenCL memory model which has separate happens-before relations for the
2270     global and local address spaces. Only a fence specifying both global and
2271     local address space, and seq_cst instructions join the relationships. Since
2272     the LLVM ``memfence`` instruction does not allow an address space to be
2273     specified the OpenCL fence has to convervatively assume both local and
2274     global address space was specified. However, optimizations can often be
2275     done to eliminate the additional ``s_waitcnt`` instructions when there are
2276     no intervening memory instructions which access the corresponding address
2277     space. The code sequences in the table indicate what can be omitted for the
2278     OpenCL memory. The target triple environment is used to determine if the
2279     source language is OpenCL (see :ref:`amdgpu-opencl`).
2280
2281 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2282 operations.
2283
2284 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2285 termed vector memory operations.
2286
2287 For GFX6-GFX9:
2288
2289 * Each agent has multiple compute units (CU).
2290 * Each CU has multiple SIMDs that execute wavefronts.
2291 * The wavefronts for a single work-group are executed in the same CU but may be
2292   executed by different SIMDs.
2293 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2294   executing on it.
2295 * All LDS operations of a CU are performed as wavefront wide operations in a
2296   global order and involve no caching. Completion is reported to a wavefront in
2297   execution order.
2298 * The LDS memory has multiple request queues shared by the SIMDs of a
2299   CU. Therefore, the LDS operations performed by different waves of a work-group
2300   can be reordered relative to each other, which can result in reordering the
2301   visibility of vector memory operations with respect to LDS operations of other
2302   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2303   ensure synchronization between LDS operations and vector memory operations
2304   between waves of a work-group, but not between operations performed by the
2305   same wavefront.
2306 * The vector memory operations are performed as wavefront wide operations and
2307   completion is reported to a wavefront in execution order. The exception is
2308   that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2309   vector memory order if they access LDS memory, and out of LDS operation order
2310   if they access global memory.
2311 * The vector memory operations access a single vector L1 cache shared by all
2312   SIMDs a CU. Therefore, no special action is required for coherence between the
2313   lanes of a single wavefront, or for coherence between wavefronts in the same
2314   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2315   executing in different work-groups as they may be executing on different CUs.
2316 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2317   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2318   scalar operations are used in a restricted way so do not impact the memory
2319   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2320 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2321   the same agent.
2322 * The L2 cache has independent channels to service disjoint ranges of virtual
2323   addresses.
2324 * Each CU has a separate request queue per channel. Therefore, the vector and
2325   scalar memory operations performed by waves executing in different work-groups
2326   (which may be executing on different CUs) of an agent can be reordered
2327   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2328   synchronization between vector memory operations of different CUs. It ensures a
2329   previous vector memory operation has completed before executing a subsequent
2330   vector memory or LDS operation and so can be used to meet the requirements of
2331   acquire and release.
2332 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2333   of virtual addresses can be set up to bypass it to ensure system coherence.
2334
2335 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2336 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2337 memory, atomic memory orderings are not meaningful and all accesses are treated
2338 as non-atomic.
2339
2340 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2341 scalar memory instructions). Since the constant address space contents do not
2342 change during the execution of a kernel dispatch it is not legal to perform
2343 stores, and atomic memory orderings are not meaningful and all access are
2344 treated as non-atomic.
2345
2346 A memory synchronization scope wider than work-group is not meaningful for the
2347 group (LDS) address space and is treated as work-group.
2348
2349 The memory model does not support the region address space which is treated as
2350 non-atomic.
2351
2352 Acquire memory ordering is not meaningful on store atomic instructions and is
2353 treated as non-atomic.
2354
2355 Release memory ordering is not meaningful on load atomic instructions and is
2356 treated a non-atomic.
2357
2358 Acquire-release memory ordering is not meaningful on load or store atomic
2359 instructions and is treated as acquire and release respectively.
2360
2361 AMDGPU backend only uses scalar memory operations to access memory that is
2362 proven to not change during the execution of the kernel dispatch. This includes
2363 constant address space and global address space for program scope const
2364 variables. Therefore the kernel machine code does not have to maintain the
2365 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2366 and vector L1 caches are invalidated between kernel dispatches by CP since
2367 constant address space data may change between kernel dispatch executions. See
2368 :ref:`amdgpu-amdhsa-memory-spaces`.
2369
2370 The one execption is if scalar writes are used to spill SGPR registers. In this
2371 case the AMDGPU backend ensures the memory location used to spill is never
2372 accessed by vector memory operations at the same time. If scalar writes are used
2373 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2374 return since the locations may be used for vector memory instructions by a
2375 future wave that uses the same scratch area, or a function call that creates a
2376 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2377 as all scalar writes are write-before-read in the same thread.
2378
2379 Scratch backing memory (which is used for the private address space)
2380 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2381 address space is only accessed by a single thread, and is always
2382 write-before-read, there is never a need to invalidate these entries from the L1
2383 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2384 volatile cache lines.
2385
2386 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2387 to invalidate the L2 cache. This also causes it to be treated as
2388 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2389 (cache coherent) and so the L2 cache will coherent with the CPU and other
2390 agents.
2391
2392   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2393      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2394
2395      ============ ============ ============== ========== ===============================
2396      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2397                   Ordering     Sync Scope     Address
2398                                               Space
2399      ============ ============ ============== ========== ===============================
2400      **Non-Atomic**
2401      -----------------------------------------------------------------------------------
2402      load         *none*       *none*         - global   - !volatile & !nontemporal
2403                                               - generic
2404                                               - private    1. buffer/global/flat_load
2405                                               - constant
2406                                                          - volatile & !nontemporal
2407
2408                                                            1. buffer/global/flat_load
2409                                                               glc=1
2410
2411                                                          - nontemporal
2412
2413                                                            1. buffer/global/flat_load
2414                                                               glc=1 slc=1
2415
2416      load         *none*       *none*         - local    1. ds_load
2417      store        *none*       *none*         - global   - !nontemporal
2418                                               - generic
2419                                               - private    1. buffer/global/flat_store
2420                                               - constant
2421                                                          - nontemporal
2422
2423                                                            1. buffer/global/flat_stote
2424                                                               glc=1 slc=1
2425
2426      store        *none*       *none*         - local    1. ds_store
2427      **Unordered Atomic**
2428      -----------------------------------------------------------------------------------
2429      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2430      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2431      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2432                                                          atomic*.
2433      **Monotonic Atomic**
2434      -----------------------------------------------------------------------------------
2435      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2436                                - wavefront    - generic
2437                                - workgroup
2438      load atomic  monotonic    - singlethread - local    1. ds_load
2439                                - wavefront
2440                                - workgroup
2441      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2442                                - system       - generic     glc=1
2443      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2444                                - wavefront    - generic
2445                                - workgroup
2446                                - agent
2447                                - system
2448      store atomic monotonic    - singlethread - local    1. ds_store
2449                                - wavefront
2450                                - workgroup
2451      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2452                                - wavefront    - generic
2453                                - workgroup
2454                                - agent
2455                                - system
2456      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2457                                - wavefront
2458                                - workgroup
2459      **Acquire Atomic**
2460      -----------------------------------------------------------------------------------
2461      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2462                                - wavefront    - local
2463                                               - generic
2464      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2465      load atomic  acquire      - workgroup    - local    1. ds_load
2466                                                          2. s_waitcnt lgkmcnt(0)
2467
2468                                                            - If OpenCL, omit.
2469                                                            - Must happen before
2470                                                              any following
2471                                                              global/generic
2472                                                              load/load
2473                                                              atomic/store/store
2474                                                              atomic/atomicrmw.
2475                                                            - Ensures any
2476                                                              following global
2477                                                              data read is no
2478                                                              older than the load
2479                                                              atomic value being
2480                                                              acquired.
2481      load atomic  acquire      - workgroup    - generic  1. flat_load
2482                                                          2. s_waitcnt lgkmcnt(0)
2483
2484                                                            - If OpenCL, omit.
2485                                                            - Must happen before
2486                                                              any following
2487                                                              global/generic
2488                                                              load/load
2489                                                              atomic/store/store
2490                                                              atomic/atomicrmw.
2491                                                            - Ensures any
2492                                                              following global
2493                                                              data read is no
2494                                                              older than the load
2495                                                              atomic value being
2496                                                              acquired.
2497      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2498                                - system                     glc=1
2499                                                          2. s_waitcnt vmcnt(0)
2500
2501                                                            - Must happen before
2502                                                              following
2503                                                              buffer_wbinvl1_vol.
2504                                                            - Ensures the load
2505                                                              has completed
2506                                                              before invalidating
2507                                                              the cache.
2508
2509                                                          3. buffer_wbinvl1_vol
2510
2511                                                            - Must happen before
2512                                                              any following
2513                                                              global/generic
2514                                                              load/load
2515                                                              atomic/atomicrmw.
2516                                                            - Ensures that
2517                                                              following
2518                                                              loads will not see
2519                                                              stale global data.
2520
2521      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2522                                - system                  2. s_waitcnt vmcnt(0) &
2523                                                             lgkmcnt(0)
2524
2525                                                            - If OpenCL omit
2526                                                              lgkmcnt(0).
2527                                                            - Must happen before
2528                                                              following
2529                                                              buffer_wbinvl1_vol.
2530                                                            - Ensures the flat_load
2531                                                              has completed
2532                                                              before invalidating
2533                                                              the cache.
2534
2535                                                          3. buffer_wbinvl1_vol
2536
2537                                                            - Must happen before
2538                                                              any following
2539                                                              global/generic
2540                                                              load/load
2541                                                              atomic/atomicrmw.
2542                                                            - Ensures that
2543                                                              following loads
2544                                                              will not see stale
2545                                                              global data.
2546
2547      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2548                                - wavefront    - local
2549                                               - generic
2550      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2551      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2552                                                          2. waitcnt lgkmcnt(0)
2553
2554                                                            - If OpenCL, omit.
2555                                                            - Must happen before
2556                                                              any following
2557                                                              global/generic
2558                                                              load/load
2559                                                              atomic/store/store
2560                                                              atomic/atomicrmw.
2561                                                            - Ensures any
2562                                                              following global
2563                                                              data read is no
2564                                                              older than the
2565                                                              atomicrmw value
2566                                                              being acquired.
2567
2568      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2569                                                          2. waitcnt lgkmcnt(0)
2570
2571                                                            - If OpenCL, omit.
2572                                                            - Must happen before
2573                                                              any following
2574                                                              global/generic
2575                                                              load/load
2576                                                              atomic/store/store
2577                                                              atomic/atomicrmw.
2578                                                            - Ensures any
2579                                                              following global
2580                                                              data read is no
2581                                                              older than the
2582                                                              atomicrmw value
2583                                                              being acquired.
2584
2585      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2586                                - system                  2. s_waitcnt vmcnt(0)
2587
2588                                                            - Must happen before
2589                                                              following
2590                                                              buffer_wbinvl1_vol.
2591                                                            - Ensures the
2592                                                              atomicrmw has
2593                                                              completed before
2594                                                              invalidating the
2595                                                              cache.
2596
2597                                                          3. buffer_wbinvl1_vol
2598
2599                                                            - Must happen before
2600                                                              any following
2601                                                              global/generic
2602                                                              load/load
2603                                                              atomic/atomicrmw.
2604                                                            - Ensures that
2605                                                              following loads
2606                                                              will not see stale
2607                                                              global data.
2608
2609      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2610                                - system                  2. s_waitcnt vmcnt(0) &
2611                                                             lgkmcnt(0)
2612
2613                                                            - If OpenCL, omit
2614                                                              lgkmcnt(0).
2615                                                            - Must happen before
2616                                                              following
2617                                                              buffer_wbinvl1_vol.
2618                                                            - Ensures the
2619                                                              atomicrmw has
2620                                                              completed before
2621                                                              invalidating the
2622                                                              cache.
2623
2624                                                          3. buffer_wbinvl1_vol
2625
2626                                                            - Must happen before
2627                                                              any following
2628                                                              global/generic
2629                                                              load/load
2630                                                              atomic/atomicrmw.
2631                                                            - Ensures that
2632                                                              following loads
2633                                                              will not see stale
2634                                                              global data.
2635
2636      fence        acquire      - singlethread *none*     *none*
2637                                - wavefront
2638      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2639
2640                                                            - If OpenCL and
2641                                                              address space is
2642                                                              not generic, omit.
2643                                                            - However, since LLVM
2644                                                              currently has no
2645                                                              address space on
2646                                                              the fence need to
2647                                                              conservatively
2648                                                              always generate. If
2649                                                              fence had an
2650                                                              address space then
2651                                                              set to address
2652                                                              space of OpenCL
2653                                                              fence flag, or to
2654                                                              generic if both
2655                                                              local and global
2656                                                              flags are
2657                                                              specified.
2658                                                            - Must happen after
2659                                                              any preceding
2660                                                              local/generic load
2661                                                              atomic/atomicrmw
2662                                                              with an equal or
2663                                                              wider sync scope
2664                                                              and memory ordering
2665                                                              stronger than
2666                                                              unordered (this is
2667                                                              termed the
2668                                                              fence-paired-atomic).
2669                                                            - Must happen before
2670                                                              any following
2671                                                              global/generic
2672                                                              load/load
2673                                                              atomic/store/store
2674                                                              atomic/atomicrmw.
2675                                                            - Ensures any
2676                                                              following global
2677                                                              data read is no
2678                                                              older than the
2679                                                              value read by the
2680                                                              fence-paired-atomic.
2681
2682      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2683                                - system                     vmcnt(0)
2684
2685                                                            - If OpenCL and
2686                                                              address space is
2687                                                              not generic, omit
2688                                                              lgkmcnt(0).
2689                                                            - However, since LLVM
2690                                                              currently has no
2691                                                              address space on
2692                                                              the fence need to
2693                                                              conservatively
2694                                                              always generate
2695                                                              (see comment for
2696                                                              previous fence).
2697                                                            - Could be split into
2698                                                              separate s_waitcnt
2699                                                              vmcnt(0) and
2700                                                              s_waitcnt
2701                                                              lgkmcnt(0) to allow
2702                                                              them to be
2703                                                              independently moved
2704                                                              according to the
2705                                                              following rules.
2706                                                            - s_waitcnt vmcnt(0)
2707                                                              must happen after
2708                                                              any preceding
2709                                                              global/generic load
2710                                                              atomic/atomicrmw
2711                                                              with an equal or
2712                                                              wider sync scope
2713                                                              and memory ordering
2714                                                              stronger than
2715                                                              unordered (this is
2716                                                              termed the
2717                                                              fence-paired-atomic).
2718                                                            - s_waitcnt lgkmcnt(0)
2719                                                              must happen after
2720                                                              any preceding
2721                                                              local/generic load
2722                                                              atomic/atomicrmw
2723                                                              with an equal or
2724                                                              wider sync scope
2725                                                              and memory ordering
2726                                                              stronger than
2727                                                              unordered (this is
2728                                                              termed the
2729                                                              fence-paired-atomic).
2730                                                            - Must happen before
2731                                                              the following
2732                                                              buffer_wbinvl1_vol.
2733                                                            - Ensures that the
2734                                                              fence-paired atomic
2735                                                              has completed
2736                                                              before invalidating
2737                                                              the
2738                                                              cache. Therefore
2739                                                              any following
2740                                                              locations read must
2741                                                              be no older than
2742                                                              the value read by
2743                                                              the
2744                                                              fence-paired-atomic.
2745
2746                                                          2. buffer_wbinvl1_vol
2747
2748                                                            - Must happen before any
2749                                                              following global/generic
2750                                                              load/load
2751                                                              atomic/store/store
2752                                                              atomic/atomicrmw.
2753                                                            - Ensures that
2754                                                              following loads
2755                                                              will not see stale
2756                                                              global data.
2757
2758      **Release Atomic**
2759      -----------------------------------------------------------------------------------
2760      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2761                                - wavefront    - local
2762                                               - generic
2763      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2764
2765                                                            - If OpenCL, omit.
2766                                                            - Must happen after
2767                                                              any preceding
2768                                                              local/generic
2769                                                              load/store/load
2770                                                              atomic/store
2771                                                              atomic/atomicrmw.
2772                                                            - Must happen before
2773                                                              the following
2774                                                              store.
2775                                                            - Ensures that all
2776                                                              memory operations
2777                                                              to local have
2778                                                              completed before
2779                                                              performing the
2780                                                              store that is being
2781                                                              released.
2782
2783                                                          2. buffer/global/flat_store
2784      store atomic release      - workgroup    - local    1. ds_store
2785      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2786
2787                                                            - If OpenCL, omit.
2788                                                            - Must happen after
2789                                                              any preceding
2790                                                              local/generic
2791                                                              load/store/load
2792                                                              atomic/store
2793                                                              atomic/atomicrmw.
2794                                                            - Must happen before
2795                                                              the following
2796                                                              store.
2797                                                            - Ensures that all
2798                                                              memory operations
2799                                                              to local have
2800                                                              completed before
2801                                                              performing the
2802                                                              store that is being
2803                                                              released.
2804
2805                                                          2. flat_store
2806      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2807                                - system       - generic     vmcnt(0)
2808
2809                                                            - If OpenCL, omit
2810                                                              lgkmcnt(0).
2811                                                            - Could be split into
2812                                                              separate s_waitcnt
2813                                                              vmcnt(0) and
2814                                                              s_waitcnt
2815                                                              lgkmcnt(0) to allow
2816                                                              them to be
2817                                                              independently moved
2818                                                              according to the
2819                                                              following rules.
2820                                                            - s_waitcnt vmcnt(0)
2821                                                              must happen after
2822                                                              any preceding
2823                                                              global/generic
2824                                                              load/store/load
2825                                                              atomic/store
2826                                                              atomic/atomicrmw.
2827                                                            - s_waitcnt lgkmcnt(0)
2828                                                              must happen after
2829                                                              any preceding
2830                                                              local/generic
2831                                                              load/store/load
2832                                                              atomic/store
2833                                                              atomic/atomicrmw.
2834                                                            - Must happen before
2835                                                              the following
2836                                                              store.
2837                                                            - Ensures that all
2838                                                              memory operations
2839                                                              to memory have
2840                                                              completed before
2841                                                              performing the
2842                                                              store that is being
2843                                                              released.
2844
2845                                                          2. buffer/global/ds/flat_store
2846      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2847                                - wavefront    - local
2848                                               - generic
2849      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2850
2851                                                            - If OpenCL, omit.
2852                                                            - Must happen after
2853                                                              any preceding
2854                                                              local/generic
2855                                                              load/store/load
2856                                                              atomic/store
2857                                                              atomic/atomicrmw.
2858                                                            - Must happen before
2859                                                              the following
2860                                                              atomicrmw.
2861                                                            - Ensures that all
2862                                                              memory operations
2863                                                              to local have
2864                                                              completed before
2865                                                              performing the
2866                                                              atomicrmw that is
2867                                                              being released.
2868
2869                                                          2. buffer/global/flat_atomic
2870      atomicrmw    release      - workgroup    - local    1. ds_atomic
2871      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2872
2873                                                            - If OpenCL, omit.
2874                                                            - Must happen after
2875                                                              any preceding
2876                                                              local/generic
2877                                                              load/store/load
2878                                                              atomic/store
2879                                                              atomic/atomicrmw.
2880                                                            - Must happen before
2881                                                              the following
2882                                                              atomicrmw.
2883                                                            - Ensures that all
2884                                                              memory operations
2885                                                              to local have
2886                                                              completed before
2887                                                              performing the
2888                                                              atomicrmw that is
2889                                                              being released.
2890
2891                                                          2. flat_atomic
2892      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
2893                                - system       - generic     vmcnt(0)
2894
2895                                                            - If OpenCL, omit
2896                                                              lgkmcnt(0).
2897                                                            - Could be split into
2898                                                              separate s_waitcnt
2899                                                              vmcnt(0) and
2900                                                              s_waitcnt
2901                                                              lgkmcnt(0) to allow
2902                                                              them to be
2903                                                              independently moved
2904                                                              according to the
2905                                                              following rules.
2906                                                            - s_waitcnt vmcnt(0)
2907                                                              must happen after
2908                                                              any preceding
2909                                                              global/generic
2910                                                              load/store/load
2911                                                              atomic/store
2912                                                              atomic/atomicrmw.
2913                                                            - s_waitcnt lgkmcnt(0)
2914                                                              must happen after
2915                                                              any preceding
2916                                                              local/generic
2917                                                              load/store/load
2918                                                              atomic/store
2919                                                              atomic/atomicrmw.
2920                                                            - Must happen before
2921                                                              the following
2922                                                              atomicrmw.
2923                                                            - Ensures that all
2924                                                              memory operations
2925                                                              to global and local
2926                                                              have completed
2927                                                              before performing
2928                                                              the atomicrmw that
2929                                                              is being released.
2930
2931                                                          2. buffer/global/ds/flat_atomic
2932      fence        release      - singlethread *none*     *none*
2933                                - wavefront
2934      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2935
2936                                                            - If OpenCL and
2937                                                              address space is
2938                                                              not generic, omit.
2939                                                            - However, since LLVM
2940                                                              currently has no
2941                                                              address space on
2942                                                              the fence need to
2943                                                              conservatively
2944                                                              always generate. If
2945                                                              fence had an
2946                                                              address space then
2947                                                              set to address
2948                                                              space of OpenCL
2949                                                              fence flag, or to
2950                                                              generic if both
2951                                                              local and global
2952                                                              flags are
2953                                                              specified.
2954                                                            - Must happen after
2955                                                              any preceding
2956                                                              local/generic
2957                                                              load/load
2958                                                              atomic/store/store
2959                                                              atomic/atomicrmw.
2960                                                            - Must happen before
2961                                                              any following store
2962                                                              atomic/atomicrmw
2963                                                              with an equal or
2964                                                              wider sync scope
2965                                                              and memory ordering
2966                                                              stronger than
2967                                                              unordered (this is
2968                                                              termed the
2969                                                              fence-paired-atomic).
2970                                                            - Ensures that all
2971                                                              memory operations
2972                                                              to local have
2973                                                              completed before
2974                                                              performing the
2975                                                              following
2976                                                              fence-paired-atomic.
2977
2978      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2979                                - system                     vmcnt(0)
2980
2981                                                            - If OpenCL and
2982                                                              address space is
2983                                                              not generic, omit
2984                                                              lgkmcnt(0).
2985                                                            - If OpenCL and
2986                                                              address space is
2987                                                              local, omit
2988                                                              vmcnt(0).
2989                                                            - However, since LLVM
2990                                                              currently has no
2991                                                              address space on
2992                                                              the fence need to
2993                                                              conservatively
2994                                                              always generate. If
2995                                                              fence had an
2996                                                              address space then
2997                                                              set to address
2998                                                              space of OpenCL
2999                                                              fence flag, or to
3000                                                              generic if both
3001                                                              local and global
3002                                                              flags are
3003                                                              specified.
3004                                                            - Could be split into
3005                                                              separate s_waitcnt
3006                                                              vmcnt(0) and
3007                                                              s_waitcnt
3008                                                              lgkmcnt(0) to allow
3009                                                              them to be
3010                                                              independently moved
3011                                                              according to the
3012                                                              following rules.
3013                                                            - s_waitcnt vmcnt(0)
3014                                                              must happen after
3015                                                              any preceding
3016                                                              global/generic
3017                                                              load/store/load
3018                                                              atomic/store
3019                                                              atomic/atomicrmw.
3020                                                            - s_waitcnt lgkmcnt(0)
3021                                                              must happen after
3022                                                              any preceding
3023                                                              local/generic
3024                                                              load/store/load
3025                                                              atomic/store
3026                                                              atomic/atomicrmw.
3027                                                            - Must happen before
3028                                                              any following store
3029                                                              atomic/atomicrmw
3030                                                              with an equal or
3031                                                              wider sync scope
3032                                                              and memory ordering
3033                                                              stronger than
3034                                                              unordered (this is
3035                                                              termed the
3036                                                              fence-paired-atomic).
3037                                                            - Ensures that all
3038                                                              memory operations
3039                                                              have
3040                                                              completed before
3041                                                              performing the
3042                                                              following
3043                                                              fence-paired-atomic.
3044
3045      **Acquire-Release Atomic**
3046      -----------------------------------------------------------------------------------
3047      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3048                                - wavefront    - local
3049                                               - generic
3050      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3051
3052                                                            - If OpenCL, omit.
3053                                                            - Must happen after
3054                                                              any preceding
3055                                                              local/generic
3056                                                              load/store/load
3057                                                              atomic/store
3058                                                              atomic/atomicrmw.
3059                                                            - Must happen before
3060                                                              the following
3061                                                              atomicrmw.
3062                                                            - Ensures that all
3063                                                              memory operations
3064                                                              to local have
3065                                                              completed before
3066                                                              performing the
3067                                                              atomicrmw that is
3068                                                              being released.
3069
3070                                                          2. buffer/global/flat_atomic
3071      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3072                                                          2. s_waitcnt lgkmcnt(0)
3073
3074                                                            - If OpenCL, omit.
3075                                                            - Must happen before
3076                                                              any following
3077                                                              global/generic
3078                                                              load/load
3079                                                              atomic/store/store
3080                                                              atomic/atomicrmw.
3081                                                            - Ensures any
3082                                                              following global
3083                                                              data read is no
3084                                                              older than the load
3085                                                              atomic value being
3086                                                              acquired.
3087
3088      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3089
3090                                                            - If OpenCL, omit.
3091                                                            - Must happen after
3092                                                              any preceding
3093                                                              local/generic
3094                                                              load/store/load
3095                                                              atomic/store
3096                                                              atomic/atomicrmw.
3097                                                            - Must happen before
3098                                                              the following
3099                                                              atomicrmw.
3100                                                            - Ensures that all
3101                                                              memory operations
3102                                                              to local have
3103                                                              completed before
3104                                                              performing the
3105                                                              atomicrmw that is
3106                                                              being released.
3107
3108                                                          2. flat_atomic
3109                                                          3. s_waitcnt lgkmcnt(0)
3110
3111                                                            - If OpenCL, omit.
3112                                                            - Must happen before
3113                                                              any following
3114                                                              global/generic
3115                                                              load/load
3116                                                              atomic/store/store
3117                                                              atomic/atomicrmw.
3118                                                            - Ensures any
3119                                                              following global
3120                                                              data read is no
3121                                                              older than the load
3122                                                              atomic value being
3123                                                              acquired.
3124
3125      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3126                                - system                     vmcnt(0)
3127
3128                                                            - If OpenCL, omit
3129                                                              lgkmcnt(0).
3130                                                            - Could be split into
3131                                                              separate s_waitcnt
3132                                                              vmcnt(0) and
3133                                                              s_waitcnt
3134                                                              lgkmcnt(0) to allow
3135                                                              them to be
3136                                                              independently moved
3137                                                              according to the
3138                                                              following rules.
3139                                                            - s_waitcnt vmcnt(0)
3140                                                              must happen after
3141                                                              any preceding
3142                                                              global/generic
3143                                                              load/store/load
3144                                                              atomic/store
3145                                                              atomic/atomicrmw.
3146                                                            - s_waitcnt lgkmcnt(0)
3147                                                              must happen after
3148                                                              any preceding
3149                                                              local/generic
3150                                                              load/store/load
3151                                                              atomic/store
3152                                                              atomic/atomicrmw.
3153                                                            - Must happen before
3154                                                              the following
3155                                                              atomicrmw.
3156                                                            - Ensures that all
3157                                                              memory operations
3158                                                              to global have
3159                                                              completed before
3160                                                              performing the
3161                                                              atomicrmw that is
3162                                                              being released.
3163
3164                                                          2. buffer/global/flat_atomic
3165                                                          3. s_waitcnt vmcnt(0)
3166
3167                                                            - Must happen before
3168                                                              following
3169                                                              buffer_wbinvl1_vol.
3170                                                            - Ensures the
3171                                                              atomicrmw has
3172                                                              completed before
3173                                                              invalidating the
3174                                                              cache.
3175
3176                                                          4. buffer_wbinvl1_vol
3177
3178                                                            - Must happen before
3179                                                              any following
3180                                                              global/generic
3181                                                              load/load
3182                                                              atomic/atomicrmw.
3183                                                            - Ensures that
3184                                                              following loads
3185                                                              will not see stale
3186                                                              global data.
3187
3188      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3189                                - system                     vmcnt(0)
3190
3191                                                            - If OpenCL, omit
3192                                                              lgkmcnt(0).
3193                                                            - Could be split into
3194                                                              separate s_waitcnt
3195                                                              vmcnt(0) and
3196                                                              s_waitcnt
3197                                                              lgkmcnt(0) to allow
3198                                                              them to be
3199                                                              independently moved
3200                                                              according to the
3201                                                              following rules.
3202                                                            - s_waitcnt vmcnt(0)
3203                                                              must happen after
3204                                                              any preceding
3205                                                              global/generic
3206                                                              load/store/load
3207                                                              atomic/store
3208                                                              atomic/atomicrmw.
3209                                                            - s_waitcnt lgkmcnt(0)
3210                                                              must happen after
3211                                                              any preceding
3212                                                              local/generic
3213                                                              load/store/load
3214                                                              atomic/store
3215                                                              atomic/atomicrmw.
3216                                                            - Must happen before
3217                                                              the following
3218                                                              atomicrmw.
3219                                                            - Ensures that all
3220                                                              memory operations
3221                                                              to global have
3222                                                              completed before
3223                                                              performing the
3224                                                              atomicrmw that is
3225                                                              being released.
3226
3227                                                          2. flat_atomic
3228                                                          3. s_waitcnt vmcnt(0) &
3229                                                             lgkmcnt(0)
3230
3231                                                            - If OpenCL, omit
3232                                                              lgkmcnt(0).
3233                                                            - Must happen before
3234                                                              following
3235                                                              buffer_wbinvl1_vol.
3236                                                            - Ensures the
3237                                                              atomicrmw has
3238                                                              completed before
3239                                                              invalidating the
3240                                                              cache.
3241
3242                                                          4. buffer_wbinvl1_vol
3243
3244                                                            - Must happen before
3245                                                              any following
3246                                                              global/generic
3247                                                              load/load
3248                                                              atomic/atomicrmw.
3249                                                            - Ensures that
3250                                                              following loads
3251                                                              will not see stale
3252                                                              global data.
3253
3254      fence        acq_rel      - singlethread *none*     *none*
3255                                - wavefront
3256      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3257
3258                                                            - If OpenCL and
3259                                                              address space is
3260                                                              not generic, omit.
3261                                                            - However,
3262                                                              since LLVM
3263                                                              currently has no
3264                                                              address space on
3265                                                              the fence need to
3266                                                              conservatively
3267                                                              always generate
3268                                                              (see comment for
3269                                                              previous fence).
3270                                                            - Must happen after
3271                                                              any preceding
3272                                                              local/generic
3273                                                              load/load
3274                                                              atomic/store/store
3275                                                              atomic/atomicrmw.
3276                                                            - Must happen before
3277                                                              any following
3278                                                              global/generic
3279                                                              load/load
3280                                                              atomic/store/store
3281                                                              atomic/atomicrmw.
3282                                                            - Ensures that all
3283                                                              memory operations
3284                                                              to local have
3285                                                              completed before
3286                                                              performing any
3287                                                              following global
3288                                                              memory operations.
3289                                                            - Ensures that the
3290                                                              preceding
3291                                                              local/generic load
3292                                                              atomic/atomicrmw
3293                                                              with an equal or
3294                                                              wider sync scope
3295                                                              and memory ordering
3296                                                              stronger than
3297                                                              unordered (this is
3298                                                              termed the
3299                                                              acquire-fence-paired-atomic
3300                                                              ) has completed
3301                                                              before following
3302                                                              global memory
3303                                                              operations. This
3304                                                              satisfies the
3305                                                              requirements of
3306                                                              acquire.
3307                                                            - Ensures that all
3308                                                              previous memory
3309                                                              operations have
3310                                                              completed before a
3311                                                              following
3312                                                              local/generic store
3313                                                              atomic/atomicrmw
3314                                                              with an equal or
3315                                                              wider sync scope
3316                                                              and memory ordering
3317                                                              stronger than
3318                                                              unordered (this is
3319                                                              termed the
3320                                                              release-fence-paired-atomic
3321                                                              ). This satisfies the
3322                                                              requirements of
3323                                                              release.
3324
3325      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3326                                - system                     vmcnt(0)
3327
3328                                                            - If OpenCL and
3329                                                              address space is
3330                                                              not generic, omit
3331                                                              lgkmcnt(0).
3332                                                            - However, since LLVM
3333                                                              currently has no
3334                                                              address space on
3335                                                              the fence need to
3336                                                              conservatively
3337                                                              always generate
3338                                                              (see comment for
3339                                                              previous fence).
3340                                                            - Could be split into
3341                                                              separate s_waitcnt
3342                                                              vmcnt(0) and
3343                                                              s_waitcnt
3344                                                              lgkmcnt(0) to allow
3345                                                              them to be
3346                                                              independently moved
3347                                                              according to the
3348                                                              following rules.
3349                                                            - s_waitcnt vmcnt(0)
3350                                                              must happen after
3351                                                              any preceding
3352                                                              global/generic
3353                                                              load/store/load
3354                                                              atomic/store
3355                                                              atomic/atomicrmw.
3356                                                            - s_waitcnt lgkmcnt(0)
3357                                                              must happen after
3358                                                              any preceding
3359                                                              local/generic
3360                                                              load/store/load
3361                                                              atomic/store
3362                                                              atomic/atomicrmw.
3363                                                            - Must happen before
3364                                                              the following
3365                                                              buffer_wbinvl1_vol.
3366                                                            - Ensures that the
3367                                                              preceding
3368                                                              global/local/generic
3369                                                              load
3370                                                              atomic/atomicrmw
3371                                                              with an equal or
3372                                                              wider sync scope
3373                                                              and memory ordering
3374                                                              stronger than
3375                                                              unordered (this is
3376                                                              termed the
3377                                                              acquire-fence-paired-atomic
3378                                                              ) has completed
3379                                                              before invalidating
3380                                                              the cache. This
3381                                                              satisfies the
3382                                                              requirements of
3383                                                              acquire.
3384                                                            - Ensures that all
3385                                                              previous memory
3386                                                              operations have
3387                                                              completed before a
3388                                                              following
3389                                                              global/local/generic
3390                                                              store
3391                                                              atomic/atomicrmw
3392                                                              with an equal or
3393                                                              wider sync scope
3394                                                              and memory ordering
3395                                                              stronger than
3396                                                              unordered (this is
3397                                                              termed the
3398                                                              release-fence-paired-atomic
3399                                                              ). This satisfies the
3400                                                              requirements of
3401                                                              release.
3402
3403                                                          2. buffer_wbinvl1_vol
3404
3405                                                            - Must happen before
3406                                                              any following
3407                                                              global/generic
3408                                                              load/load
3409                                                              atomic/store/store
3410                                                              atomic/atomicrmw.
3411                                                            - Ensures that
3412                                                              following loads
3413                                                              will not see stale
3414                                                              global data. This
3415                                                              satisfies the
3416                                                              requirements of
3417                                                              acquire.
3418
3419      **Sequential Consistent Atomic**
3420      -----------------------------------------------------------------------------------
3421      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3422                                - wavefront    - local    load atomic acquire,
3423                                               - generic  except must generated
3424                                                          all instructions even
3425                                                          for OpenCL.*
3426      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3427                                               - generic
3428                                                            - Must
3429                                                              happen after
3430                                                              preceding
3431                                                              global/generic load
3432                                                              atomic/store
3433                                                              atomic/atomicrmw
3434                                                              with memory
3435                                                              ordering of seq_cst
3436                                                              and with equal or
3437                                                              wider sync scope.
3438                                                              (Note that seq_cst
3439                                                              fences have their
3440                                                              own s_waitcnt
3441                                                              lgkmcnt(0) and so do
3442                                                              not need to be
3443                                                              considered.)
3444                                                            - Ensures any
3445                                                              preceding
3446                                                              sequential
3447                                                              consistent local
3448                                                              memory instructions
3449                                                              have completed
3450                                                              before executing
3451                                                              this sequentially
3452                                                              consistent
3453                                                              instruction. This
3454                                                              prevents reordering
3455                                                              a seq_cst store
3456                                                              followed by a
3457                                                              seq_cst load. (Note
3458                                                              that seq_cst is
3459                                                              stronger than
3460                                                              acquire/release as
3461                                                              the reordering of
3462                                                              load acquire
3463                                                              followed by a store
3464                                                              release is
3465                                                              prevented by the
3466                                                              waitcnt of
3467                                                              the release, but
3468                                                              there is nothing
3469                                                              preventing a store
3470                                                              release followed by
3471                                                              load acquire from
3472                                                              competing out of
3473                                                              order.)
3474
3475                                                          2. *Following
3476                                                             instructions same as
3477                                                             corresponding load
3478                                                             atomic acquire,
3479                                                             except must generated
3480                                                             all instructions even
3481                                                             for OpenCL.*
3482      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3483                                                          load atomic acquire,
3484                                                          except must generated
3485                                                          all instructions even
3486                                                          for OpenCL.*
3487      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3488                                - system       - generic     vmcnt(0)
3489
3490                                                            - Could be split into
3491                                                              separate s_waitcnt
3492                                                              vmcnt(0)
3493                                                              and s_waitcnt
3494                                                              lgkmcnt(0) to allow
3495                                                              them to be
3496                                                              independently moved
3497                                                              according to the
3498                                                              following rules.
3499                                                            - waitcnt lgkmcnt(0)
3500                                                              must happen after
3501                                                              preceding
3502                                                              global/generic load
3503                                                              atomic/store
3504                                                              atomic/atomicrmw
3505                                                              with memory
3506                                                              ordering of seq_cst
3507                                                              and with equal or
3508                                                              wider sync scope.
3509                                                              (Note that seq_cst
3510                                                              fences have their
3511                                                              own s_waitcnt
3512                                                              lgkmcnt(0) and so do
3513                                                              not need to be
3514                                                              considered.)
3515                                                            - waitcnt vmcnt(0)
3516                                                              must happen after
3517                                                              preceding
3518                                                              global/generic load
3519                                                              atomic/store
3520                                                              atomic/atomicrmw
3521                                                              with memory
3522                                                              ordering of seq_cst
3523                                                              and with equal or
3524                                                              wider sync scope.
3525                                                              (Note that seq_cst
3526                                                              fences have their
3527                                                              own s_waitcnt
3528                                                              vmcnt(0) and so do
3529                                                              not need to be
3530                                                              considered.)
3531                                                            - Ensures any
3532                                                              preceding
3533                                                              sequential
3534                                                              consistent global
3535                                                              memory instructions
3536                                                              have completed
3537                                                              before executing
3538                                                              this sequentially
3539                                                              consistent
3540                                                              instruction. This
3541                                                              prevents reordering
3542                                                              a seq_cst store
3543                                                              followed by a
3544                                                              seq_cst load. (Note
3545                                                              that seq_cst is
3546                                                              stronger than
3547                                                              acquire/release as
3548                                                              the reordering of
3549                                                              load acquire
3550                                                              followed by a store
3551                                                              release is
3552                                                              prevented by the
3553                                                              waitcnt of
3554                                                              the release, but
3555                                                              there is nothing
3556                                                              preventing a store
3557                                                              release followed by
3558                                                              load acquire from
3559                                                              competing out of
3560                                                              order.)
3561
3562                                                          2. *Following
3563                                                             instructions same as
3564                                                             corresponding load
3565                                                             atomic acquire,
3566                                                             except must generated
3567                                                             all instructions even
3568                                                             for OpenCL.*
3569      store atomic seq_cst      - singlethread - global   *Same as corresponding
3570                                - wavefront    - local    store atomic release,
3571                                - workgroup    - generic  except must generated
3572                                                          all instructions even
3573                                                          for OpenCL.*
3574      store atomic seq_cst      - agent        - global   *Same as corresponding
3575                                - system       - generic  store atomic release,
3576                                                          except must generated
3577                                                          all instructions even
3578                                                          for OpenCL.*
3579      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3580                                - wavefront    - local    atomicrmw acq_rel,
3581                                - workgroup    - generic  except must generated
3582                                                          all instructions even
3583                                                          for OpenCL.*
3584      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3585                                - system       - generic  atomicrmw acq_rel,
3586                                                          except must generated
3587                                                          all instructions even
3588                                                          for OpenCL.*
3589      fence        seq_cst      - singlethread *none*     *Same as corresponding
3590                                - wavefront               fence acq_rel,
3591                                - workgroup               except must generated
3592                                - agent                   all instructions even
3593                                - system                  for OpenCL.*
3594      ============ ============ ============== ========== ===============================
3595
3596 The memory order also adds the single thread optimization constrains defined in
3597 table
3598 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3599
3600   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3601      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3602
3603      ============ ==============================================================
3604      LLVM Memory  Optimization Constraints
3605      Ordering
3606      ============ ==============================================================
3607      unordered    *none*
3608      monotonic    *none*
3609      acquire      - If a load atomic/atomicrmw then no following load/load
3610                     atomic/store/ store atomic/atomicrmw/fence instruction can
3611                     be moved before the acquire.
3612                   - If a fence then same as load atomic, plus no preceding
3613                     associated fence-paired-atomic can be moved after the fence.
3614      release      - If a store atomic/atomicrmw then no preceding load/load
3615                     atomic/store/ store atomic/atomicrmw/fence instruction can
3616                     be moved after the release.
3617                   - If a fence then same as store atomic, plus no following
3618                     associated fence-paired-atomic can be moved before the
3619                     fence.
3620      acq_rel      Same constraints as both acquire and release.
3621      seq_cst      - If a load atomic then same constraints as acquire, plus no
3622                     preceding sequentially consistent load atomic/store
3623                     atomic/atomicrmw/fence instruction can be moved after the
3624                     seq_cst.
3625                   - If a store atomic then the same constraints as release, plus
3626                     no following sequentially consistent load atomic/store
3627                     atomic/atomicrmw/fence instruction can be moved before the
3628                     seq_cst.
3629                   - If an atomicrmw/fence then same constraints as acq_rel.
3630      ============ ==============================================================
3631
3632 Trap Handler ABI
3633 ~~~~~~~~~~~~~~~~
3634
3635 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3636 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3637 the ``s_trap`` instruction with the following usage:
3638
3639   .. table:: AMDGPU Trap Handler for AMDHSA OS
3640      :name: amdgpu-trap-handler-for-amdhsa-os-table
3641
3642      =================== =============== =============== =======================
3643      Usage               Code Sequence   Trap Handler    Description
3644                                          Inputs
3645      =================== =============== =============== =======================
3646      reserved            ``s_trap 0x00``                 Reserved by hardware.
3647      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3648                                            ``queue_ptr`` ``debugtrap``
3649                                          ``VGPR0``:      intrinsic (not
3650                                            ``arg``       implemented).
3651      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3652                                            ``queue_ptr`` terminated and its
3653                                                          associated queue put
3654                                                          into the error state.
3655      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3656                                            ``queue_ptr`` installed handled
3657                                                          same as ``llvm.trap``.
3658      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3659                                                          breakpoints.
3660      debugger            ``s_trap 0x08``                 Reserved for debugger.
3661      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3662      debugger            ``s_trap 0xff``                 Reserved for debugger.
3663      =================== =============== =============== =======================
3664
3665 Unspecified OS
3666 --------------
3667
3668 This section provides code conventions used when the target triple OS is
3669 empty (see :ref:`amdgpu-target-triples`).
3670
3671 Trap Handler ABI
3672 ~~~~~~~~~~~~~~~~
3673
3674 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3675 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3676 instructions are handled as follows:
3677
3678   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3679      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3680
3681      =============== =============== ===========================================
3682      Usage           Code Sequence   Description
3683      =============== =============== ===========================================
3684      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3685      llvm.debugtrap  *none*          Compiler warning given that there is no
3686                                      trap handler installed.
3687      =============== =============== ===========================================
3688
3689 Source Languages
3690 ================
3691
3692 .. _amdgpu-opencl:
3693
3694 OpenCL
3695 ------
3696
3697 When generating code for the OpenCL language the target triple environment
3698 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3699
3700 When the language is OpenCL the following differences occur:
3701
3702 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3703 2. The AMDGPU backend adds additional arguments to the kernel.
3704 3. Additional metadata is generated
3705    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3706
3707 .. TODO
3708    Specify what affect this has. Hidden arguments added. Additional metadata
3709    generated.
3710
3711 .. _amdgpu-hcc:
3712
3713 HCC
3714 ---
3715
3716 When generating code for the OpenCL language the target triple environment
3717 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3718
3719 When the language is OpenCL the following differences occur:
3720
3721 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3722
3723 .. TODO
3724    Specify what affect this has.
3725
3726 Assembler
3727 ---------
3728
3729 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3730 It supports AMDGCN GFX6-GFX8.
3731
3732 This section describes general syntax for instructions and operands. For more
3733 information about instructions, their semantics and supported combinations of
3734 operands, refer to one of instruction set architecture manuals
3735 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
3736
3737 An instruction has the following syntax (register operands are normally
3738 comma-separated while extra operands are space-separated):
3739
3740 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3741
3742 Operands
3743 ~~~~~~~~
3744
3745 The following syntax for register operands is supported:
3746
3747 * SGPR registers: s0, ... or s[0], ...
3748 * VGPR registers: v0, ... or v[0], ...
3749 * TTMP registers: ttmp0, ... or ttmp[0], ...
3750 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3751 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3752 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3753 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3754 * Register index expressions: v[2*2], s[1-1:2-1]
3755 * 'off' indicates that an operand is not enabled
3756
3757 The following extra operands are supported:
3758
3759 * offset, offset0, offset1
3760 * idxen, offen bits
3761 * glc, slc, tfe bits
3762 * waitcnt: integer or combination of counter values
3763 * VOP3 modifiers:
3764
3765   - abs (\| \|), neg (\-)
3766
3767 * DPP modifiers:
3768
3769   - row_shl, row_shr, row_ror, row_rol
3770   - row_mirror, row_half_mirror, row_bcast
3771   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3772   - row_mask, bank_mask, bound_ctrl
3773
3774 * SDWA modifiers:
3775
3776   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3777   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3778   - abs, neg, sext
3779
3780 Instruction Examples
3781 ~~~~~~~~~~~~~~~~~~~~
3782
3783 DS
3784 ~~
3785
3786 .. code-block:: nasm
3787
3788   ds_add_u32 v2, v4 offset:16
3789   ds_write_src2_b64 v2 offset0:4 offset1:8
3790   ds_cmpst_f32 v2, v4, v6
3791   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3792
3793
3794 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3795
3796 FLAT
3797 ++++
3798
3799 .. code-block:: nasm
3800
3801   flat_load_dword v1, v[3:4]
3802   flat_store_dwordx3 v[3:4], v[5:7]
3803   flat_atomic_swap v1, v[3:4], v5 glc
3804   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3805   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3806
3807 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3808
3809 MUBUF
3810 +++++
3811
3812 .. code-block:: nasm
3813
3814   buffer_load_dword v1, off, s[4:7], s1
3815   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3816   buffer_store_format_xy v[1:2], off, s[4:7], s1
3817   buffer_wbinvl1
3818   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3819
3820 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3821
3822 SMRD/SMEM
3823 +++++++++
3824
3825 .. code-block:: nasm
3826
3827   s_load_dword s1, s[2:3], 0xfc
3828   s_load_dwordx8 s[8:15], s[2:3], s4
3829   s_load_dwordx16 s[88:103], s[2:3], s4
3830   s_dcache_inv_vol
3831   s_memtime s[4:5]
3832
3833 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3834
3835 SOP1
3836 ++++
3837
3838 .. code-block:: nasm
3839
3840   s_mov_b32 s1, s2
3841   s_mov_b64 s[0:1], 0x80000000
3842   s_cmov_b32 s1, 200
3843   s_wqm_b64 s[2:3], s[4:5]
3844   s_bcnt0_i32_b64 s1, s[2:3]
3845   s_swappc_b64 s[2:3], s[4:5]
3846   s_cbranch_join s[4:5]
3847
3848 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3849
3850 SOP2
3851 ++++
3852
3853 .. code-block:: nasm
3854
3855   s_add_u32 s1, s2, s3
3856   s_and_b64 s[2:3], s[4:5], s[6:7]
3857   s_cselect_b32 s1, s2, s3
3858   s_andn2_b32 s2, s4, s6
3859   s_lshr_b64 s[2:3], s[4:5], s6
3860   s_ashr_i32 s2, s4, s6
3861   s_bfm_b64 s[2:3], s4, s6
3862   s_bfe_i64 s[2:3], s[4:5], s6
3863   s_cbranch_g_fork s[4:5], s[6:7]
3864
3865 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3866
3867 SOPC
3868 ++++
3869
3870 .. code-block:: nasm
3871
3872   s_cmp_eq_i32 s1, s2
3873   s_bitcmp1_b32 s1, s2
3874   s_bitcmp0_b64 s[2:3], s4
3875   s_setvskip s3, s5
3876
3877 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3878
3879 SOPP
3880 ++++
3881
3882 .. code-block:: nasm
3883
3884   s_barrier
3885   s_nop 2
3886   s_endpgm
3887   s_waitcnt 0 ; Wait for all counters to be 0
3888   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3889   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3890   s_sethalt 9
3891   s_sleep 10
3892   s_sendmsg 0x1
3893   s_sendmsg sendmsg(MSG_INTERRUPT)
3894   s_trap 1
3895
3896 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3897
3898 Unless otherwise mentioned, little verification is performed on the operands
3899 of SOPP Instructions, so it is up to the programmer to be familiar with the
3900 range or acceptable values.
3901
3902 VALU
3903 ++++
3904
3905 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3906 the assembler will automatically use optimal encoding based on its operands.
3907 To force specific encoding, one can add a suffix to the opcode of the instruction:
3908
3909 * _e32 for 32-bit VOP1/VOP2/VOPC
3910 * _e64 for 64-bit VOP3
3911 * _dpp for VOP_DPP
3912 * _sdwa for VOP_SDWA
3913
3914 VOP1/VOP2/VOP3/VOPC examples:
3915
3916 .. code-block:: nasm
3917
3918   v_mov_b32 v1, v2
3919   v_mov_b32_e32 v1, v2
3920   v_nop
3921   v_cvt_f64_i32_e32 v[1:2], v2
3922   v_floor_f32_e32 v1, v2
3923   v_bfrev_b32_e32 v1, v2
3924   v_add_f32_e32 v1, v2, v3
3925   v_mul_i32_i24_e64 v1, v2, 3
3926   v_mul_i32_i24_e32 v1, -3, v3
3927   v_mul_i32_i24_e32 v1, -100, v3
3928   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3929   v_max_f16_e32 v1, v2, v3
3930
3931 VOP_DPP examples:
3932
3933 .. code-block:: nasm
3934
3935   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3936   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3937   v_mov_b32 v0, v0 wave_shl:1
3938   v_mov_b32 v0, v0 row_mirror
3939   v_mov_b32 v0, v0 row_bcast:31
3940   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3941   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3942   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3943
3944 VOP_SDWA examples:
3945
3946 .. code-block:: nasm
3947
3948   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3949   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3950   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3951   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3952   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3953
3954 For full list of supported instructions, refer to "Vector ALU instructions".
3955
3956 HSA Code Object Directives
3957 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3958
3959 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3960 one can specify them with assembler directives.
3961
3962 .hsa_code_object_version major, minor
3963 +++++++++++++++++++++++++++++++++++++
3964
3965 *major* and *minor* are integers that specify the version of the HSA code
3966 object that will be generated by the assembler.
3967
3968 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3969 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3970
3971
3972 *major*, *minor*, and *stepping* are all integers that describe the instruction
3973 set architecture (ISA) version of the assembly program.
3974
3975 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
3976 "AMD" and *arch* should always be equal to "AMDGPU".
3977
3978 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3979 from the value of the -mcpu option that is passed to the assembler.
3980
3981 .amdgpu_hsa_kernel (name)
3982 +++++++++++++++++++++++++
3983
3984 This directives specifies that the symbol with given name is a kernel entry point
3985 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3986
3987 .amd_kernel_code_t
3988 ++++++++++++++++++
3989
3990 This directive marks the beginning of a list of key / value pairs that are used
3991 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3992 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
3993 any amd_kernel_code_t values that are unspecified a default value will be
3994 used.  The default value for all keys is 0, with the following exceptions:
3995
3996 - *kernel_code_version_major* defaults to 1.
3997 - *machine_kind* defaults to 1.
3998 - *machine_version_major*, *machine_version_minor*, and
3999   *machine_version_stepping* are derived from the value of the -mcpu option
4000   that is passed to the assembler.
4001 - *kernel_code_entry_byte_offset* defaults to 256.
4002 - *wavefront_size* defaults to 6.
4003 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4004   *private_segment_alignment* default to 4. Note that alignments are specified
4005   as a power of two, so a value of **n** means an alignment of 2^ **n**.
4006
4007 The *.amd_kernel_code_t* directive must be placed immediately after the
4008 function label and before any instructions.
4009
4010 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4011 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4012
4013 Here is an example of a minimal amd_kernel_code_t specification:
4014
4015 .. code-block:: none
4016
4017    .hsa_code_object_version 1,0
4018    .hsa_code_object_isa
4019
4020    .hsatext
4021    .globl  hello_world
4022    .p2align 8
4023    .amdgpu_hsa_kernel hello_world
4024
4025    hello_world:
4026
4027       .amd_kernel_code_t
4028          enable_sgpr_kernarg_segment_ptr = 1
4029          is_ptr64 = 1
4030          compute_pgm_rsrc1_vgprs = 0
4031          compute_pgm_rsrc1_sgprs = 0
4032          compute_pgm_rsrc2_user_sgpr = 2
4033          kernarg_segment_byte_size = 8
4034          wavefront_sgpr_count = 2
4035          workitem_vgpr_count = 3
4036      .end_amd_kernel_code_t
4037
4038      s_load_dwordx2 s[0:1], s[0:1] 0x0
4039      v_mov_b32 v0, 3.14159
4040      s_waitcnt lgkmcnt(0)
4041      v_mov_b32 v1, s0
4042      v_mov_b32 v2, s1
4043      flat_store_dword v[1:2], v0
4044      s_endpgm
4045    .Lfunc_end0:
4046         .size   hello_world, .Lfunc_end0-hello_world
4047
4048 Additional Documentation
4049 ========================
4050
4051 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4052 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4053 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4054 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4055 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4056 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4057 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4058 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
4059 .. [AMD-OpenCL_Programming-Guide]  `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4060 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4061 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4062 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4063 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4064 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4065 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
4066 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
4067 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4068 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4069 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__