docs/AMDGPUUsage.rst

==============================
User Guide for AMDGPU Back-end
==============================

Introduction
============

The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with
the R600 family up until the current Volcanic Islands (GCN Gen 3).

Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

Conventions
===========

Address Spaces
--------------

The AMDGPU back-end uses the following address space mapping:

   ================== =================== ==============
   LLVM Address Space DWARF Address Space Memory Space
   ================== =================== ==============
   0                  1                   Private
   1                  N/A                 Global
   2                  N/A                 Constant
   3                  2                   Local
   4                  N/A                 Generic (Flat)
   5                  N/A                 Region
   ================== =================== ==============

The terminology in the table, aside from the region memory space, is from the
OpenCL standard.

LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF
Address Space is emitted in DWARF, and is used by tools such as debuggers and
profilers.

Trap Handler ABI
----------------
The OS element of the target triple controls the trap handler behavior.

HSA OS
^^^^^^
For code objects generated by the AMDGPU back-end for the HSA OS, the runtime
installs a trap handler that supports the s_trap instruction with the following
usage:

 +--------------+-------------+-------------------+----------------------------+
 |Usage         |Code Sequence|Trap Handler Inputs|Description                 |
 +==============+=============+===================+============================+
 |reserved      |s_trap 0x00  |                   |Reserved by hardware.       |
 +--------------+-------------+-------------------+----------------------------+
 |HSA debugtrap |s_trap 0x01  |SGPR0-1: queue_ptr |Reserved for HSA debugtrap  |
 |(arg)         |             |VGPR0: arg         |intrinsic (not implemented).|
 +--------------+-------------+-------------------+----------------------------+
 |llvm.trap     |s_trap 0x02  |SGPR0-1: queue_ptr |Causes dispatch to be       |
 |              |             |                   |terminated and its          |
 |              |             |                   |associated queue put into   |
 |              |             |                   |the error state.            |
 +--------------+-------------+-------------------+----------------------------+
 |llvm.debugtrap|s_trap 0x03  |SGPR0-1: queue_ptr |If no debugger is installed,|
 |              |             |                   |handled like llvm.trap.     |
 +--------------+-------------+-------------------+----------------------------+
 |debugger      |s_trap 0x07  |                   |Reserved for debugger       |
 |breakpoint    |             |                   |breakpoints.                |
 +--------------+-------------+-------------------+----------------------------+
 |debugger      |s_trap 0x08  |                   |Reserved for debugger.      |
 +--------------+-------------+-------------------+----------------------------+
 |debugger      |s_trap 0xfe  |                   |Reserved for debugger.      |
 +--------------+-------------+-------------------+----------------------------+
 |debugger      |s_trap 0xff  |                   |Reserved for debugger.      |
 +--------------+-------------+-------------------+----------------------------+

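As a rough illustration of the llvm.trap row, the sequence below places the
queue pointer in SGPR0-1 and then issues the trap. It assumes, purely for the
sake of the example, that the kernel received queue_ptr in s[4:5]; the actual
source register depends on how the kernel is set up.

.. code-block:: nasm

  s_mov_b64 s[0:1], s[4:5]   ; trap handler input: queue_ptr (s[4:5] is an assumed source)
  s_trap 2                   ; llvm.trap: terminate the dispatch
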
Non-HSA OS
^^^^^^^^^^
For code objects generated by the AMDGPU back-end for a non-HSA OS, the runtime
does not install a trap handler. The llvm.trap and llvm.debugtrap intrinsics
are handled as follows:

   =============== ============= ===============================================
   Usage           Code Sequence Description
   =============== ============= ===============================================
   llvm.trap       s_endpgm      Causes wavefront to be terminated.
   llvm.debugtrap  s_nop         No operation. Compiler warning generated that
                                 there is no trap handler installed.
   =============== ============= ===============================================

Assembler
=========

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to one of the Instruction Set Architecture
manuals.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*


Operands
--------

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc.: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

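The lines below are a small sketch of how some of these register forms might
appear in practice. The opcodes and register numbers are arbitrary choices
made for illustration; each comment names the equivalent plain spelling.

.. code-block:: nasm

  s_mov_b32 s[0], s1          ; s[0] is the same register as s0
  v_mov_b32 v[2*2], v0        ; index expression: v[2*2] is v4
  s_mov_b64 s[1-1:2-1], vcc   ; range expression: s[1-1:2-1] is s[0:1]
  s_mov_b64 [s2, s3], exec    ; register list: [s2, s3] is s[2:3]
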
The following extra operands are supported:

* offset, offset0, offset1
* idxen, offen bits
* glc, slc, tfe bits
* waitcnt: integer or combination of counter values
* VOP3 modifiers:

  - abs (\| \|), neg (\-)

* DPP modifiers:

  - row_shl, row_shr, row_ror, row_rol
  - row_mirror, row_half_mirror, row_bcast
  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
  - row_mask, bank_mask, bound_ctrl

* SDWA modifiers:

  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
  - abs, neg, sext

DS Instruction Examples
-----------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]


For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT Instruction Examples
--------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF Instruction Examples
---------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM Instruction Examples
-------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1 Instruction Examples
--------------------------

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC Instruction Examples
--------------------------

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP Instruction Examples
--------------------------

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait until the vmcnt counter is at most 1
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the operands.
To force a specific encoding, one can add a suffix to the opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

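As a sketch of how the suffixes might be applied, the lines below spell the
same opcode with each forced encoding. The register numbers and modifier
values are arbitrary, and the DPP and SDWA forms assume a target that provides
those encodings.

.. code-block:: nasm

  v_add_f32 v1, v2, v3        ; no suffix: the assembler chooses the encoding
  v_add_f32_e32 v1, v2, v3    ; force the 32-bit VOP2 encoding
  v_add_f32_e64 v1, v2, v3    ; force the 64-bit VOP3 encoding
  v_add_f32_dpp v1, v2, v3 row_shl:1 row_mask:0xf bank_mask:0xf
  v_add_f32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
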
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

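The VOP3 abs and neg modifiers listed under Operands might be written as in
the following sketch. The opcodes and register numbers are arbitrary, and the
_e64 suffix is used only to make the VOP3 encoding explicit.

.. code-block:: nasm

  v_add_f32_e64 v1, -v2, |v3|    ; v1 = (-v2) + abs(v3)
  v_mul_f32_e64 v1, |v2|, -v3    ; v1 = abs(v2) * (-v3)
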
VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

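For illustration, the optional arguments can also be written out explicitly,
as in the sketch below. The version numbers are example values only and are
not derived from any particular -mcpu setting; they should be replaced with
the ISA version of the intended target.

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"
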
.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
any amd_kernel_code_t values that are unspecified, a default value will be
used.  The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4.  Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

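To make the power-of-two encoding concrete, the fragment below is a sketch
that sets two of the alignment keys. With these arbitrarily chosen values, the
kernarg segment would be aligned to 2^4 = 16 bytes (the same as the default)
and the private segment to 2^6 = 64 bytes; all keys not listed keep their
defaults.

.. code-block:: none

   .amd_kernel_code_t
      kernarg_segment_alignment = 4
      private_segment_alignment = 6
   .end_amd_kernel_code_t
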
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world