==============================
User Guide for AMDGPU Back-end
==============================

The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with
the R600 family and continuing through the current Volcanic Islands (GCN Gen 3).

Refer to the `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

The AMDGPU back-end uses the following address space mapping:

================== =================== ==============
LLVM Address Space DWARF Address Space Memory Space
================== =================== ==============
================== =================== ==============

The terminology in the table, aside from the region memory space, is from the

LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF
Address Space is emitted in DWARF, and is used by tools, such as debugger,

The OS element of the target triple controls the trap handler behavior.

For code objects generated by the AMDGPU back-end for the HSA OS, the runtime
installs a trap handler that supports the s_trap instruction with the following
usage:

+--------------+-------------+-------------------+----------------------------+
|Usage         |Code Sequence|Trap Handler Inputs|Description                 |
+==============+=============+===================+============================+
|reserved      |s_trap 0x00  |                   |Reserved by hardware.       |
+--------------+-------------+-------------------+----------------------------+
|HSA debugtrap |s_trap 0x01  |SGPR0-1: queue_ptr |Reserved for HSA debugtrap  |
|(arg)         |             |VGPR0: arg         |intrinsic (not implemented).|
+--------------+-------------+-------------------+----------------------------+
|llvm.trap     |s_trap 0x02  |SGPR0-1: queue_ptr |Causes dispatch to be       |
|              |             |                   |terminated and its          |
|              |             |                   |associated queue put into   |
|              |             |                   |the error state.            |
+--------------+-------------+-------------------+----------------------------+
|llvm.debugtrap|s_trap 0x03  |SGPR0-1: queue_ptr |If debugger not installed   |
|              |             |                   |handled same as llvm.trap.  |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x07  |                   |Reserved for debugger       |
|breakpoint    |             |                   |breakpoints.                |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x08  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xfe  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xff  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+

For code objects generated by the AMDGPU back-end for a non-HSA OS, the runtime
does not install a trap handler. The llvm.trap and llvm.debugtrap intrinsics are
handled as follows:

=============== ============= ===============================================
Usage           Code Sequence Description
=============== ============= ===============================================
llvm.trap       s_endpgm      Causes wavefront to be terminated.
llvm.debugtrap  s_nop         No operation. Compiler warning generated that
                              there is no trap handler installed.
=============== ============= ===============================================
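
The two trap tables can be summarized as a small lookup. The following Python sketch is illustrative only: the dictionary layout and the ``trap_code_sequence`` name are hypothetical and are not part of LLVM. The code sequences themselves are copied from the tables above.

```python
# Code sequences copied from the HSA and non-HSA tables above.
# The names below are hypothetical helpers, not LLVM APIs.
HSA_TRAPS = {
    "llvm.trap": "s_trap 0x02",
    "llvm.debugtrap": "s_trap 0x03",
}
NON_HSA_TRAPS = {
    "llvm.trap": "s_endpgm",
    "llvm.debugtrap": "s_nop",
}


def trap_code_sequence(intrinsic, os_is_hsa):
    """Return the documented code sequence for a trap intrinsic."""
    table = HSA_TRAPS if os_is_hsa else NON_HSA_TRAPS
    return table[intrinsic]
```

For example, ``trap_code_sequence("llvm.debugtrap", os_is_hsa=False)`` selects the ``s_nop`` row, which is the case where the compiler also warns that no trap handler is installed.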

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to one of the Instruction Set Architecture
manuals.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc.: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

The following extra operands are supported:

* offset, offset0, offset1

* waitcnt: integer or combination of counter values

- abs (\| \|), neg (\-)

- row_shl, row_shr, row_ror, row_rol
- row_mirror, row_half_mirror, row_bcast
- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
- row_mask, bank_mask, bound_ctrl

- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)

DS Instructions Examples
------------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT Instruction Examples
--------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF Instruction Examples
---------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM Instruction Examples
-------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4

For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1 Instruction Examples
--------------------------

.. code-block:: nasm

  s_mov_b64 s[0:1], 0x80000000
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC Instruction Examples
--------------------------

.. code-block:: nasm

  s_bitcmp0_b64 s[2:3], s4

For the full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP Instruction Examples
--------------------------

.. code-block:: nasm

  s_waitcnt 0                                  ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0)  ; Equivalent to above
  s_waitcnt vmcnt(1)                           ; Wait for vmcnt counter to be 1.
  s_sendmsg sendmsg(MSG_INTERRUPT)

For the full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
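
The equivalence between ``s_waitcnt 0`` and the three named counters follows from the counters being packed into one immediate. The Python sketch below illustrates that packing. The field positions and widths in ``FIELDS`` are an assumption for the ISAs covered here and are not stated in this document, so check them against the ISA manual before relying on them. ``encode_waitcnt`` is a hypothetical helper, not an LLVM API.

```python
# ASSUMED layout (verify against the ISA manual): name -> (bit shift, bit width).
FIELDS = {"vmcnt": (0, 4), "expcnt": (4, 3), "lgkmcnt": (8, 4)}


def encode_waitcnt(**counters):
    """Pack named counters into one immediate.

    Counters that are not named keep their maximum value, which in this
    sketch means "do not wait on this counter".
    """
    unknown = set(counters) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown counters: {sorted(unknown)}")
    imm = 0
    for name, (shift, width) in FIELDS.items():
        limit = (1 << width) - 1
        value = counters.get(name, limit)
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
        imm |= value << shift
    return imm
```

With this layout, naming all three counters as zero yields the immediate 0, matching the first two example lines, while ``encode_waitcnt(vmcnt=1)`` leaves the other two fields at their maximum.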

Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the
instruction's operands. To force a specific encoding, add a suffix to the
opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
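
As an illustration of the suffix convention, this Python sketch separates an opcode into its base mnemonic and the encoding it forces. It only knows the two suffixes listed above, and ``split_encoding`` is a hypothetical name used for this example.

```python
# Only the two suffixes listed above are modeled.
SUFFIXES = {
    "_e32": "32-bit VOP1/VOP2/VOPC",
    "_e64": "64-bit VOP3",
}


def split_encoding(opcode):
    """Return (base mnemonic, forced encoding), or None when no suffix is given."""
    for suffix, encoding in SUFFIXES.items():
        if opcode.endswith(suffix):
            return opcode[: -len(suffix)], encoding
    return opcode, None
```

An opcode without a suffix, such as ``v_mov_b32``, comes back with ``None``, which corresponds to letting the assembler choose the encoding.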

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions".

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

The *.amd_kernel_code_t* directive marks the beginning of a list of key / value
pairs that are used to specify the amd_kernel_code_t object that will be
emitted by the assembler. The list must be terminated by the
*.end_amd_kernel_code_t* directive. Any amd_kernel_code_t value that is left
unspecified takes a default value. The default value for all keys is 0, with
the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.
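
The default rules can be restated as a small lookup. The following Python sketch is only an illustration: ``key_value`` and ``alignment_from_exponent`` are hypothetical names, and the *machine_version_* keys are omitted because they depend on the -mcpu option rather than on a fixed default.

```python
# Non-zero defaults listed above; every other key defaults to 0.
DEFAULTS = {
    "kernel_code_version_major": 1,
    "machine_kind": 1,
    "kernel_code_entry_byte_offset": 256,
    "wavefront_size": 6,
    "kernarg_segment_alignment": 4,
    "group_segment_alignment": 4,
    "private_segment_alignment": 4,
}


def key_value(specified, key):
    """Return the explicit value, else the listed default, else 0."""
    return specified.get(key, DEFAULTS.get(key, 0))


def alignment_from_exponent(n):
    """Alignments are stored as a power of two: n means 2**n."""
    return 1 << n
```

Under this reading, the default alignment value of 4 corresponds to an alignment of 16, and a key such as *kernarg_segment_byte_size* is 0 unless the source sets it.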

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: nasm

  .hsa_code_object_version 1,0

  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      kernarg_segment_byte_size = 8
      wavefront_sgpr_count = 2
      workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    flat_store_dword v[1:2], v0
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world