docs/NVPTXUsage.rst

   1 =============================
   2 User Guide for NVPTX Back-end
   3 =============================
   4
   5 .. contents::
   6    :local:
   7    :depth: 3
   8
   9
  10 Introduction
  11 ============
  12
  13 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
  14 along with a defined set of conventions used to represent GPU programming
  15 concepts. This document provides an overview of the general usage of the back-
  16 end, including a description of the conventions used and the set of accepted
  17 LLVM IR.
  18
  19 .. note::
  20
  21    This document assumes a basic familiarity with CUDA and the PTX
  22    assembly language. Information about the CUDA Driver API and the PTX assembly
  23    language can be found in the `CUDA documentation
  24    <http://docs.nvidia.com/cuda/index.html>`_.
  25
  26
  27
  28 Conventions
  29 ===========
  30
  31 Marking Functions as Kernels
  32 ----------------------------
  33
  34 In PTX, there are two types of functions: *device functions*, which are only
  35 callable by device code, and *kernel functions*, which are callable by host
  36 code. By default, the back-end will emit device functions. Metadata is used to
  37 declare a function as a kernel function. This metadata is attached to the
  38 ``nvvm.annotations`` named metadata object, and has the following format:
  39
  40 .. code-block:: text
  41
  42    !0 = !{<function-ref>, metadata !"kernel", i32 1}
  43
  44 The first parameter is a reference to the kernel function. The following
  45 example shows a kernel function calling a device function in LLVM IR. The
  46 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
  47
  48 .. code-block:: llvm
  49
  50     define float @my_fmad(float %x, float %y, float %z) {
  51       %mul = fmul float %x, %y
  52       %add = fadd float %mul, %z
  53       ret float %add
  54     }
  55
  56     define void @my_kernel(float* %ptr) {
  57       %val = load float, float* %ptr
  58       %ret = call float @my_fmad(float %val, float %val, float %val)
  59       store float %ret, float* %ptr
  60       ret void
  61     }
  62
  63     !nvvm.annotations = !{!1}
  64     !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}
  65
  66 When compiled, the PTX kernel functions are callable by host-side code.
  67
  68
  69 .. _address_spaces:
  70
  71 Address Spaces
  72 --------------
  73
  74 The NVPTX back-end uses the following address space mapping:
  75
  76    ============= ======================
  77    Address Space Memory Space
  78    ============= ======================
  79    0             Generic
  80    1             Global
  81    2             Internal Use
  82    3             Shared
  83    4             Constant
  84    5             Local
  85    ============= ======================
  86
  87 Every global variable and pointer type is assigned to one of these address
  88 spaces, with 0 being the default address space. Intrinsics are provided which
  89 can be used to convert pointers between the generic and non-generic address
  90 spaces.
  91
  92 As an example, the following IR will define an array ``@g`` that resides in
  93 global device memory.
  94
  95 .. code-block:: llvm
  96
  97     @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
  98
  99 LLVM IR functions can read and write to this array, and host-side code can
 100 copy data to it by name with the CUDA Driver API.
 101
 102 Note that since address space 0 is the generic space, it is illegal to have
 103 global variables in address space 0.  Address space 0 is the default address
 104 space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
 105 variables.
 106
 107
 108 Triples
 109 -------
 110
 111 The NVPTX target uses the module triple to select between 32/64-bit code
 112 generation and the driver-compiler interface to use. The triple architecture
 113 can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
 114 operating system should be one of ``cuda`` or ``nvcl``, which determines the
 115 interface used by the generated code to communicate with the driver.  Most
 116 users will want to use ``cuda`` as the operating system, which makes the
 117 generated PTX compatible with the CUDA Driver API.
 118
 119 Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
 120
 121 Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
 122
 123
 124
 125 .. _nvptx_intrinsics:
 126
 127 NVPTX Intrinsics
 128 ================
 129
 130 Address Space Conversion
 131 ------------------------
 132
 133 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics
 134 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 135
 136 Syntax:
 137 """""""
 138
 139 These are overloaded intrinsics.  You can use these on any pointer types.
 140
 141 .. code-block:: llvm
 142
 143     declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
 144     declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
 145     declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
 146     declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
 147
 148 Overview:
 149 """""""""
 150
 151 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
 152 address space to a generic address space pointer.
 153
 154 Semantics:
 155 """"""""""
 156
 157 These intrinsics modify the pointer value to be a valid generic address space
 158 pointer.
 159
 160
 161 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics
 162 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 163
 164 Syntax:
 165 """""""
 166
 167 These are overloaded intrinsics.  You can use these on any pointer types.
 168
 169 .. code-block:: llvm
 170
 171     declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
 172     declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
 173     declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
 174     declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)
 175
 176 Overview:
 177 """""""""
 178
 179 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
 180 address space to a pointer in the target address space.  Note that these
 181 intrinsics are only useful if the address space of the target address space of
 182 the pointer is known.  It is not legal to use address space conversion
 183 intrinsics to convert a pointer from one non-generic address space to another
 184 non-generic address space.
 185
 186 Semantics:
 187 """"""""""
 188
 189 These intrinsics modify the pointer value to be a valid pointer in the target
 190 non-generic address space.
 191
 192
 193 Reading PTX Special Registers
 194 -----------------------------
 195
 196 '``llvm.nvvm.read.ptx.sreg.*``'
 197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 198
 199 Syntax:
 200 """""""
 201
 202 .. code-block:: llvm
 203
 204     declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
 205     declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
 206     declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
 207     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 208     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
 209     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
 210     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
 211     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
 212     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
 213     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
 214     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
 215     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
 216     declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
 217
 218 Overview:
 219 """""""""
 220
 221 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
 222 special registers, in particular the kernel launch bounds.  These registers
 223 map in the following way to CUDA builtins:
 224
 225    ============ =====================================
 226    CUDA Builtin PTX Special Register Intrinsic
 227    ============ =====================================
 228    ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
 229    ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
 230    ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
 231    ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
 232    ============ =====================================
 233
 234
 235 Barriers
 236 --------
 237
 238 '``llvm.nvvm.barrier0``'
 239 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 240
 241 Syntax:
 242 """""""
 243
 244 .. code-block:: llvm
 245
 246   declare void @llvm.nvvm.barrier0()
 247
 248 Overview:
 249 """""""""
 250
 251 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 252 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 253
 254
 255 Other Intrinsics
 256 ----------------
 257
 258 For the full set of NVPTX intrinsics, please see the
 259 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
 260
 261
 262 .. _libdevice:
 263
 264 Linking with Libdevice
 265 ======================
 266
 267 The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
 268 implements many common mathematical functions. This library can be used as a
 269 high-performance math library for any compilers using the LLVM NVPTX target.
 270 The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
 271 there is a separate version for each compute architecture.
 272
 273 For a list of all math functions implemented in libdevice, see
 274 `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
 275
 276 To accommodate various math-related compiler flags that can affect code
 277 generation of libdevice code, the library code depends on a special LLVM IR
 278 pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
 279 pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
 280 with constants based on the defined reflection parameters. Such conditional
 281 code often follows a pattern:
 282
 283 .. code-block:: c++
 284
 285   float my_function(float a) {
 286     if (__nvvm_reflect("FASTMATH"))
 287       return my_function_fast(a);
 288     else
 289       return my_function_precise(a);
 290   }
 291
 292 The default value for all unspecified reflection parameters is zero.
 293
 294 The ``NVVMReflect`` pass should be executed early in the optimization
 295 pipeline, immediately after the link stage. The ``internalize`` pass is also
 296 recommended to remove unused math functions from the resulting PTX. For an
 297 input IR module ``module.bc``, the following compilation flow is recommended:
 298
 299 1. Save list of external functions in ``module.bc``
 300 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
 301 3. Internalize all functions not in list from (1)
 302 4. Eliminate all unused internal functions
 303 5. Run ``NVVMReflect`` pass
 304 6. Run standard optimization pipeline
 305
 306 .. note::
 307
 308   ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
 309   libdevice functions. It is possible to link two IR modules that have been
 310   linked against libdevice using different reflection variables.
 311
 312 Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
 313 often leave behind dead code of the form:
 314
 315 .. code-block:: llvm
 316
 317   entry:
 318     ..
 319     br i1 true, label %foo, label %bar
 320   foo:
 321     ..
 322   bar:
 323     ; Dead code
 324     ..
 325
 326 Therefore, it is recommended that ``NVVMReflect`` is executed early in the
 327 optimization pipeline before dead-code elimination.
 328
 329 The NVPTX TargetMachine knows how to schedule ``NVVMReflect`` at the beginning
 330 of your pass manager; just use the following code when setting up your pass
 331 manager:
 332
 333 .. code-block:: c++
 334     std::unique_ptr<TargetMachine> TM = ...;
 335     PassManagerBuilder PMBuilder(...);
 336     PMBuilder.addExtension(
 337         PassManagerBuilder::EP_EarlyAsPossible,
 338         [&](const PassManagerBuilder &, legacy::PassManagerBase &PM) {
 339           TM->addEarlyAsPossiblePasses(PM);
 340         });
 341
 342 Reflection Parameters
 343 ---------------------
 344
 345 The libdevice library currently uses the following reflection parameters to
 346 control code generation:
 347
 348 ==================== ======================================================
 349 Flag                 Description
 350 ==================== ======================================================
 351 ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
 352 ==================== ======================================================
 353
 354 The value of this flag is determined by the "nvvm-reflect-ftz" module flag.
 355 The following sets the ftz flag to 1.
 356
 357 .. code-block:: llvm
 358     !llvm.module.flag = !{!0}
 359     !0 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
 360
 361 (``i32 4`` indicates that the value set here overrides the value in another
 362 module we link with.  See the `LangRef <LangRef.html#module-flags-metadata>`
 363 for details.)
 364
 365 Executing PTX
 366 =============
 367
 368 The most common way to execute PTX assembly on a GPU device is to use the CUDA
 369 Driver API. This API is a low-level interface to the GPU driver and allows for
 370 JIT compilation of PTX code to native GPU machine code.
 371
 372 Initializing the Driver API:
 373
 374 .. code-block:: c++
 375
 376     CUdevice device;
 377     CUcontext context;
 378
 379     // Initialize the driver API
 380     cuInit(0);
 381     // Get a handle to the first compute device
 382     cuDeviceGet(&device, 0);
 383     // Create a compute device context
 384     cuCtxCreate(&context, 0, device);
 385
 386 JIT compiling a PTX string to a device binary:
 387
 388 .. code-block:: c++
 389
 390     CUmodule module;
 391     CUfunction function;
 392
 393     // JIT compile a null-terminated PTX string
 394     cuModuleLoadData(&module, (void*)PTXString);
 395
 396     // Get a handle to the "myfunction" kernel function
 397     cuModuleGetFunction(&function, module, "myfunction");
 398
 399 For full examples of executing PTX assembly, please see the `CUDA Samples
 400 <https://developer.nvidia.com/cuda-downloads>`_ distribution.
 401
 402
 403 Common Issues
 404 =============
 405
 406 ptxas complains of undefined function: __nvvm_reflect
 407 -----------------------------------------------------
 408
 409 When linking with libdevice, the ``NVVMReflect`` pass must be used. See
 410 :ref:`libdevice` for more information.
 411
 412
 413 Tutorial: A Simple Compute Kernel
 414 =================================
 415
 416 To start, let us take a look at a simple compute kernel written directly in
 417 LLVM IR. The kernel implements vector addition, where each thread computes one
 418 element of the output vector C from the input vectors A and B.  To make this
 419 easier, we also assume that only a single CTA (thread block) will be launched,
 420 and that it will be one dimensional.
 421
 422
 423 The Kernel
 424 ----------
 425
 426 .. code-block:: llvm
 427
 428   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 429   target triple = "nvptx64-nvidia-cuda"
 430
 431   ; Intrinsic to read X component of thread ID
 432   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 433
 434   define void @kernel(float addrspace(1)* %A,
 435                       float addrspace(1)* %B,
 436                       float addrspace(1)* %C) {
 437   entry:
 438     ; What is my ID?
 439     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 440
 441     ; Compute pointers into A, B, and C
 442     %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
 443     %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
 444     %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
 445
 446     ; Read A, B
 447     %valA = load float, float addrspace(1)* %ptrA, align 4
 448     %valB = load float, float addrspace(1)* %ptrB, align 4
 449
 450     ; Compute C = A + B
 451     %valC = fadd float %valA, %valB
 452
 453     ; Store back to C
 454     store float %valC, float addrspace(1)* %ptrC, align 4
 455
 456     ret void
 457   }
 458
 459   !nvvm.annotations = !{!0}
 460   !0 = !{void (float addrspace(1)*,
 461                float addrspace(1)*,
 462                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 463
 464
 465 We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
 466
 467 .. code-block:: text
 468
 469   # llc -mcpu=sm_20 kernel.ll -o kernel.ptx
 470
 471
 472 .. note::
 473
 474   If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
 475   in the module data layout string and use ``nvptx-nvidia-cuda`` as the
 476   target triple.
 477
 478
 479 The output we get from ``llc`` (as of LLVM 3.4):
 480
 481 .. code-block:: text
 482
 483   //
 484   // Generated by LLVM NVPTX Back-End
 485   //
 486
 487   .version 3.1
 488   .target sm_20
 489   .address_size 64
 490
 491     // .globl kernel
 492                                           // @kernel
 493   .visible .entry kernel(
 494     .param .u64 kernel_param_0,
 495     .param .u64 kernel_param_1,
 496     .param .u64 kernel_param_2
 497   )
 498   {
 499     .reg .f32   %f<4>;
 500     .reg .s32   %r<2>;
 501     .reg .s64   %rl<8>;
 502
 503   // BB#0:                                // %entry
 504     ld.param.u64    %rl1, [kernel_param_0];
 505     mov.u32         %r1, %tid.x;
 506     mul.wide.s32    %rl2, %r1, 4;
 507     add.s64         %rl3, %rl1, %rl2;
 508     ld.param.u64    %rl4, [kernel_param_1];
 509     add.s64         %rl5, %rl4, %rl2;
 510     ld.param.u64    %rl6, [kernel_param_2];
 511     add.s64         %rl7, %rl6, %rl2;
 512     ld.global.f32   %f1, [%rl3];
 513     ld.global.f32   %f2, [%rl5];
 514     add.f32         %f3, %f1, %f2;
 515     st.global.f32   [%rl7], %f3;
 516     ret;
 517   }
 518
 519
 520 Dissecting the Kernel
 521 ---------------------
 522
 523 Now let us dissect the LLVM IR that makes up this kernel.
 524
 525 Data Layout
 526 ^^^^^^^^^^^
 527
 528 The data layout string determines the size in bits of common data types, their
 529 ABI alignment, and their storage size.  For NVPTX, you should use one of the
 530 following:
 531
 532 32-bit PTX:
 533
 534 .. code-block:: llvm
 535
 536   target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 537
 538 64-bit PTX:
 539
 540 .. code-block:: llvm
 541
 542   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 543
 544
 545 Target Intrinsics
 546 ^^^^^^^^^^^^^^^^^
 547
 548 In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
 549 read the X component of the current thread's ID, which corresponds to a read
 550 of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
 551 intrinsics.  A short list is shown below; please see
 552 ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
 553
 554
 555 ================================================ ====================
 556 Intrinsic                                        CUDA Equivalent
 557 ================================================ ====================
 558 ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
 559 ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
 560 ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
 561 ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
 562 ``void @llvm.nvvm.barrier0()``                   __syncthreads()
 563 ================================================ ====================
 564
 565
 566 Address Spaces
 567 ^^^^^^^^^^^^^^
 568
 569 You may have noticed that all of the pointer types in the LLVM IR example had
 570 an explicit address space specifier. What is address space 1? NVIDIA GPU
 571 devices (generally) have four types of memory:
 572
 573 - Global: Large, off-chip memory
 574 - Shared: Small, on-chip memory shared among all threads in a CTA
 575 - Local: Per-thread, private memory
 576 - Constant: Read-only memory shared across all threads
 577
 578 These different types of memory are represented in LLVM IR as address spaces.
 579 There is also a fifth address space used by the NVPTX code generator that
 580 corresponds to the "generic" address space.  This address space can represent
 581 addresses in any other address space (with a few exceptions).  This allows
 582 users to write IR functions that can load/store memory using the same
 583 instructions. Intrinsics are provided to convert pointers between the generic
 584 and non-generic address spaces.
 585
 586 See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
 587
 588
 589 Kernel Metadata
 590 ^^^^^^^^^^^^^^^
 591
 592 In PTX, a function can be either a `kernel` function (callable from the host
 593 program), or a `device` function (callable only from GPU code). You can think
 594 of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
 595 function as a `kernel` function, we make use of special LLVM metadata. The
 596 NVPTX back-end will look for a named metadata node called
 597 ``nvvm.annotations``. This named metadata must contain a list of metadata that
 598 describe the IR. For our purposes, we need to declare a metadata node that
 599 assigns the "kernel" attribute to the LLVM IR function that should be emitted
 600 as a PTX `kernel` function. These metadata nodes take the form:
 601
 602 .. code-block:: text
 603
 604   !{<function ref>, metadata !"kernel", i32 1}
 605
 606 For the previous example, we have:
 607
 608 .. code-block:: llvm
 609
 610   !nvvm.annotations = !{!0}
 611   !0 = !{void (float addrspace(1)*,
 612                float addrspace(1)*,
 613                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 614
 615 Here, we have a single metadata declaration in ``nvvm.annotations``. This
 616 metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
 617
 618
 619 Running the Kernel
 620 ------------------
 621
 622 Generating PTX from LLVM IR is all well and good, but how do we execute it on
 623 a real GPU device? The CUDA Driver API provides a convenient mechanism for
 624 loading and JIT compiling PTX to a native GPU device, and launching a kernel.
 625 The API is similar to OpenCL.  A simple example showing how to load and
 626 execute our vector addition code is shown below. Note that for brevity this
 627 code does not perform much error checking!
 628
 629 .. note::
 630
 631   You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
 632   compile PTX to machine code (SASS) for a specific GPU architecture. Such
 633   binaries can be loaded by the CUDA Driver API in the same way as PTX. This
 634   can be useful for reducing startup time by precompiling the PTX kernels.
 635
 636
 637 .. code-block:: c++
 638
 639   #include <iostream>
 640   #include <fstream>
 641   #include <cassert>
 642   #include "cuda.h"
 643
 644
 645   void checkCudaErrors(CUresult err) {
 646     assert(err == CUDA_SUCCESS);
 647   }
 648
 649   /// main - Program entry point
 650   int main(int argc, char **argv) {
 651     CUdevice    device;
 652     CUmodule    cudaModule;
 653     CUcontext   context;
 654     CUfunction  function;
 655     CUlinkState linker;
 656     int         devCount;
 657
 658     // CUDA initialization
 659     checkCudaErrors(cuInit(0));
 660     checkCudaErrors(cuDeviceGetCount(&devCount));
 661     checkCudaErrors(cuDeviceGet(&device, 0));
 662
 663     char name[128];
 664     checkCudaErrors(cuDeviceGetName(name, 128, device));
 665     std::cout << "Using CUDA Device [0]: " << name << "\n";
 666
 667     int devMajor, devMinor;
 668     checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
 669     std::cout << "Device Compute Capability: "
 670               << devMajor << "." << devMinor << "\n";
 671     if (devMajor < 2) {
 672       std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
 673       return 1;
 674     }
 675
 676     std::ifstream t("kernel.ptx");
 677     if (!t.is_open()) {
 678       std::cerr << "kernel.ptx not found\n";
 679       return 1;
 680     }
 681     std::string str((std::istreambuf_iterator<char>(t)),
 682                       std::istreambuf_iterator<char>());
 683
 684     // Create driver context
 685     checkCudaErrors(cuCtxCreate(&context, 0, device));
 686
 687     // Create module for object
 688     checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
 689
 690     // Get kernel function
 691     checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
 692
 693     // Device data
 694     CUdeviceptr devBufferA;
 695     CUdeviceptr devBufferB;
 696     CUdeviceptr devBufferC;
 697
 698     checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
 699     checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
 700     checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
 701
 702     float* hostA = new float[16];
 703     float* hostB = new float[16];
 704     float* hostC = new float[16];
 705
 706     // Populate input
 707     for (unsigned i = 0; i != 16; ++i) {
 708       hostA[i] = (float)i;
 709       hostB[i] = (float)(2*i);
 710       hostC[i] = 0.0f;
 711     }
 712
 713     checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
 714     checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
 715
 716
 717     unsigned blockSizeX = 16;
 718     unsigned blockSizeY = 1;
 719     unsigned blockSizeZ = 1;
 720     unsigned gridSizeX  = 1;
 721     unsigned gridSizeY  = 1;
 722     unsigned gridSizeZ  = 1;
 723
 724     // Kernel parameters
 725     void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
 726
 727     std::cout << "Launching kernel\n";
 728
 729     // Kernel launch
 730     checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
 731                                    blockSizeX, blockSizeY, blockSizeZ,
 732                                    0, NULL, KernelParams, NULL));
 733
 734     // Retrieve device data
 735     checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
 736
 737
 738     std::cout << "Results:\n";
 739     for (unsigned i = 0; i != 16; ++i) {
 740       std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
 741     }
 742
 743
 744     // Clean up after ourselves
 745     delete [] hostA;
 746     delete [] hostB;
 747     delete [] hostC;
 748
 749     // Clean-up
 750     checkCudaErrors(cuMemFree(devBufferA));
 751     checkCudaErrors(cuMemFree(devBufferB));
 752     checkCudaErrors(cuMemFree(devBufferC));
 753     checkCudaErrors(cuModuleUnload(cudaModule));
 754     checkCudaErrors(cuCtxDestroy(context));
 755
 756     return 0;
 757   }
 758
 759
 760 You will need to link with the CUDA driver and specify the path to cuda.h.
 761
 762 .. code-block:: text
 763
 764   # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
 765
 766 We don't need to specify a path to ``libcuda.so`` since this is installed in a
 767 system location by the driver, not the CUDA toolkit.
 768
 769 If everything goes as planned, you should see the following output when
 770 running the compiled program:
 771
 772 .. code-block:: text
 773
 774   Using CUDA Device [0]: GeForce GTX 680
 775   Device Compute Capability: 3.0
 776   Launching kernel
 777   Results:
 778   0 + 0 = 0
 779   1 + 2 = 3
 780   2 + 4 = 6
 781   3 + 6 = 9
 782   4 + 8 = 12
 783   5 + 10 = 15
 784   6 + 12 = 18
 785   7 + 14 = 21
 786   8 + 16 = 24
 787   9 + 18 = 27
 788   10 + 20 = 30
 789   11 + 22 = 33
 790   12 + 24 = 36
 791   13 + 26 = 39
 792   14 + 28 = 42
 793   15 + 30 = 45
 794
 795 .. note::
 796
 797   You will likely see a different device identifier based on your hardware
 798
 799
 800 Tutorial: Linking with Libdevice
 801 ================================
 802
 803 In this tutorial, we show a simple example of linking LLVM IR with the
 804 libdevice library. We will use the same kernel as the previous tutorial,
 805 except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
 806 Libdevice provides an ``__nv_powf`` function that we will use.
 807
 808 .. code-block:: llvm
 809
 810   target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
 811   target triple = "nvptx64-nvidia-cuda"
 812
 813   ; Intrinsic to read X component of thread ID
 814   declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 815   ; libdevice function
 816   declare float @__nv_powf(float, float)
 817
 818   define void @kernel(float addrspace(1)* %A,
 819                       float addrspace(1)* %B,
 820                       float addrspace(1)* %C) {
 821   entry:
 822     ; What is my ID?
 823     %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
 824
 825     ; Compute pointers into A, B, and C
 826     %ptrA = getelementptr float, float addrspace(1)* %A, i32 %id
 827     %ptrB = getelementptr float, float addrspace(1)* %B, i32 %id
 828     %ptrC = getelementptr float, float addrspace(1)* %C, i32 %id
 829
 830     ; Read A, B
 831     %valA = load float, float addrspace(1)* %ptrA, align 4
 832     %valB = load float, float addrspace(1)* %ptrB, align 4
 833
 834     ; Compute C = pow(A, B)
 835     %valC = call float @__nv_powf(float %valA, float %valB)
 836
 837     ; Store back to C
 838     store float %valC, float addrspace(1)* %ptrC, align 4
 839
 840     ret void
 841   }
 842
 843   !nvvm.annotations = !{!0}
 844   !0 = !{void (float addrspace(1)*,
 845                float addrspace(1)*,
 846                float addrspace(1)*)* @kernel, !"kernel", i32 1}
 847
 848
 849 To compile this kernel, we perform the following steps:
 850
 851 1. Link with libdevice
 852 2. Internalize all but the public kernel function
 853 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
 854 4. Optimize the linked module
 855 5. Codegen the module
 856
 857
 858 These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
 859 tools. In a complete compiler, these steps can also be performed entirely
 860 programmatically by setting up an appropriate pass configuration (see
 861 :ref:`libdevice`).
 862
 863 .. code-block:: text
 864
 865   # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
 866   # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
 867   # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
 868
 869 .. note::
 870
 871   The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any
 872   undefined variables will default to zero. It is shown here for evaluation
 873   purposes.
 874
 875
 876 This gives us the following PTX (excerpt):
 877
 878 .. code-block:: text
 879
 880   //
 881   // Generated by LLVM NVPTX Back-End
 882   //
 883
 884   .version 3.1
 885   .target sm_20
 886   .address_size 64
 887
 888     // .globl kernel
 889                                           // @kernel
 890   .visible .entry kernel(
 891     .param .u64 kernel_param_0,
 892     .param .u64 kernel_param_1,
 893     .param .u64 kernel_param_2
 894   )
 895   {
 896     .reg .pred  %p<30>;
 897     .reg .f32   %f<111>;
 898     .reg .s32   %r<21>;
 899     .reg .s64   %rl<8>;
 900
 901   // BB#0:                                // %entry
 902     ld.param.u64  %rl2, [kernel_param_0];
 903     mov.u32   %r3, %tid.x;
 904     ld.param.u64  %rl3, [kernel_param_1];
 905     mul.wide.s32  %rl4, %r3, 4;
 906     add.s64   %rl5, %rl2, %rl4;
 907     ld.param.u64  %rl6, [kernel_param_2];
 908     add.s64   %rl7, %rl3, %rl4;
 909     add.s64   %rl1, %rl6, %rl4;
 910     ld.global.f32   %f1, [%rl5];
 911     ld.global.f32   %f2, [%rl7];
 912     setp.eq.f32 %p1, %f1, 0f3F800000;
 913     setp.eq.f32 %p2, %f2, 0f00000000;
 914     or.pred   %p3, %p1, %p2;
 915     @%p3 bra  BB0_1;
 916     bra.uni   BB0_2;
 917   BB0_1:
 918     mov.f32   %f110, 0f3F800000;
 919     st.global.f32   [%rl1], %f110;
 920     ret;
 921   BB0_2:                                  // %__nv_isnanf.exit.i
 922     abs.f32   %f4, %f1;
 923     setp.gtu.f32  %p4, %f4, 0f7F800000;
 924     @%p4 bra  BB0_4;
 925   // BB#3:                                // %__nv_isnanf.exit5.i
 926     abs.f32   %f5, %f2;
 927     setp.le.f32 %p5, %f5, 0f7F800000;
 928     @%p5 bra  BB0_5;
 929   BB0_4:                                  // %.critedge1.i
 930     add.f32   %f110, %f1, %f2;
 931     st.global.f32   [%rl1], %f110;
 932     ret;
 933   BB0_5:                                  // %__nv_isinff.exit.i
 934
 935     ...
 936
 937   BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
 938     mul.f32   %f90, %f107, 0f3FB8AA3B;
 939     cvt.rzi.f32.f32 %f91, %f90;
 940     mov.f32   %f92, 0fBF317200;
 941     fma.rn.f32  %f93, %f91, %f92, %f107;
 942     mov.f32   %f94, 0fB5BFBE8E;
 943     fma.rn.f32  %f95, %f91, %f94, %f93;
 944     mul.f32   %f89, %f95, 0f3FB8AA3B;
 945     // inline asm
 946     ex2.approx.ftz.f32 %f88,%f89;
 947     // inline asm
 948     add.f32   %f96, %f91, 0f00000000;
 949     ex2.approx.f32  %f97, %f96;
 950     mul.f32   %f98, %f88, %f97;
 951     setp.lt.f32 %p15, %f107, 0fC2D20000;
 952     selp.f32  %f99, 0f00000000, %f98, %p15;
 953     setp.gt.f32 %p16, %f107, 0f42D20000;
 954     selp.f32  %f110, 0f7F800000, %f99, %p16;
 955     setp.eq.f32 %p17, %f110, 0f7F800000;
 956     @%p17 bra   BB0_28;
 957   // BB#27:
 958     fma.rn.f32  %f110, %f110, %f108, %f110;
 959   BB0_28:                                 // %__internal_accurate_powf.exit.i
 960     setp.lt.f32 %p18, %f1, 0f00000000;
 961     setp.eq.f32 %p19, %f3, 0f3F800000;
 962     and.pred    %p20, %p18, %p19;
 963     @!%p20 bra  BB0_30;
 964     bra.uni   BB0_29;
 965   BB0_29:
 966     mov.b32    %r9, %f110;
 967     xor.b32   %r10, %r9, -2147483648;
 968     mov.b32    %f110, %r10;
 969   BB0_30:                                 // %__nv_powf.exit
 970     st.global.f32   [%rl1], %f110;
 971     ret;
 972   }
 973