android: shared_llvm.mk: add libLLVMOrcJIT to llvm_device_static_libraries

[android-x86/external-llvm.git] / docs / CompileCudaWithLLVM.rst
diff --git a/docs/CompileCudaWithLLVM.rst b/docs/CompileCudaWithLLVM.rst

index e3b9d87..6e181c8 100644 (file)
--- a/docs/CompileCudaWithLLVM.rst
+++ b/docs/CompileCudaWithLLVM.rst
@@ -1,6 +1,6 @@
-===================================
-Compiling CUDA C/C++ with LLVM
-===================================
+=========================
+Compiling CUDA with clang
+=========================
  
  .. contents::
     :local:
@@ -8,185 +8,553 @@ Compiling CUDA C/C++ with LLVM
  Introduction
  ============
  
-This document contains the user guides and the internals of compiling CUDA
-C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
-and developers who want to improve LLVM for GPUs. This document assumes a basic
-familiarity with CUDA. Information about CUDA programming can be found in the
+This document describes how to compile CUDA code with clang, and gives some
+details about LLVM and clang's CUDA implementations.
+
+This document assumes a basic familiarity with CUDA. Information about CUDA
+programming can be found in the
  `CUDA programming guide
  <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
  
-How to Build LLVM with CUDA Support
-===================================
+Compiling CUDA Code
+===================
+
+Prerequisites
+-------------
+
+CUDA is supported since llvm 3.9. Current release of clang (7.0.0) supports CUDA
+7.0 through 9.2. If you need support for CUDA 10, you will need to use clang
+built from r342924 or newer.
+
+Before you build CUDA code, you'll need to have installed the appropriate driver
+for your nvidia GPU and the CUDA SDK.  See `NVIDIA's CUDA installation guide
+<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
+details.  Note that clang `does not support
+<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
+many Linux package managers; you probably need to install CUDA in a single
+directory from NVIDIA's package.
+
+CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
+may not work and currently have no maintainers. Compilation with CUDA-9.x is
+`currently broken on Windows <https://bugs.llvm.org/show_bug.cgi?id=38811>`_.
+
+Invoking clang
+--------------
+
+Invoking clang for CUDA compilation works similarly to compiling regular C++.
+You just need to be aware of a few additional flags.
+
+You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
+program as a toy example.  Save it as ``axpy.cu``.  (Clang detects that you're
+compiling CUDA code by noticing that your filename ends with ``.cu``.
+Alternatively, you can pass ``-x cuda``.)
+
+To build and run, run the following commands, filling in the parts in angle
+brackets as described below:
+
+.. code-block:: console
+
+  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
+      -L<CUDA install path>/<lib64 or lib>             \
+      -lcudart_static -ldl -lrt -pthread
+  $ ./axpy
+  y[0] = 2
+  y[1] = 4
+  y[2] = 6
+  y[3] = 8
+
+On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get
+"CUDA driver version is insufficient for CUDA runtime version" errors when you
+run your program.
  
-The support for CUDA is still in progress and temporarily relies on `this patch
-<http://reviews.llvm.org/D14452>`_. Below is a quick summary of downloading and
-building LLVM with CUDA support. Consult the `Getting Started
-<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
-up LLVM.
+* ``<CUDA install path>`` -- the directory where you installed CUDA SDK.
+  Typically, ``/usr/local/cuda``.
  
-#. Checkout LLVM
+  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
+  pass e.g. ``-L/usr/local/cuda/lib``.  (In CUDA, the device code and host code
+  always have the same pointer widths, so if you're compiling 64-bit code for
+  the host, you're also compiling 64-bit code for the device.) Note that as of
+  v10.0 CUDA SDK `no longer supports compilation of 32-bit
+  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.
  
-   .. code-block:: console
+* ``<GPU arch>`` -- the `compute capability
+  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
+  want to run your program on a GPU with compute capability of 3.5, specify
+  ``--cuda-gpu-arch=sm_35``.
  
-     $ cd where-you-want-llvm-to-live
-     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
+  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
+  only ``sm_XX`` is currently supported.  However, clang always includes PTX in
+  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
+  forwards-compatible with e.g. ``sm_35`` GPUs.
  
-#. Checkout Clang
+  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
  
-   .. code-block:: console
+The `-L` and `-l` flags only need to be passed when linking.  When compiling,
+you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
+the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.
  
-     $ cd where-you-want-llvm-to-live
-     $ cd llvm/tools
-     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
+Flags that control numerical code
+---------------------------------
  
-#. Apply the temporary patch for CUDA support.
+If you're using GPUs, you probably care about making numerical code run fast.
+GPU hardware allows for more control over numerical operations than most CPUs,
+but this results in more compiler options for you to juggle.
  
-   If you have installed `Arcanist
-   <http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line>`_,
-   you can apply this patch using
+Flags you may wish to tweak include:
  
-   .. code-block:: console
+* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
+  compiling CUDA) Controls whether the compiler emits fused multiply-add
+  operations.
  
-     $ cd where-you-want-llvm-to-live
-     $ cd llvm/tools/clang
-     $ arc patch D14452
+  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
+    and add instructions.
+  * ``on``: fuse multiplies and adds within a single statement, but never
+    across statements (C11 semantics).  Prevent ptxas from fusing other
+    multiplies and adds.
+  * ``fast``: fuse multiplies and adds wherever profitable, even across
+    statements.  Doesn't prevent ptxas from fusing additional multiplies and
+    adds.
  
-   Otherwise, go to `its review page <http://reviews.llvm.org/D14452>`_,
-   download the raw diff, and apply it manually using
+  Fused multiply-add instructions can be much faster than the unfused
+  equivalents, but because the intermediate result in an fma is not rounded,
+  this flag can affect numerical code.
  
-   .. code-block:: console
+* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
+  floating point operations may flush `denormal
+  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
+  Operations on denormal numbers are often much slower than the same operations
+  on normal numbers.
  
-     $ cd where-you-want-llvm-to-live
-     $ cd llvm/tools/clang
-     $ patch -p0 < D14452.diff
+* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
+  compiler may emit calls to faster, approximate versions of transcendental
+  functions, instead of using the slower, fully IEEE-compliant versions.  For
+  example, this flag allows clang to emit the ptx ``sin.approx.f32``
+  instruction.
  
-#. Configure and build LLVM and Clang
+  This is implied by ``-ffast-math``.
  
-   .. code-block:: console
+Standard library support
+========================
  
-     $ cd where-you-want-llvm-to-live
-     $ mkdir build
-     $ cd build
-     $ cmake [options] ..
-     $ make
+In clang and nvcc, most of the C++ standard library is not supported on the
+device side.
  
-How to Compile CUDA C/C++ with LLVM
-===================================
+``<math.h>`` and ``<cmath>``
+----------------------------
  
-We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
-CUDA installation Guide
-<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
-you have not.
+In clang, ``math.h`` and ``cmath`` are available and `pass
+<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_
+`tests
+<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_
+adapted from libc++'s test suite.
  
-Suppose you want to compile and run the following CUDA program (``axpy.cu``)
-which multiplies a ``float`` array by a ``float`` scalar (AXPY).
+In nvcc ``math.h`` and ``cmath`` are mostly available.  Versions of ``::foof``
+in namespace std (e.g. ``std::sinf``) are not available, and where the standard
+calls for overloads that take integral arguments, these are usually not
+available.
  
  .. code-block:: c++
  
-  #include <helper_cuda.h> // for checkCudaErrors
+  #include <math.h>
+  #include <cmath.h>
  
-  #include <iostream>
+  // clang is OK with everything in this function.
+  __device__ void test() {
+    std::sin(0.); // nvcc - ok
+    std::sin(0);  // nvcc - error, because no std::sin(int) override is available.
+    sin(0);       // nvcc - same as above.
  
-  __global__ void axpy(float a, float* x, float* y) {
-    y[threadIdx.x] = a * x[threadIdx.x];
+    sinf(0.);       // nvcc - ok
+    std::sinf(0.);  // nvcc - no such function
    }
  
-  int main(int argc, char* argv[]) {
-    const int kDataLen = 4;
+``<std::complex>``
+------------------
+
+nvcc does not officially support ``std::complex``.  It's an error to use
+``std::complex`` in ``__device__`` code, but it often works in ``__host__
+__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
+below).  However, we have heard from implementers that it's possible to get
+into situations where nvcc will omit a call to an ``std::complex`` function,
+especially when compiling without optimizations.
+
+As of 2016-11-16, clang supports ``std::complex`` without these caveats.  It is
+tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
+newer than 2016-11-16.
+
+``<algorithm>``
+---------------
+
+In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
+``std::max``) become constexpr.  You can therefore use these in device code,
+when compiling with clang.
+
+Detecting clang vs NVCC from code
+=================================
+
+Although clang's CUDA implementation is largely compatible with NVCC's, you may
+still want to detect when you're compiling CUDA code specifically with clang.
+
+This is tricky, because NVCC may invoke clang as part of its own compilation
+process!  For example, NVCC uses the host compiler's preprocessor when
+compiling for device code, and that host compiler may in fact be clang.
+
+When clang is actually compiling CUDA code -- rather than being used as a
+subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
+defined only in device mode (but will be defined if NVCC is using clang as a
+preprocessor).  So you can use the following incantations to detect clang CUDA
+compilation, in host and device modes:
+
+.. code-block:: c++
+
+  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
+  // clang compiling CUDA code, host mode.
+  #endif
+
+  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
+  // clang compiling CUDA code, device mode.
+  #endif
  
-    float a = 2.0f;
-    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
-    float host_y[kDataLen];
+Both clang and nvcc define ``__CUDACC__`` during CUDA compilation.  You can
+detect NVCC specifically by looking for ``__NVCC__``.
  
-    // Copy input data to device.
-    float* device_x;
-    float* device_y;
-    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
-    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
-    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
-                               cudaMemcpyHostToDevice));
+Dialect Differences Between clang and nvcc
+==========================================
  
-    // Launch the kernel.
-    axpy<<<1, kDataLen>>>(a, device_x, device_y);
+There is no formal CUDA spec, and clang and nvcc speak slightly different
+dialects of the language.  Below, we describe some of the differences.
  
-    // Copy output data to host.
-    checkCudaErrors(cudaDeviceSynchronize());
-    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
-                               cudaMemcpyDeviceToHost));
+This section is painful; hopefully you can skip this section and live your life
+blissfully unaware.
  
-    // Print the results.
-    for (int i = 0; i < kDataLen; ++i) {
-      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
-    }
+Compilation Models
+------------------
  
-    checkCudaErrors(cudaDeviceReset());
-    return 0;
+Most of the differences between clang and nvcc stem from the different
+compilation models used by clang and nvcc.  nvcc uses *split compilation*,
+which works roughly as follows:
+
+ * Run a preprocessor over the input ``.cu`` file to split it into two source
+   files: ``H``, containing source code for the host, and ``D``, containing
+   source code for the device.
+
+ * For each GPU architecture ``arch`` that we're compiling for, do:
+
+   * Compile ``D`` using nvcc proper.  The result of this is a ``ptx`` file for
+     ``P_arch``.
+
+   * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
+     ``S_arch``, containing GPU machine code (SASS) for ``arch``.
+
+ * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
+   single "fat binary" file, ``F``.
+
+ * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
+   like).  ``F`` is packaged up into a header file which is force-included into
+   ``H``; nvcc generates code that calls into this header to e.g. launch
+   kernels.
+
+clang uses *merged parsing*.  This is similar to split compilation, except all
+of the host and device code is present and must be semantically-correct in both
+compilation steps.
+
+  * For each GPU architecture ``arch`` that we're compiling for, do:
+
+    * Compile the input ``.cu`` file for device, using clang.  ``__host__`` code
+      is parsed and must be semantically correct, even though we're not
+      generating code for the host at this time.
+
+      The output of this step is a ``ptx`` file ``P_arch``.
+
+    * Invoke ``ptxas`` to generate a SASS file, ``S_arch``.  Note that, unlike
+      nvcc, clang always generates SASS code.
+
+  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
+    single fat binary file, ``F``.
+
+  * Compile ``H`` using clang.  ``__device__`` code is parsed and must be
+    semantically correct, even though we're not generating code for the device
+    at this time.
+
+    ``F`` is passed to this compilation, and clang includes it in a special ELF
+    section, where it can be found by tools like ``cuobjdump``.
+
+(You may ask at this point, why does clang need to parse the input file
+multiple times?  Why not parse it just once, and then use the AST to generate
+code for the host and each device architecture?
+
+Unfortunately this can't work because we have to define different macros during
+host compilation and during device compilation for each GPU architecture.)
+
+clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
+need to decide at an early stage which declarations to keep and which to throw
+away.  But it has some consequences you should be aware of.
+
+Overloading Based on ``__host__`` and ``__device__`` Attributes
+---------------------------------------------------------------
+
+Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
+functions", and "``__host__ __device__`` functions", respectively.  Functions
+with no attributes behave the same as H.
+
+nvcc does not allow you to create H and D functions with the same signature:
+
+.. code-block:: c++
+
+  // nvcc: error - function "foo" has already been defined
+  __host__ void foo() {}
+  __device__ void foo() {}
+
+However, nvcc allows you to "overload" H and D functions with different
+signatures:
+
+.. code-block:: c++
+
+  // nvcc: no error
+  __host__ void foo(int) {}
+  __device__ void foo() {}
+
+In clang, the ``__host__`` and ``__device__`` attributes are part of a
+function's signature, and so it's legal to have H and D functions with
+(otherwise) the same signature:
+
+.. code-block:: c++
+
+  // clang: no error
+  __host__ void foo() {}
+  __device__ void foo() {}
+
+HD functions cannot be overloaded by H or D functions with the same signature:
+
+.. code-block:: c++
+
+  // nvcc: error - function "foo" has already been defined
+  // clang: error - redefinition of 'foo'
+  __host__ __device__ void foo() {}
+  __device__ void foo() {}
+
+  // nvcc: no error
+  // clang: no error
+  __host__ __device__ void bar(int) {}
+  __device__ void bar() {}
+
+When resolving an overloaded function, clang considers the host/device
+attributes of the caller and callee.  These are used as a tiebreaker during
+overload resolution.  See `IdentifyCUDAPreference
+<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
+but at a high level they are:
+
+ * D functions prefer to call other Ds.  HDs are given lower priority.
+
+ * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
+   (with equal priority).  HDs are given lower priority.
+
+ * HD functions prefer to call other HDs.
+
+   When compiling for device, HDs will call Ds with lower priority than HD, and
+   will call Hs with still lower priority.  If it's forced to call an H, the
+   program is malformed if we emit code for this HD function.  We call this the
+   "wrong-side rule", see example below.
+
+   The rules are symmetrical when compiling for host.
+
+Some examples:
+
+.. code-block:: c++
+
+   __host__ void foo();
+   __device__ void foo();
+
+   __host__ void bar();
+   __host__ __device__ void bar();
+
+   __host__ void test_host() {
+     foo();  // calls H overload
+     bar();  // calls H overload
+   }
+
+   __device__ void test_device() {
+     foo();  // calls D overload
+     bar();  // calls HD overload
+   }
+
+   __host__ __device__ void test_hd() {
+     foo();  // calls H overload when compiling for host, otherwise D overload
+     bar();  // always calls HD overload
+   }
+
+Wrong-side rule example:
+
+.. code-block:: c++
+
+  __host__ void host_only();
+
+  // We don't codegen inline functions unless they're referenced by a
+  // non-inline function.  inline_hd1() is called only from the host side, so
+  // does not generate an error.  inline_hd2() is called from the device side,
+  // so it generates an error.
+  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
+  inline __host__ __device__ void inline_hd2() { host_only(); }  // error
+
+  __host__ void host_fn() { inline_hd1(); }
+  __device__ void device_fn() { inline_hd2(); }
+
+  // This function is not inline, so it's always codegen'ed on both the host
+  // and the device.  Therefore, it generates an error.
+  __host__ __device__ void not_inline_hd() { host_only(); }
+
+For the purposes of the wrong-side rule, templated functions also behave like
+``inline`` functions: They aren't codegen'ed unless they're instantiated
+(usually as part of the process of invoking them).
+
+clang's behavior with respect to the wrong-side rule matches nvcc's, except
+nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
+``not_inline_hd``.  In its generated code, nvcc may omit ``not_inline_hd``'s
+call to ``host_only`` entirely, or it may try to generate code for
+``host_only`` on the device.  What you get seems to depend on whether or not
+the compiler chooses to inline ``host_only``.
+
+Member functions, including constructors, may be overloaded using H and D
+attributes.  However, destructors cannot be overloaded.
+
+Using a Different Class on Host/Device
+--------------------------------------
+
+Occasionally you may want to have a class with different host/device versions.
+
+If all of the class's members are the same on the host and device, you can just
+provide overloads for the class's member functions.
+
+However, if you want your class to have different members on host/device, you
+won't be able to provide working H and D overloads in both classes. In this
+case, clang is likely to be unhappy with you.
+
+.. code-block:: c++
+
+  #ifdef __CUDA_ARCH__
+  struct S {
+    __device__ void foo() { /* use device_only */ }
+    int device_only;
+  };
+  #else
+  struct S {
+    __host__ void foo() { /* use host_only */ }
+    double host_only;
+  };
+
+  __device__ void test() {
+    S s;
+    // clang generates an error here, because during host compilation, we
+    // have ifdef'ed away the __device__ overload of S::foo().  The __device__
+    // overload must be present *even during host compilation*.
+    S.foo();
    }
+  #endif
  
-The command line for compilation is similar to what you would use for C++.
+We posit that you don't really want to have classes with different members on H
+and D.  For example, if you were to pass one of these as a parameter to a
+kernel, it would have a different layout on H and D, so would not work
+properly.
  
-.. code-block:: console
+To make code like this compatible with clang, we recommend you separate it out
+into two classes.  If you need to write code that works on both host and
+device, consider writing an overloaded wrapper function that returns different
+types on host and device.
  
-  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
-  $ ./axpy
-  y[0] = 2
-  y[1] = 4
-  y[2] = 6
-  y[3] = 8
+.. code-block:: c++
  
-Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
-samples installed for this example. ``<CUDA install path>`` is the root
-directory where you installed CUDA SDK, typically ``/usr/local/cuda``.
+  struct HostS { ... };
+  struct DeviceS { ... };
  
-Optimizations
-=============
+  __host__ HostS MakeStruct() { return HostS(); }
+  __device__ DeviceS MakeStruct() { return DeviceS(); }
+
+  // Now host and device code can call MakeStruct().
  
-CPU and GPU have different design philosophies and architectures. For example, a
-typical CPU has branch prediction, out-of-order execution, and is superscalar,
-whereas a typical GPU has none of these. Due to such differences, an
-optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
+Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
+you to overload based on the H/D attributes.  Here's an idiom that works with
+both clang and nvcc:
+
+.. code-block:: c++
  
-LLVM performs several general and CUDA-specific optimizations for GPUs. The
-list below shows some of the more important optimizations for GPUs. Most of
-them have been upstreamed to ``lib/Transforms/Scalar`` and
-``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
-customizable target-independent optimization pipeline.
+  struct HostS { ... };
+  struct DeviceS { ... };
  
-* **Straight-line scalar optimizations**. These optimizations reduce redundancy
-  in straight-line code. Details can be found in the `design document for
-  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
+  #ifdef __NVCC__
+    #ifndef __CUDA_ARCH__
+      __host__ HostS MakeStruct() { return HostS(); }
+    #else
+      __device__ DeviceS MakeStruct() { return DeviceS(); }
+    #endif
+  #else
+    __host__ HostS MakeStruct() { return HostS(); }
+    __device__ DeviceS MakeStruct() { return DeviceS(); }
+  #endif
  
-* **Inferring memory spaces**. `This optimization
-  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
-  infers the memory space of an address so that the backend can emit faster
-  special loads and stores from it. Details can be found in the `design
-  document for memory space inference <https://goo.gl/5wH2Ct>`_.
+  // Now host and device code can call MakeStruct().
+
+Hopefully you don't have to do this sort of thing often.
+
+Optimizations
+=============
  
-* **Aggressive loop unrooling and function inlining**. Loop unrolling and
+Modern CPUs and GPUs are architecturally quite different, so code that's fast
+on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
+LLVM to make it generate good GPU code.  Among these changes are:
+
+* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
+  reduce redundancy within straight-line code.
+
+* `Aggressive speculative execution
+  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
+  -- This is mainly for promoting straight-line scalar optimizations, which are
+  most effective on code along dominator paths.
+
+* `Memory space inference
+  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
+  In PTX, we can operate on pointers that are in a paricular "address space"
+  (global, shared, constant, or local), or we can operate on pointers in the
+  "generic" address space, which can point to anything.  Operations in a
+  non-generic address space are faster, but pointers in CUDA are not explicitly
+  annotated with their address space, so it's up to LLVM to infer it where
+  possible.
+
+* `Bypassing 64-bit divides
+  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
+  This was an existing optimization that we enabled for the PTX backend.
+
+  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
+  Many of the 64-bit divides in our benchmarks have a divisor and dividend
+  which fit in 32-bits at runtime. This optimization provides a fast path for
+  this common case.
+
+* Aggressive loop unrooling and function inlining -- Loop unrolling and
    function inlining need to be more aggressive for GPUs than for CPUs because
-  control flow transfer in GPU is more expensive. They also promote other
-  optimizations such as constant propagation and SROA which sometimes speed up
-  code by over 10x. An empirical inline threshold for GPUs is 1100. This
-  configuration has yet to be upstreamed with a target-specific optimization
-  pipeline. LLVM also provides `loop unrolling pragmas
+  control flow transfer in GPU is more expensive. More aggressive unrolling and
+  inlining also promote other optimizations, such as constant propagation and
+  SROA, which sometimes speed up code by over 10x.
+
+  (Programmers can force unrolling and inline using clang's `loop unrolling pragmas
    <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
-  and ``__attribute__((always_inline))`` for programmers to force unrolling and
-  inling.
-
-* **Aggressive speculative execution**. `This transformation
-  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
-  mainly for promoting straight-line scalar optimizations which are most
-  effective on code along dominator paths.
-
-* **Memory-space alias analysis**. `This alias analysis
-  <http://llvm.org/docs/NVPTXUsage.html>`_ infers that two pointers in different
-  special memory spaces do not alias. It has yet to be integrated to the new
-  alias analysis infrastructure; the new infrastructure does not run
-  target-specific alias analysis.
-
-* **Bypassing 64-bit divides**. `An existing optimization
-  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
-  enabled in the NVPTX backend. 64-bit integer divides are much slower than
-  32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
-  divides in our benchmarks have a divisor and dividend which fit in 32-bits at
-  runtime. This optimization provides a fast path for this common case.
+  and ``__attribute__((always_inline))``.)
+
+Publication
+===========
+
+The team at Google published a paper in CGO 2016 detailing the optimizations
+they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
+The relevant tools are now just vanilla clang/LLVM.
+
+| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
+| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
+| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
+|
+| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
+|
+| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_
+
+Obtaining Help
+==============
+
+To obtain help on LLVM in general and its CUDA support, see `the LLVM
+community <http://llvm.org/docs/#mailing-lists>`_.