move .rst to docs

author Reed Kotler <rkotlerimgtec@gmail.com>

Thu, 7 Jan 2016 01:22:21 +0000 (17:22 -0800)

committer Jim Stichnoth <stichnot@chromium.org>

Thu, 7 Jan 2016 01:22:21 +0000 (17:22 -0800)
diff --git a/.gitignore b/.gitignore
index 8cda9be..6c5964a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,5 +10,5 @@
  # Ignore specific patterns at the top-level directory
  /pnacl-sz
  /build/
-/docs/
+/docs/html/
  /crosstest/Output/
diff --git a/DESIGN.rst b/DESIGN.rst
deleted file mode 100644
index 3d324f64ba81798b5a7a038b24f265bbe292254b..0000000000000000000000000000000000000000
--- a/DESIGN.rst
+++ /dev/null
@@ -1,1494 +0,0 @@
-Design of the Subzero fast code generator
-=========================================
-
-Introduction
-------------
-
-The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
-compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
-PNaCl toolchain to compile their application to architecture-neutral PNaCl
-bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
-possible.  The ``.pexe`` file is downloaded to the user's browser where the
-PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
-`sandboxed
-<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
-native code.  The translator uses architecture-specific optimizations as much as
-practical to generate good native code.
-
-The native code can be cached by the browser to avoid repeating translation on
-future page loads.  However, first-time user experience is hampered by long
-translation times.  The LLVM-based PNaCl translator is pretty slow, even when
-using ``-O0`` to minimize optimizations, so delays are especially noticeable on
-slow browser platforms such as ARM-based Chromebooks.
-
-Translator slowness can be mitigated or hidden in a number of ways.
-
-- Parallel translation.  However, slow machines where this matters most, e.g.
-  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
-  and are likely to have less memory available for multiple translation threads
-  to use.
-
-- Streaming translation, i.e. start translating as soon as the download starts.
-  This doesn't help much when translation speed is 10× slower than download
-  speed, or the ``.pexe`` file is already cached while the translated binary was
-  flushed from the cache.
-
-- Arrange the web page such that translation is done in parallel with
-  downloading large assets.
-
-- Arrange the web page to distract the user with `cat videos
-  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
-  progress.
-
-Or, improve translator performance to something more reasonable.
-
-This document describes Subzero's attempt to improve translation speed by an
-order of magnitude while rivaling LLVM's code quality.  Subzero does this
-through minimal IR layering, lean data structures and passes, and a careful
-selection of fast optimization passes.  It has two optimization recipes: full
-optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
-following (described in more detail below):
-
-+-----------------------------------+-----------------------+
-| O2 recipe                         | Om1 recipe            |
-+===================================+=======================+
-| Parse .pexe file                  | Parse .pexe file      |
-+-----------------------------------+-----------------------+
-| Loop nest analysis                |                       |
-+-----------------------------------+-----------------------+
-| Address mode inference            |                       |
-+-----------------------------------+-----------------------+
-| Read-modify-write (RMW) transform |                       |
-+-----------------------------------+-----------------------+
-| Basic liveness analysis           |                       |
-+-----------------------------------+-----------------------+
-| Load optimization                 |                       |
-+-----------------------------------+-----------------------+
-|                                   | Phi lowering (simple) |
-+-----------------------------------+-----------------------+
-| Target lowering                   | Target lowering       |
-+-----------------------------------+-----------------------+
-| Full liveness analysis            |                       |
-+-----------------------------------+-----------------------+
-| Register allocation               |                       |
-+-----------------------------------+-----------------------+
-| Phi lowering (advanced)           |                       |
-+-----------------------------------+-----------------------+
-| Post-phi register allocation      |                       |
-+-----------------------------------+-----------------------+
-| Branch optimization               |                       |
-+-----------------------------------+-----------------------+
-| Code emission                     | Code emission         |
-+-----------------------------------+-----------------------+
-
-Goals
-=====
-
-Translation speed
------------------
-
-We'd like to be able to translate a ``.pexe`` file as fast as download speed.
-Any faster is in a sense wasted effort.  Download speed varies greatly, but
-we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
-slow machine.  We observe a 3× single-thread performance difference between A15
-and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
-``.pexe`` file could be compressed to 50% on the web server using gzip transport
-compression, so we set the translation speed goal to 6 MB/sec on the high-end
-Xeon workstation.
-
-Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
-⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
-
-In other words, Subzero's goal is to improve over LLVM's translation speed by
-10×.
-
-Code quality
-------------
-
-Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
-code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
-average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
-
-It's important to note that the quality of Subzero-generated code depends on
-target-neutral optimizations and simplifications being run beforehand in the
-developer environment.  The ``.pexe`` file reflects these optimizations.  For
-example, Subzero assumes that the basic blocks are ordered topologically where
-possible (which makes liveness analysis converge fastest), and Subzero does not
-do any function inlining because it should already have been done.
-
-Translator size
----------------
-
-The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
-We think 1 MB is a more reasonable size -- especially for such a component that
-is distributed to a billion Chrome users.  Thus we target a 10× reduction in
-binary size.
-
-For development, Subzero can be built for all target architectures, and all
-debugging and diagnostic options enabled.  For a smaller translator, we restrict
-to a single target architecture, and define a ``MINIMAL`` build where
-unnecessary features are compiled out.
-
-Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
-include directories, which have little impact on translator size.  It also uses
-some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
-with little size impact.  In non-``MINIMAL`` builds, the translator size is much
-larger due to including code for parsing text-format bitcode files and forming
-LLVM IR.
-
-Memory footprint
-----------------
-
-The current LLVM-based translator suffers from an issue in which some
-function-specific data has to be retained in memory until all translation
-completes, and therefore the memory footprint grows without bound.  Large
-``.pexe`` files can lead to the translator process holding hundreds of MB of
-memory by the end.  The translator runs in a separate process, so this memory
-growth doesn't *directly* affect other processes, but it does dirty the physical
-memory and contributes to a perception of bloat and sometimes a reality of
-out-of-memory tab killing, especially noticeable on weaker systems.
-
-Subzero should maintain a stable memory footprint throughout translation.  It's
-not really practical to set a specific limit, because there is not really a
-practical limit on a single function's size, but the footprint should be
-"reasonable" and be proportional to the largest input function size, not the
-total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
-inexorable memory growth.  (We use ASAN builds to test for leaks.)
-
-Multithreaded translation
--------------------------
-
-It should be practical to translate different functions concurrently and see
-good scalability.  Some locking may be needed, such as accessing output buffers
-or constant pools, but that should be fairly minimal.  In contrast, LLVM was
-only designed for module-level parallelism, and as such, the PNaCl translator
-internally splits a ``.pexe`` file into several modules for concurrent
-translation.  All output needs to be deterministic regardless of the level of
-multithreading, i.e. functions and data should always be output in the same
-order.
-
-Target architectures
---------------------
-
-Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
-targets include ARM64 and MIPS64, though these targets lack NaCl support
-including a sandbox model or a validator.
-
-The first implementation is for x86-32, because it was expected to be
-particularly challenging, and thus more likely to draw out any design problems
-early:
-
-- There are a number of special cases, asymmetries, and warts in the x86
-  instruction set.
-
-- Complex addressing modes may be leveraged for better code quality.
-
-- 64-bit integer operations have to be lowered into longer sequences of 32-bit
-  operations.
-
-- Paucity of physical registers may reveal code quality issues early in the
-  design.
-
-Detailed design
-===============
-
-Intermediate representation - ICE
----------------------------------
-
-Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
-IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
-representation of global variables and initializers, and a set of functions.
-Each function contains a list of basic blocks, and each basic block contains a
-list of instructions.  Instructions that operate on stack and register variables
-do so using static single assignment (SSA) form.
-
-The ``.pexe`` file is translated one function at a time (or in parallel by
-multiple translation threads).  The recipe for optimization passes depends on
-the specific target and optimization level, and is described in detail below.
-Global variables (types and initializers) are simply and directly translated to
-object code, without any meaningful attempts at optimization.
-
-A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
-Its key contents include:
-
-- A list of ``CfgNode`` pointers, generally held in topological order.
-
-- A list of ``Variable`` pointers corresponding to local variables used in the
-  function plus compiler-generated temporaries.
-
-A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
-include:
-
-- A linear list of instructions, in the same style as LLVM.  The last
-  instruction of the list is always a terminator instruction: branch, switch,
-  return, unreachable.
-
-- A list of Phi instructions, also in the same style as LLVM.  They are held as
-  a linear list for convenience, though per Phi semantics, they are executed "in
-  parallel" without dependencies on each other.
-
-- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
-  another list for outgoing edges.
-
-- The node's unique, 0-based index into the CFG's node list.
-
-An instruction is represented by the ``Ice::Inst`` class.  Its key contents
-include:
-
-- A list of source operands.
-
-- Its destination variable, if the instruction produces a result in an
-  ``Ice::Variable``.
-
-- A bitvector indicating which variables' live ranges this instruction ends.
-  This is computed during liveness analysis.
-
-Instruction kinds are divided into high-level ICE instructions and low-level
-ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
-instruction kinds.  Each target architecture implementation extends the
-instruction space with its own set of low-level instructions.  Generally,
-low-level instructions correspond to individual machine instructions.  The
-high-level ICE instruction space includes a few additional instruction kinds
-that are not part of LLVM but are generally useful (e.g., an Assignment
-instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
-instructions for sandboxing).
-
-Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
-ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
-<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
-the following:
-
-- Alloca: allocate data on the stack
-
-- Arithmetic: binary operations of the form ``A = B op C``
-
-- Br: conditional or unconditional branch
-
-- Call: function call
-
-- Cast: unary type-conversion operations
-
-- ExtractElement: extract a scalar element from a vector-type value
-
-- Fcmp: floating-point comparison
-
-- Icmp: integer comparison
-
-- IntrinsicCall: call a known intrinsic
-
-- InsertElement: insert a scalar element into a vector-type value
-
-- Load: load a value from memory
-
-- Phi: implement the SSA phi node
-
-- Ret: return from the function
-
-- Select: essentially the C language operation of the form ``X = C ? Y : Z``
-
-- Store: store a value into memory
-
-- Switch: generalized branch to multiple possible locations
-
-- Unreachable: indicate that this portion of the code is unreachable
-
-The additional high-level ICE instructions are the following:
-
-- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
-  instructions to non-SSA assignments, before lowering to machine code.
-
-- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
-  common across all targets and so they are elevated to the high-level
-  instruction set.
-
-- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
-  liveness analysis, elevated to the high level because they are used by all
-  targets.  They are described in more detail at the end of this section.
-
-- JumpTable: this represents the result of switch optimization analysis, where
-  some switch instructions may use jump tables instead of cascading
-  compare/branches.
-
-An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
-operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
-include scalar integer constants, scalar floating point constants, Undef (an
-unspecified constant of a particular scalar or vector type), and symbol
-constants (essentially addresses of globals).  Note that the PNaCl ABI does not
-include vector-type constants besides Undef, and as such, Subzero (so far) has
-no reason to represent vector-type constants internally.  A variable represents
-a value allocated on the stack (though not including alloca-derived storage).
-Among other things, a variable holds its unique, 0-based index into the CFG's
-variable list.
-
-Each target can extend the ``Constant`` and ``Variable`` classes for its own
-needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
-x86 ``MemOperand`` that encodes a base register, an index register, an index
-register shift amount, and a constant offset.
-
-Register allocation and liveness analysis are restricted to Variable operands.
-Because of the importance of register allocation to code quality, and the
-translation-time cost of liveness analysis, Variable operands get some special
-treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
-across all the Variables of an instruction.  An instruction holds a list of
-operands, but an operand may contain 0, 1, or more Variables.  As such, the
-``Operand`` class specially holds a list of Variables contained within, for
-quick access.
-
-A Subzero transformation pass may work by deleting an existing instruction and
-replacing it with zero or more new instructions.  Instead of actually deleting
-the existing instruction, we generally mark it as deleted and insert the new
-instructions right after the deleted instruction.  When printing the IR for
-debugging, this is a big help because it makes it much clearer how the
-non-deleted instructions came about.
-
-Subzero has a few special instructions to help with liveness analysis
-consistency.
-
-- The FakeDef instruction gives a fake definition of some variable.  For
-  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
-  but an ICE instruction can represent only one destination variable.  This is
-  similar for multiply instructions, and for function calls that return a 64-bit
-  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
-  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
-
-- The FakeUse instruction registers a use of a variable, typically to prevent an
-  earlier assignment to that variable from being dead-code eliminated.  For
-  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
-  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
-  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
-  dead-code eliminate the first instruction.
-
-- The FakeKill instruction is added after a call instruction, and is a quick way
-  of indicating that caller-save registers are invalidated.
-
-Pexe parsing
-------------
-
-Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
-parses the ``.pexe`` file function by function, ultimately constructing an ICE
-CFG for each function.  After a function is parsed, its CFG is handed off to the
-translation phase.  The bitcode parser also parses global initializer data and
-hands it off to be translated to data sections in the object file.
-
-Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
-be used to parse a module into LLVM IR (though very slowly relative to Subzero
-native parsing).  Then we iterate across the LLVM IR and construct high-level
-ICE, handing off each CFG to the translation phase.
-
-Overview of lowering
---------------------
-
-In general, translation goes like this:
-
-- Parse the next function from the ``.pexe`` file and construct a CFG consisting
-  of high-level ICE.
-
-- Do analysis passes and transformation passes on the high-level ICE, as
-  desired.
-
-- Lower each high-level ICE instruction into a sequence of zero or more
-  low-level ICE instructions.  Each high-level instruction is generally lowered
-  independently, though the target lowering is allowed to look ahead in the
-  CfgNode's instruction list if desired.
-
-- Do more analysis and transformation passes on the low-level ICE, as desired.
-
-- Assemble the low-level CFG into an ELF object file (alternatively, a textual
-  assembly file that is later assembled by some external tool).
-
-- Repeat for all functions, and also produce object code for data such as global
-  initializers and internal constant pools.
-
-Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
-the intention is to apply all available optimizations to get the best code
-quality (though the initial code quality goal is measured against LLVM's ``O0``
-code quality).  For ``Om1``, the intention is to apply as few optimizations as
-possible and produce code as quickly as possible, accepting poor code quality.
-``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
-"sub-zero".
-
-High-level debuggability of generated code is so far not a design requirement.
-Subzero doesn't really do transformations that would obfuscate debugging; the
-main thing might be that register allocation (including stack slot coalescing
-for stack-allocated variables whose live ranges don't overlap) may render a
-variable's value unobtainable after its live range ends.  This would not be an
-issue for ``Om1`` since it doesn't register-allocate program-level variables,
-nor does it coalesce stack slots.  That said, fully supporting debuggability
-would require a few additions:
-
-- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
-  propagates global symbol names, local variable names, and function-internal
-  label names that are present in the ``.pexe`` file.  This would allow a
-  debugger to map addresses back to symbols in the ``.pexe`` file.
-
-- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
-  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
-  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
-  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
-
-- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
-  the toolchain would need to be modified to preserve it.
-
-Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
-produces code with one third the code quality.  ``Om1`` is good for testing and
-debugging -- during translation, it tends to expose errors in the basic lowering
-that might otherwise have been hidden by the register allocator or other
-optimization passes.  It also helps determine whether a code correctness problem
-is a fundamental problem in the basic lowering, or an error in another
-optimization pass.
-
-The implementation of target lowering also controls the recipe of passes used
-for ``Om1`` and ``O2`` translation.  For example, address mode inference may
-only be relevant for x86.
-
-Lowering strategy
------------------
-
-The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
-each high-level instruction down to a sequence of low-level target-specific
-instructions, in a largely context-free setting.  That is, each high-level
-instruction conceptually has a simple template expansion into low-level
-instructions, and lowering can in theory be done in any order.  This may sound
-like a small effort, but quite a large number of templates may be needed because
-of the number of PNaCl types and instruction variants.  Furthermore, there may
-be optimized templates, e.g. to take advantage of operator commutativity (for
-example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
-similar to other template-based approaches in fast code generation or
-interpretation, though some decisions are deferred until after some global
-analysis passes, mostly related to register allocation, stack slot assignment,
-and specific choice of instruction variant and addressing mode.
-
-The key idea for a lowering template is to produce valid low-level instructions
-that are guaranteed to meet address mode and other structural requirements of
-the instruction set.  For example, on x86, the source operand of an integer
-store instruction must be an immediate or a physical register; a shift
-instruction's shift amount must be an immediate or in register ``%cl``; a
-function's integer return value is in ``%eax``; most x86 instructions are
-two-operand, in contrast to corresponding three-operand high-level instructions;
-etc.
-
-Because target lowering runs before register allocation, there is no way to know
-whether a given ``Ice::Variable`` operand lives on the stack or in a physical
-register.  When the low-level instruction calls for a physical register operand,
-the target lowering can create an infinite-weight Variable.  This tells the
-register allocator to assign infinite weight when making decisions, effectively
-guaranteeing some physical register.  Variables can also be pre-colored to a
-specific physical register (``%cl`` in the shift example above), which also gives
-infinite weight.
-
-To illustrate, consider a high-level arithmetic instruction on 32-bit integer
-operands::
-
-    A = B + C
-
-X86 target lowering might produce the following::
-
-    T.inf = B  // mov instruction
-    T.inf += C // add instruction
-    A = T.inf  // mov instruction
-
-Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
-physical register, the three lowered instructions are all encodable regardless
-of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
-whether ``A`` is a physical register or in memory.
-
-In this example, ``A`` must be a Variable and one may be tempted to simplify the
-lowering sequence by setting ``A`` as infinite-weight and using::
-
-        A = B  // mov instruction
-        A += C // add instruction
-
-This has two problems.  First, if the original instruction was actually ``A =
-B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
-register applies throughout ``A``'s entire live range.  This is probably not
-what is intended, and may ultimately lead to a failure to allocate a register
-for an infinite-weight variable.
-
-This style of lowering leads to many temporaries being generated, so in ``O2``
-mode, we rely on the register allocator to clean things up.  For example, in the
-example above, if ``B`` ends up getting a physical register and its live range
-ends at this instruction, the register allocator is likely to reuse that
-register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
-copy, which is removed as an emission-time peephole optimization.
-
-O2 lowering
------------
-
-Currently, the ``O2`` lowering recipe is the following:
-
-- Loop nest analysis
-
-- Address mode inference
-
-- Read-modify-write (RMW) transformation
-
-- Basic liveness analysis
-
-- Load optimization
-
-- Target lowering
-
-- Full liveness analysis
-
-- Register allocation
-
-- Phi instruction lowering (advanced)
-
-- Post-phi lowering register allocation
-
-- Branch optimization
-
-These passes are described in more detail below.
-
-Om1 lowering
-------------
-
-Currently, the ``Om1`` lowering recipe is the following:
-
-- Phi instruction lowering (simple)
-
-- Target lowering
-
-- Register allocation (infinite-weight and pre-colored only)
-
-Optimization passes
--------------------
-
-Liveness analysis
-^^^^^^^^^^^^^^^^^
-
-Liveness analysis is a standard dataflow optimization, implemented as follows.
-For each node (basic block), its live-out set is computed as the union of the
-live-in sets of its successor nodes.  Then the node's instructions are processed
-in reverse order, updating the live set, until the beginning of the node is
-reached, and the node's live-in set is recorded.  If this iteration has changed
-the node's live-in set, the node's predecessors are marked for reprocessing.
-This continues until no more nodes need reprocessing.  If nodes are processed in
-reverse topological order, the number of iterations over the CFG is generally
-equal to the maximum loop nest depth.
-
-To implement this, each node records its live-in and live-out sets, initialized
-to the empty set.  Each instruction records which of its Variables' live ranges
-end in that instruction, initialized to the empty set.  A side effect of
-liveness analysis is dead instruction elimination.  Each instruction can be
-marked as tentatively dead, and after the algorithm converges, the tentatively
-dead instructions are permanently deleted.
-
-Optionally, after this liveness analysis completes, we can do live range
-construction, in which we calculate the live range of each variable in terms of
-instruction numbers.  A live range is represented as a union of segments, where
-the segment endpoints are instruction numbers.  Instruction numbers are required
-to be unique across the CFG, and monotonically increasing within a basic block.
-As a union of segments, live ranges can contain "gaps" and are therefore
-precise.  Because of SSA properties, a variable's live range can start at most
-once in a basic block, and can end at most once in a basic block.  Liveness
-analysis keeps track of which variable/instruction tuples begin live ranges and
-end live ranges, and combined with live-in and live-out sets, we can efficiently
-build up live ranges of all variables across all basic blocks.
-
-A lot of care is taken to try to make liveness analysis fast and efficient.
-Because of the lowering strategy, the number of variables is generally
-proportional to the number of instructions, leading to an O(N^2) complexity
-algorithm if implemented naively.  To improve things based on sparsity, we note
-that most variables are "local" and referenced in at most one basic block (in
-contrast to the "global" variables with multi-block usage), and therefore cannot
-be live across basic blocks.  Therefore, the live-in and live-out sets,
-typically represented as bit vectors, can be limited to the set of global
-variables, and the intra-block liveness bit vector can be compacted to hold the
-global variables plus the local variables for that block.
-
-Register allocation
-^^^^^^^^^^^^^^^^^^^
-
-Subzero implements a simple linear-scan register allocator, based on the
-allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
-Register Allocation in the Context of SSA Form and Register Constraints
-<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
-several nice features:
-
-- Live ranges are represented as unions of segments, as described above, rather
-  than a single start/end tuple.
-
-- It allows pre-coloring of variables with specific physical registers.
-
-- It applies equally well to pre-lowered Phi instructions.
-
-The paper suggests an approach of aggressively coalescing variables across Phi
-instructions (i.e., trying to force Phi source and destination variables to have
-the same register assignment), but we reject that in favor of the more natural
-preference mechanism described below.
-
-We enhance the algorithm in the paper with the capability of automatic inference
-of register preference, and with the capability of allowing overlapping live
-ranges to safely share the same register in certain circumstances.  If we are
-considering register allocation for variable ``A``, and ``A`` has a single
-defining instruction ``A=B+C``, then the preferred register for ``A``, if
-available, would be the register assigned to ``B`` or ``C``, if any, provided
-that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
-way we infer a good register preference for ``A``.
-
-We allow overlapping live ranges to get the same register in certain cases.
-Suppose a high-level instruction like::
-
-    A = unary_op(B)
-
-has been target-lowered like::
-
-    T.inf = B
-    A = unary_op(T.inf)
-
-Further, assume that ``B``'s live range continues beyond this instruction
-sequence, and that ``B`` has already been assigned some register.  Normally, we
-might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
-turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
-different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
-guaranteed to be in a register, so in theory these overlapping live ranges could
-safely have the same register.  Our implementation allows this overlap as long
-as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
-modified within ``T.inf``'s live range.
-
-Subzero's register allocator can be run in 3 configurations.
-
-- Normal mode.  All Variables are considered for register allocation.  It
-  requires full liveness analysis and live range construction as a prerequisite.
-  This is used by ``O2`` lowering.
-
-- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
-  All other Variables are stack-allocated.  It does not require liveness
-  analysis; instead, it quickly scans the instructions and records first
-  definitions and last uses of all relevant Variables, using that to construct a
-  single-segment live range.  Although this includes most of the Variables, the
-  live ranges are mostly simple, short, and rarely overlapping, which the
-  register allocator handles efficiently.  This is used by ``Om1`` lowering.
-
-- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
-  register allocation, and may result in new infinite-weight Variables that need
-  registers.  One would like to just run something like minimal mode to assign
-  registers to the new Variables while respecting existing register allocation
-  decisions.  However, it sometimes happens that there are no free registers.
-  In this case, some register needs to be forcibly spilled to the stack and
-  temporarily reassigned to the new Variable, and reloaded at the end of the new
-  Variable's live range.  The register must be one that has no explicit
-  references during the Variable's live range.  Since Subzero currently doesn't
-  track def/use chains (though it does record the CfgNode where a Variable is
-  defined), we just do a brute-force search across the CfgNode's instruction
-  list for the instruction numbers of interest.  This situation happens very
-  rarely, so there's little point for now in improving its performance.
-
-The basic linear-scan algorithm may, as it proceeds, rescind an early register
-allocation decision, leaving that Variable to be stack-allocated.  In some of
-these cases, the Variable could have been given a different register without
-conflict, but by this time it's too late.  The literature recognizes
-this situation and describes "second-chance bin-packing", which Subzero can do.
-We can rerun the register allocator in a mode that respects existing register
-allocation decisions, and sometimes it finds new non-conflicting opportunities.
-In fact, we can repeatedly run the register allocator until convergence.
-Unfortunately, in the current implementation, these subsequent register
-allocation passes end up being extremely expensive.  This is because of the
-treatment of the "unhandled pre-colored" Variable set, which is normally very
-small but ends up being quite large on subsequent passes.  Its performance can
-probably be made acceptable with a better choice of data structures, but for now
-this second-chance mechanism is disabled.
-
-Future work is to implement LLVM's `Greedy
-<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
-register allocator as a replacement for the basic linear-scan algorithm, given
-LLVM's experience with its improvement in code quality.  (The blog post claims
-that the Greedy allocator also improved maintainability because a lot of hacks
-could be removed, but Subzero is probably not yet to that level of hacks, and is
-less likely to see that particular benefit.)
-
-Basic phi lowering
-^^^^^^^^^^^^^^^^^^
-
-The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
-implements it).  Consider this example::
-
-    L1:
-      ...
-      br L3
-    L2:
-      ...
-      br L3
-    L3:
-      A = phi [B, L1], [C, L2]
-      X = phi [Y, L1], [Z, L2]
-
-For each destination of a phi instruction, we can create a temporary and insert
-the temporary's assignment at the end of the predecessor block::
-
-    L1:
-      ...
-      A' = B
-      X' = Y
-      br L3
-    L2:
-      ...
-      A' = C
-      X' = Z
-      br L3
-    L3:
-      A = A'
-      X = X'
-
-This transformation is very simple and reliable.  It can be done before target
-lowering and register allocation, and it easily avoids the classic lost-copy and
-related problems.  ``Om1`` lowering uses this strategy.
-
-However, it has the disadvantage of initializing temporaries even for branches
-not taken, though that could be mitigated by splitting non-critical edges and
-putting assignments in the edge-split nodes.  Another problem is that without
-extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
-specific ordering even though phi semantics are that the assignments are
-parallel or unordered.  This sometimes imposes false live range overlaps and
-leads to poorer register allocation.
-
-Advanced phi lowering
-^^^^^^^^^^^^^^^^^^^^^
-
-``O2`` lowering defers phi lowering until after register allocation to avoid the
-problem of false live range overlaps.  It works as follows.  We split each
-incoming edge and move the (parallel) phi assignments into the split nodes.  We
-linearize each set of assignments by finding a safe, topological ordering of the
-assignments, respecting register assignments as well.  For example::
-
-    A = B
-    X = Y
-
-Normally these assignments could be executed in either order, but if ``B`` and
-``X`` are assigned the same physical register, we would want to use the above
-ordering.  Dependency cycles are broken by introducing a temporary.  For
-example::
-
-    A = B
-    B = A
-
-Here, a temporary breaks the cycle::
-
-    t = A
-    A = B
-    B = t
-
-Finally, we use the existing target lowering to lower the assignments in this
-basic block, and once that is done for all basic blocks, we run the post-phi
-variant of register allocation on the edge-split basic blocks.
-
-When computing a topological order, we try to first schedule assignments whose
-source has a physical register, and last schedule assignments whose destination
-has a physical register.  This helps reduce register pressure.
-
-X86 address mode inference
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-We try to take advantage of the x86 addressing mode that includes a base
-register, an index register, an index register scale amount, and an immediate
-offset.  We do this through simple pattern matching.  Starting with a load or
-store instruction where the address is a variable, we initialize the base
-register to that variable, and look up the instruction where that variable is
-defined.  If that is an add instruction of two variables and the index register
-hasn't been set, we replace the base and index register with those two
-variables.  If instead it is an add instruction of a variable and a constant, we
-replace the base register with the variable and add the constant to the
-immediate offset.
-
-There are several more patterns that can be matched.  This pattern matching
-continues on the load or store instruction until no more matches are found.
-Because a program typically has few load and store instructions (not to be
-confused with instructions that manipulate stack variables), this address mode
-inference pass is fast.
-
-X86 read-modify-write inference
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-A reasonably common bitcode pattern is a non-atomic update of a memory
-location::
-
-    x = load addr
-    y = add x, 1
-    store y, addr
-
-On x86, with good register allocation, the Subzero passes described above
-generate code no better than this::
-
-    mov [%ebx], %eax
-    add $1, %eax
-    mov %eax, [%ebx]
-
-However, x86 allows for this kind of code::
-
-    add $1, [%ebx]
-
-which requires fewer instructions, but perhaps more importantly, requires fewer
-physical registers.
-
-It's also important to note that this transformation only makes sense if the
-store instruction ends ``x``'s live range.
-
-Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
-opportunities via simple pattern matching.  The only problem is that it is run
-before liveness analysis, which is needed to determine whether ``x``'s live
-range ends after the RMW.  Since liveness analysis is one of the most expensive
-passes, it's not attractive to run it an extra time just for RMW analysis.
-Instead, we essentially generate both the RMW and the non-RMW versions, and then
-during lowering, the RMW version deletes itself if it finds ``x`` still live.
-
-X86 compare-branch inference
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-In the LLVM instruction set, the compare/branch pattern works like this::
-
-    cond = icmp eq a, b
-    br cond, target
-
-The result of the icmp instruction is a single bit, and a conditional branch
-tests that bit.  By contrast, most target architectures use this pattern::
-
-    cmp a, b  // implicitly sets various bits of FLAGS register
-    br eq, target  // branch on a particular FLAGS bit
-
-A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
-``cond`` and conditionally branches.  Subzero has a pass that identifies
-boolean-based operations like this and folds them into a single
-compare/branch-like operation.  It is set up for more than just cmp/br though.
-Boolean producers include icmp (integer compare), fcmp (floating-point compare),
-and trunc (integer truncation when the destination has bool type).  Boolean
-consumers include branch, select (the ternary operator from the C language), and
-sign-extend and zero-extend when the source has bool type.
-
-Sandboxing
-^^^^^^^^^^
-
-Native Client's sandbox model uses software fault isolation (SFI) to provide
-safety when running untrusted code in a browser or other environment.  Subzero
-implements Native Client's `sandboxing
-<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
-to enable Subzero-translated executables to be run inside Chrome.  Subzero also
-provides a fairly simple framework for investigating alternative sandbox models
-or other restrictions on the sandbox model.
-
-Sandboxing in Subzero is not actually implemented as a separate pass, but is
-integrated into lowering and assembly.
-
-- Indirect branches, including the ret instruction, are masked to a bundle
-  boundary and bundle-locked.
-
-- Call instructions are aligned to the end of the bundle so that the return
-  address is bundle-aligned.
-
-- Indirect branch targets, including function entry and targets in a switch
-  statement jump table, are bundle-aligned.
-
-- The intrinsic for reading the thread pointer is inlined appropriately.
-
-- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
-  base register.  We reduce the aggressiveness of address mode inference to
-  leave room for the sandbox base register during lowering.  There are no memory
-  sandboxing changes for x86-32.
-
-Code emission
--------------
-
-Subzero's integrated assembler is derived from Dart's `assembler code
-<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
-that iterates through the low-level ICE instructions and invokes the relevant
-assembler functions.  Placeholders are added for later fixup of branch target
-offsets.  (Backward branches use short offsets if possible; forward branches
-generally use long offsets unless it is an intra-block branch of "known" short
-length.)  The assembler emits into a staging buffer.  Once emission into the
-staging buffer for a function is complete, the data is emitted to the output
-file as an ELF object file, and metadata such as relocations, symbol table, and
-string table, are accumulated for emission at the end.  Global data initializers
-are emitted similarly.  A key point is that at this point, the staging buffer
-can be deallocated, and only a minimum of data needs to be held until the end.
-
-As a debugging alternative, Subzero can emit textual assembly code which can
-then be run through an external assembler.  This is of course super slow, but
-quite valuable when bringing up a new target.
-
-As another debugging option, the staging buffer can be emitted as textual
-assembly, primarily in the form of ".byte" lines.  This allows the assembler to
-be tested separately from the ELF-related code.
-
-Memory management
------------------
-
-Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
-LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
-allocated from the arena are never actually freed; instead, when the CFG
-translation completes and the CFG is deleted, the entire arena memory is
-reclaimed at once.  This style of allocation works well in an environment like a
-compiler where there are distinct phases with only easily-identifiable objects
-living across phases.  It frees the developer from having to manage object
-deletion, and it amortizes deletion costs across just a single arena deletion at
-the end of the phase.  Furthermore, it helps scalability by allocating entirely
-from thread-local memory pools, and minimizing global locking of the heap.
-
-Instructions are probably the most heavily allocated complex class in Subzero.
-We represent an instruction list as an intrusive doubly linked list, allocate
-all instructions from the ``CfgLocalAllocator``, and we make sure each
-instruction subclass is basically `POD
-<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
-trivial destructor.  This way, when the CFG is finished, we don't need to
-individually deallocate every instruction.  We do the same for Variables, which
-is probably the second most popular complex class.
-
-There are some situations where passes need to use some `STL container class
-<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
-``CfgLocalAllocator`` as the container allocator if this is needed.
-
-Multithreaded translation
--------------------------
-
-Subzero is designed to be able to translate functions in parallel.  With the
-``-threads=N`` command-line option, there is a 3-stage producer-consumer
-pipeline:
-
-- A single thread parses the ``.pexe`` file and produces a sequence of work
-  units.  A work unit can be either a fully constructed CFG, or a set of global
-  initializers.  The work unit includes its sequence number denoting its parse
-  order.  Each work unit is added to the translation queue.
-
-- There are N translation threads that draw work units from the translation
-  queue and lower them into assembler buffers.  Each assembler buffer is added
-  to the emitter queue, tagged with its sequence number.  The CFG and its
-  ``CfgLocalAllocator`` are disposed of at this point.
-
-- A single thread draws assembler buffers from the emitter queue and appends to
-  the output file.  It uses the sequence numbers to reintegrate the assembler
-  buffers according to the original parse order, such that output order is
-  always deterministic.
-
-This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
-for a total of ``N+2`` execution threads, taking the parser and emitter threads
-into account.  For the special case of ``N=0``, execution is entirely sequential
--- the same thread parses, translates, and emits, one function at a time.  This
-is useful for performance measurements.
-
-Ideally, we would like to get near-linear scalability as the number of
-translation threads increases.  We expect that ``-threads=1`` should be slightly
-faster than ``-threads=0`` as the small amount of time spent parsing and
-emitting is done largely in parallel with translation.  With perfect
-scalability, we would see ``-threads=N`` translating ``N`` times as fast as
-``-threads=1``, up until the point where parsing or emitting becomes the
-bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
-performance would become a bottleneck and efficiency might peak at, say, 75%.
-
-Currently, parsing takes about 11% of total sequential time.  If translation
-ever becomes so fast and scalable that parsing becomes a bottleneck, it should
-be possible to make parsing multithreaded as well.
-
-Internally, all shared, mutable data is held in the GlobalContext object, and
-access to each field is guarded by a mutex.
-
-Security
---------
-
-Subzero includes a number of security features in the generated code, as well as
-in the Subzero translator itself, which run on top of the existing Native Client
-sandbox as well as Chrome's OS-level sandbox.
-
-Sandboxed translator
-^^^^^^^^^^^^^^^^^^^^
-
-When running inside the browser, the Subzero translator executes as sandboxed,
-untrusted code that is initially checked by the validator, just like the
-LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
-more or less secure than the translator it replaces, from the point of view of
-the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
-was designed from the start with security in mind, so one expects fewer attacker
-opportunities here.
-
-Code diversification
-^^^^^^^^^^^^^^^^^^^^
-
-`Return-oriented programming
-<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
-now-common technique for starting with e.g. a known buffer overflow situation
-and launching it into a deeper exploit.  The attacker scans the executable
-looking for ROP gadgets, which are short sequences of code that happen to load
-known values into known registers and then return.  An attacker who manages to
-overwrite parts of the stack can overwrite it with carefully chosen return
-addresses such that certain ROP gadgets are effectively chained together to set
-up the register state as desired, finally returning to some code that manages to
-do something nasty based on those register values.
-
-If there is a popular ``.pexe`` with a large install base, the attacker could
-run Subzero on it and scan the executable for suitable ROP gadgets to use as
-part of a potential exploit.  Note that if the trusted validator is working
-correctly, these ROP gadgets are limited to starting at a bundle boundary and
-cannot use the trick of finding a gadget that happens to begin inside another
-instruction.  All the same, gadgets with these constraints still exist and the
-attacker has access to them.  This is the attack model we focus most on --
-protecting the user against misuse of a "trusted" developer's application, as
-opposed to mischief from a malicious ``.pexe`` file.
-
-Subzero can mitigate these attacks to some degree through code diversification.
-Specifically, we can apply some randomness to the code generation that makes ROP
-gadgets less predictable.  This randomness can have some compile-time cost, and
-it can affect the code quality; and some diversifications may be more effective
-than others.  A more detailed treatment of hardening techniques may be found in
-the Matasano report "`Attacking Clientside JIT Compilers
-<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
-
-To evaluate diversification effectiveness, we use a third-party ROP gadget
-finder and limit its results to bundle-aligned addresses.  For a given
-diversification technique, we run it with a number of different random seeds,
-find ROP gadgets for each version, and determine how persistent each ROP gadget
-is across the different versions.  A gadget is persistent if the same gadget is
-found at the same code address.  The best diversifications are ones with low
-gadget persistence rates.
-
-Subzero implements 7 different diversification techniques.  Below is a
-discussion of each technique, its effectiveness, and its cost.  The discussions
-of cost and effectiveness are for a single diversification technique; the
-translation-time costs for multiple techniques are additive, but the effects of
-multiple techniques on code quality and effectiveness are not yet known.
-
-In Subzero's implementation, each randomization is "repeatable" in a sense.
-Each pass that includes a randomization option gets its own private instance of
-a random number generator (RNG).  The RNG is seeded with a combination of a
-global seed, the pass ID, and the function's sequence number.  The global seed
-is designed to be different across runs (perhaps based on the current time), but
-for debugging, the global seed can be set to a specific value and the results
-will be repeatable.
-
-Subzero-generated code is subject to diversification once per translation, and
-then Chrome caches the diversified binary for subsequent executions.  An
-attacker may attempt to run the binary multiple times hoping for
-higher-probability combinations of ROP gadgets.  When the attacker guesses
-wrong, a likely outcome is an application crash.  Chrome throttles creation of
-crashy processes which reduces the likelihood of the attacker eventually gaining
-a foothold.
-
-Constant blinding
-~~~~~~~~~~~~~~~~~
-
-Here, we prevent attackers from controlling large immediates in the text
-(executable) section.  A random cookie is generated for each function, and if
-the constant exceeds a specified threshold, the constant is obfuscated with the
-cookie and equivalent code is generated.  For example, instead of this x86
-instruction::
-
-    mov $0x11223344, <%Reg/Mem>
-
-the following code might be generated::
-
-    mov $(0x11223344+Cookie), %temp
-    lea -Cookie(%temp), %temp
-    mov %temp, <%Reg/Mem>
-
-The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
-prevent unintended effects on the flags register.
-
-This transformation has almost no effect on translation time, and about 1%
-impact on code quality, depending on the threshold chosen.  It does little to
-reduce gadget persistence, but it does remove a lot of potential opportunities
-to construct intra-instruction ROP gadgets (which an attacker could use only if
-a validator bug were discovered, since the Native Client sandbox and associated
-validator force returns and other indirect branches to be to bundle-aligned
-addresses).
-
-Constant pooling
-~~~~~~~~~~~~~~~~
-
-This is similar to constant blinding, in that large immediates are removed from
-the text section.  In this case, each unique constant above the threshold is
-stored in a read-only data section and the constant is accessed via a memory
-load.  For the above example, the following code might be generated::
-
-    mov $Label$1, %temp
-    mov %temp, <%Reg/Mem>
-
-This has a similarly small impact on translation time and ROP gadget
-persistence, and a smaller (better) impact on code quality.  This is because it
-uses fewer instructions, and in some cases good register allocation leads to no
-increase in instruction count.  Note that this still gives an attacker some
-limited amount of control over some text section values, unless we randomize the
-constant pool layout.
-
-Static data reordering
-~~~~~~~~~~~~~~~~~~~~~~
-
-This transformation limits the attacker's ability to control bits in global data
-address references.  It simply permutes the order in memory of global variables
-and internal constant pool entries.  For the constant pool, we only permute
-within a type (i.e., emit a randomized list of ints, followed by a randomized
-list of floats, etc.) to maintain good packing in the face of alignment
-constraints.
-
-As might be expected, this has no impact on code quality, translation time, or
-ROP gadget persistence (though as above, it limits opportunities for
-intra-instruction ROP gadgets with a broken validator).
-
-Basic block reordering
-~~~~~~~~~~~~~~~~~~~~~~
-
-Here, we randomize the order of basic blocks within a function, with the
-constraint that we still want to maintain a topological order as much as
-possible, to avoid making the code too branchy.
-
-This has no impact on code quality, and about 1% impact on translation time, due
-to a separate pass to recompute layout.  It ends up having a huge effect on ROP
-gadget persistence, tied for best with nop insertion, reducing ROP gadget
-persistence to less than 5%.
-
-Function reordering
-~~~~~~~~~~~~~~~~~~~
-
-Here, we permute the order in which functions are emitted, primarily to shift
-ROP gadgets to less predictable locations.  It may also change call address
-offsets in case the attacker was trying to control that offset in the code.
-
-To control latency and memory footprint, we don't arbitrarily permute functions.
-Instead, for some relatively small value of N, we queue up N assembler buffers,
-and then emit the N functions in random order, and repeat until all functions
-are emitted.
-
-Function reordering has no impact on translation time or code quality.
-Measurements indicate that it reduces ROP gadget persistence to about 15%.
-
-Nop insertion
-~~~~~~~~~~~~~
-
-This diversification randomly adds a nop instruction after each regular
-instruction, with some probability.  Nop instructions of different lengths may
-be selected.  Nop instructions are never added inside a bundle_lock region.
-Note that when sandboxing is enabled, nop instructions are already being added
-for bundle alignment, so the diversification nop instructions may simply be
-taking the place of alignment nop instructions, though distributed differently
-through the bundle.
-
-In Subzero's current implementation, nop insertion adds 3-5% to the
-translation time, but this is probably because it is implemented as a separate
-pass that adds actual nop instructions to the IR.  The overhead would probably
-be a lot less if it were integrated into the assembler pass.  The code quality
-is also reduced by 3-5%, making nop insertion the most expensive of the
-diversification techniques.
-
-Nop insertion is very effective in reducing ROP gadget persistence, at the same
-level as basic block reordering (less than 5%).  But given nop insertion's
-impact on translation time and code quality, one would most likely prefer to use
-basic block reordering instead (though the combined effects of the different
-diversification techniques have not yet been studied).
-
-Register allocation randomization
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In this diversification, the register allocator tries to make different but
-mostly functionally equivalent choices, while maintaining stable code quality.
-
-A naive approach would be the following.  Whenever the allocator has more than
-one choice for assigning a register, choose randomly among those options.  And
-whenever there are no registers available and there is a tie for the
-lowest-weight variable, randomly select one of the lowest-weight variables to
-evict.  Because of the one-pass nature of the linear-scan algorithm, this
-randomization strategy can have a large impact on which variables are ultimately
-assigned registers, with a corresponding large impact on code quality.
-
-Instead, we choose an approach that tries to keep code quality stable regardless
-of the random seed.  We partition the set of physical registers into equivalence
-classes.  If a register is pre-colored in the function (i.e., referenced
-explicitly by name), it forms its own equivalence class.  The remaining
-registers are partitioned according to their combination of attributes such as
-integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
-callee-saved, etc.  Each equivalence class is randomly permuted, and the
-complete permutation is applied to the final register assignments.
-
-Register randomization reduces ROP gadget persistence to about 10% on average,
-though there tends to be fairly high variance across functions and applications.
-This probably has to do with the set of restrictions in the x86-32 instruction
-set and ABI, such as few general-purpose registers, ``%eax`` used for return
-values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
-intended, register randomization has no impact on code quality, and a slight
-(0.5%) impact on translation time due to an extra scan over the variables to
-identify pre-colored registers.
-
-Fuzzing
-^^^^^^^
-
-We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
-combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
-<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
-find and fix cases where Subzero crashes or otherwise ungracefully fails on
-unexpected inputs, and to do so automatically over a large range of unexpected
-inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
-of an attacker exploiting these bugs.
-
-Most of the problems found so far are ones most appropriately handled in the
-parser.  However, there have been a couple that have identified problems in the
-lowering, or otherwise inappropriately triggered assertion failures and fatal
-errors.  We continue to dig into this area.
-
-Future security work
-^^^^^^^^^^^^^^^^^^^^
-
-Subzero is well-positioned to explore other future security enhancements, e.g.:
-
-- Tightening the Native Client sandbox.  ABI changes, such as the previous work
-  on `hiding the sandbox base address
-  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
-  in x86-64, are easy to experiment with in Subzero.
-
-- Making the executable code section read-only.  This would prevent a PNaCl
-  application from inspecting its own binary and trying to find ROP gadgets even
-  after code diversification has been performed.  It may still be susceptible to
-  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
-  security is still improved overall.
-
-- Instruction selection diversification.  It may be possible to lower a given
-  instruction in several largely equivalent ways, which gives more opportunities
-  for code randomization.
-
-Chrome integration
-------------------
-
-Currently Subzero is available in Chrome for the x86-32 architecture, but under
-a flag.  When the flag is enabled, Subzero is used when the `manifest file
-<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
-linking to the ``.pexe`` file specifies the ``O0`` optimization level.
-
-The next step is to remove the flag, i.e. invoke Subzero as the only translator
-for ``O0``-specified manifest files.
-
-Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
-case Subzero could be used for all PNaCl translation.
-
-Command line options
---------------------
-
-Subzero has a number of command-line options for debugging and diagnostics.
-Among the more interesting are the following.
-
-- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
-  diagnostic output, with various levels of detail after each pass.  Instruction
-  numbers can be printed or suppressed.  Deleted instructions can be printed or
-  suppressed (they are retained in the instruction list, as discussed earlier,
-  because they can help explain how lower-level instructions originated).
-  Liveness information can be printed when available.  Details of register
-  allocation can be printed as register allocator decisions are made.  And more.
-
-- Running Subzero with any level of verbosity produces an enormous amount of
-  output.  When debugging a single function, it helps to cut that output down:
-  the ``-verbose-focus`` flag suppresses verbose output for every function
-  except the specified one.
-
-- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
-  at exit.  Timing markers can be placed in the Subzero source code to demarcate
-  logical operations or passes of interest.  Basic timing information plus
-  call-stack type timing information is printed at the end.
-
-- Along with ``-timing``, the user can instead get a report on the overall
-  translation time for each function, to help focus on timing outliers.  Also,
-  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
-  instead of aggregating pass timing across all functions.
-
-- The ``-szstats`` option reports various statistics on each function, such as
-  stack frame size, static instruction count, etc.  It may be helpful to track
-  these stats over time as Subzero is improved, as an approximate measure of
-  code quality.
-
-- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
-  output, annotates the assembly output with register-focused liveness
-  information.  In particular, each basic block is annotated with which
-  registers are live-in and live-out, and each instruction is annotated with
-  which registers' and stack locations' live ranges end at that instruction.
-  This is really useful when studying the generated code to find opportunities
-  for code quality improvements.
-
-Testing and debugging
----------------------
-
-LLVM lit tests
-^^^^^^^^^^^^^^
-
-For basic testing, Subzero uses LLVM's `lit
-<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
-have a suite of hundreds of small functions where we test for particular
-assembly code patterns across different target architectures.
-
-Cross tests
-^^^^^^^^^^^
-
-Unfortunately, the lit tests don't do a great job of precisely testing the
-correctness of the output.  Much better are the cross tests, which are execution
-tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
-variety of interesting inputs.  Each cross test consists of a set of C, C++,
-and/or low-level bitcode files.  The C and C++ source files are compiled down to
-bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
-Subzero mangles global symbol names with a special prefix to avoid duplicate
-symbol errors.  A driver program invokes both versions on a large set of
-interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
-differ.  Cross tests turn out to be an excellent way of testing the basic
-lowering patterns, but they are less useful for testing more global things like
-liveness analysis and register allocation.
-
-Bisection debugging
-^^^^^^^^^^^^^^^^^^^
-
-Sometimes with a new application, Subzero will end up producing incorrect code
-that either crashes at runtime or otherwise produces the wrong results.  When
-this happens, we need to narrow it down to a single function (or small set of
-functions) that yields incorrect behavior.  For this, we have a bisection
-debugging framework.  Here, we initially translate the entire application once
-with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
-selectively weaken symbols based on a whitelist or blacklist provided on the
-command line.  The two object files can then be linked together without link
-errors, with the desired version of each method "winning".  Then the binary is
-tested, and bisection proceeds based on whether the binary produces correct
-output.
-
-When the bisection completes, we are left with a minimal set of
-Subzero-translated functions that cause the failure.  Usually it is a single
-function, though sometimes it might require a combination of several functions
-to cause a failure; this may be due to an incorrect call ABI, for example.
-However, Murphy's Law implies that the single failing function is enormous and
-impractical to debug.  In that case, we can restart the bisection, explicitly
-blacklisting the enormous function, and try to find another candidate to debug.
-(Future work is to automate this to find all minimal sets of functions, so that
-debugging can focus on the simplest example.)
-
-Fuzz testing
-^^^^^^^^^^^^
-
-As described above, we try to find internal Subzero bugs using fuzz testing
-techniques.
-
-Sanitizers
-^^^^^^^^^^
-
-Subzero can be built with `AddressSanitizer
-<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
-<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
-done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
-multithreading has been simple enough that TSan hasn't found any bugs, but ASan
-has found at least one memory leak which was subsequently fixed.
-`UndefinedBehaviorSanitizer
-<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
-(UBSan) support is in progress.  `Control flow integrity sanitization
-<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
-consideration.
-
-Current status
-==============
-
-Target architectures
---------------------
-
-Subzero is currently more or less complete for the x86-32 target.  It has been
-refactored and extended to handle x86-64 as well, and that is mostly complete at
-this point.
-
-ARM32 work is in progress.  It currently lacks the testing level of x86, at
-least in part because Subzero's register allocator needs modifications to handle
-ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit
-register is actually a gang of two consecutive and aligned 32-bit registers, and
-a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
-ARM64 work has not started; when it does, it will be native-only since the
-Native Client sandbox model, validator, and other tools have never been defined.
-
-An external contributor is adding MIPS support, largely by following the
-ARM work.
-
-Translator performance
-----------------------
-
-Single-threaded translation speed is currently about 5× the ``pnacl-llc``
-translation speed.  For a large ``.pexe`` file, the time breaks down as:
-
-- 11% for parsing and initial IR building
-
-- 4% for emitting to /dev/null
-
-- 27% for liveness analysis (two liveness passes plus live range construction)
-
-- 15% for linear-scan register allocation
-
-- 9% for basic lowering
-
-- 10% for advanced phi lowering
-
-- ~11% for other minor analysis
-
-- ~10% measurement overhead to acquire these numbers
-
-Some improvements could undoubtedly be made, but it will be hard to increase the
-speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality.  With
-``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
-``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
-
-Code quality
-------------
-
-Measured across 16 components of spec2k, Subzero's code quality is uniformly
-better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
-between ``pnacl-llc`` ``-O0`` and ``-O2``.
-
-Translator size
----------------
-
-When built in MINIMAL mode, the x86-64 native translator size for the x86-32
-target is about 700 KB, not including the size of functions referenced in
-dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
-MB, and it is statically linked and also includes nop padding for bundling as
-well as indirect branch masking.
-
-Translator memory footprint
----------------------------
-
-It's hard to draw firm conclusions about memory footprint, since the footprint
-is at least proportional to the input function size, and there is no real limit
-on the size of functions in the ``.pexe`` file.
-
-That said, we looked at the memory footprint over time as Subzero translated
-``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
-disposal.  One of LLVM's libraries that Subzero uses can report the current
-malloc heap usage.  With single-threaded translation, Subzero tends to hover
-around 15 MB of memory usage.  There are a couple of monstrous functions where
-Subzero grows to around 100 MB, but then it drops back down after those
-functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
-larger throughout translation, reaching several hundred MB by the time it
-completes.
-
-It's a bit more interesting when we enable multithreaded translation.  When
-there are N translation threads, Subzero implements a policy that limits the
-size of the translation queue to N entries -- if it is "full" when the parser
-tries to add a new CFG, the parser blocks until one of the translation threads
-removes a CFG.  This means the number of in-memory CFGs can (and generally does)
-reach 2*N+1, and so the memory footprint rises in proportion to the number of
-threads.  Adding to the pressure is the observation that the monstrous functions
-also take proportionally longer time to translate, so there's a good chance many
-of the monstrous functions will be active at the same time with multithreaded
-translation.  As a result, for N=32, Subzero's memory footprint peaks at about
-260 MB, but drops back down as the large functions finish translating.
-
-If this peak memory size becomes a problem, it might be possible for the parser
-to resequence the functions to try to spread out the larger functions, or to
-throttle the translation queue to prevent too many in-flight large functions.
-It may also be possible to throttle based on memory pressure signaling from
-Chrome.
-
-Translator scalability
-----------------------
-
-Currently scalability is "not very good".  Multiple translation threads lead to
-faster translation, but not to the degree desired.  We haven't dug in to
-investigate yet.
-
-There are a few areas to investigate.  First, there may be contention on the
-constant pool, which all threads access, and which requires locked access even
-for reading.  This could be mitigated by keeping a CFG-local cache of the most
-common constants.
-
-Second, there may be contention on memory allocation.  While almost all CFG
-objects are allocated from the CFG-local allocator, some passes use temporary
-STL containers that use the default allocator, which may require global locking.
-This could be mitigated by switching these to the CFG-local allocator.
-
-Third, multithreading may make the default allocator strategy more expensive.
-In a single-threaded environment, a pass will allocate its containers, run the
-pass, and deallocate the containers.  This results in stack-like allocation
-behavior and makes the heap free list easier to manage, with less heap
-fragmentation.  But when multithreading is added, the allocations and
-deallocations become much less stack-like, making allocation and deallocation
-operations individually more expensive.  Again, this could be mitigated by
-switching these to the CFG-local allocator.
diff --git a/DESIGN.rst b/DESIGN.rst

new file mode 120000 (symlink)

index 0000000000000000000000000000000000000000..d0f71625a6c3fc1d9e380507aad1014a26ec045a
--- /dev/null
+++ b/DESIGN.rst
@@ -0,0 +1 @@
+docs/DESIGN.rst
+\ No newline at end of file
diff --git a/Makefile.standalone b/Makefile.standalone

index 64f47f6..402a9d0 100644 (file)
--- a/Makefile.standalone
+++ b/Makefile.standalone
@@ -517,4 +517,4 @@ clean:
         rm -rf pnacl-sz *.o $(OBJDIR) $(SB_OBJDIR) build/pnacl-sz.bloat.json
  
  clean-all: clean
-       rm -rf build/ docs/
+       rm -rf build/ docs/html
diff --git a/README.rst b/README.rst

deleted file mode 100644 (file)

index 0257558a17fd0de0c4b8b3361ee92c51c5c33f58..0000000000000000000000000000000000000000
--- a/README.rst
+++ /dev/null
@@ -1,192 +0,0 @@
-Subzero - Fast code generator for PNaCl bitcode
-===============================================
-
-Design
-------
-
-See the accompanying DESIGN.rst file for a more detailed technical overview of
-Subzero.
-
-Building
---------
-
-Subzero is set up to be built within the Native Client tree.  Follow the
-`Developing PNaCl
-<https://sites.google.com/a/chromium.org/dev/nativeclient/pnacl/developing-pnacl>`_
-instructions, in particular the section on building PNaCl sources.  This will
-prepare the necessary external headers and libraries that Subzero needs.
-Checking out the Native Client project also gets the pre-built clang and LLVM
-tools in ``native_client/../third_party/llvm-build/Release+Asserts/bin`` which
-are used for building Subzero.
-
-The Subzero source is in ``native_client/toolchain_build/src/subzero``.  From
-within that directory, ``git checkout master && git pull`` to get the latest
-version of Subzero source code.
-
-The Makefile is designed to be used as part of the higher level LLVM build
-system.  To build manually, use the ``Makefile.standalone``.  There are several
-build configurations from the command line::
-
-    make -f Makefile.standalone
-    make -f Makefile.standalone DEBUG=1
-    make -f Makefile.standalone NOASSERT=1
-    make -f Makefile.standalone DEBUG=1 NOASSERT=1
-    make -f Makefile.standalone MINIMAL=1
-    make -f Makefile.standalone ASAN=1
-    make -f Makefile.standalone TSAN=1
-
-``DEBUG=1`` builds without optimizations and is good when running the translator
-inside a debugger.  ``NOASSERT=1`` disables assertions and is the preferred
-configuration for performance testing the translator.  ``MINIMAL=1`` attempts to
-minimize the size of the translator by compiling out everything unnecessary.
-``ASAN=1`` enables AddressSanitizer, and ``TSAN=1`` enables ThreadSanitizer.
-
-The result of the ``make`` command is the target ``pnacl-sz`` in the current
-directory.
-
-``pnacl-sz``
-------------
-
-The ``pnacl-sz`` program parses a pexe or an LLVM bitcode file and translates it
-into ICE (Subzero's intermediate representation).  It then invokes the ICE
-translate method to lower it to target-specific machine code, optionally dumping
-the intermediate representation at various stages of the translation.
-
-The program can be run as follows::
-
-    ../pnacl-sz ./path/to/<file>.pexe
-    ../pnacl-sz ./tests_lit/pnacl-sz_tests/<file>.ll
-
-At this time, ``pnacl-sz`` accepts a number of arguments, including the
-following:
-
-    ``-help`` -- Show available arguments and possible values.  (Note: this
-    unfortunately also pulls in some LLVM-specific options that are reported but
-    that Subzero doesn't use.)
-
-    ``-notranslate`` -- Suppress the ICE translation phase, which is useful if
-    ICE is missing some support.
-
-    ``-target=<TARGET>`` -- Set the target architecture.  The default is x8632.
-    Future targets include x8664, arm32, and arm64.
-
-    ``-filetype=obj|asm|iasm`` -- Select the output file type.  ``obj`` is a
-    native ELF file, ``asm`` is a textual assembly file, and ``iasm`` is a
-    low-level textual assembly file demonstrating the integrated assembler.
-
-    ``-O<LEVEL>`` -- Set the optimization level.  Valid levels are ``2``, ``1``,
-    ``0``, ``-1``, and ``m1``.  Levels ``-1`` and ``m1`` are synonyms, and
-    represent the minimum optimization and worst code quality, but fastest code
-    generation.
-
-    ``-verbose=<list>`` -- Set verbosity flags.  This argument allows a
-    comma-separated list of values.  The default is ``none``, and the value
-    ``inst,pred`` will roughly match the .ll bitcode file.  Of particular use
-    are ``all``, ``most``, and ``none``.
-
-    ``-o <FILE>`` -- Set the assembly output file name.  Default is stdout.
-
-    ``-log <FILE>`` -- Set the file name for diagnostic output (whose level is
-    controlled by ``-verbose``).  Default is stdout.
-
-    ``-timing`` -- Dump some pass timing information after translating the input
-    file.
-
-Running the test suite
-----------------------
-
-Subzero uses the LLVM ``lit`` testing tool for part of its test suite, which
-lives in ``tests_lit``. To execute the test suite, first build Subzero, and then
-run::
-
-    make -f Makefile.standalone check-lit
-
-There is also a suite of cross tests in the ``crosstest`` directory.  A cross
-test takes a test bitcode file implementing some unit tests, and translates it
-twice, once with Subzero and once with LLVM's known-good ``llc`` translator.
-The Subzero-translated symbols are specially mangled to avoid multiple
-definition errors from the linker.  Both translated versions are linked together
-with a driver program that calls each version of each unit test with a variety
-of interesting inputs and compares the results for equality.  The cross tests
-are currently invoked by running::
-
-    make -f Makefile.standalone check-xtest
-
-Similarly, there is a suite of unit tests::
-
-    make -f Makefile.standalone check-unit
-
-A convenient way to run the lit, cross, and unit tests is::
-
-    make -f Makefile.standalone check
-
-Assembling ``pnacl-sz`` output as needed
-----------------------------------------
-
-``pnacl-sz`` can now produce a native ELF binary using ``-filetype=obj``.
-
-``pnacl-sz`` can also produce textual assembly code in a structure suitable for
-input to ``llvm-mc``, using ``-filetype=asm`` or ``-filetype=iasm``.  An object
-file can then be produced using the command::
-
-    llvm-mc -triple=i686 -filetype=obj -o=MyObj.o
-
-Building a translated binary
-----------------------------
-
-There is a helper script, ``pydir/szbuild.py``, that translates a finalized pexe
-into a fully linked executable.  Run it with ``-help`` for extensive
-documentation.
-
-By default, ``szbuild.py`` builds an executable using only Subzero translation,
-but it can also be used to produce hybrid Subzero/``llc`` binaries (``llc`` is
-the name of the LLVM translator) for bisection-based debugging.  In bisection
-debugging mode, the pexe is translated using both Subzero and ``llc``, and the
-resulting object files are combined into a single executable using symbol
-weakening and other linker tricks to control which Subzero symbols and which
-``llc`` symbols take precedence.  This is controlled by the ``-include`` and
-``-exclude`` arguments.  These can be used to rapidly find a single function
-that Subzero translates incorrectly leading to incorrect output.
-
-There is another helper script, ``pydir/szbuild_spec2k.py``, that runs
-``szbuild.py`` on one or more components of the Spec2K suite.  This assumes that
-Spec2K is set up in the usual place in the Native Client tree, and the finalized
-pexe files have been built.  (Note: for working with Spec2K and other pexes,
-it's helpful to finalize the pexe using ``--no-strip-syms``, to preserve the
-original function and global variable names.)
-
-Status
-------
-
-Subzero currently fully supports the x86-32 architecture, for both native and
-Native Client sandboxing modes.  The x86-64 architecture is also supported in
-native mode only, and only for the x32 flavor due to the fact that pointers and
-32-bit integers are indistinguishable in PNaCl bitcode.  Sandboxing support for
-x86-64 is in progress.  ARM and MIPS support is in progress.  Two optimization
-levels, ``-Om1`` and ``-O2``, are implemented.
-
-The ``-Om1`` configuration is designed to be the simplest and fastest possible,
-with a minimal set of passes and transformations.
-
-* Simple Phi lowering before target lowering, by generating temporaries and
-  adding assignments to the end of predecessor blocks.
-
-* Simple register allocation limited to pre-colored or infinite-weight
-  Variables.
-
-The ``-O2`` configuration is designed to use all optimizations available and
-produce the best code.
-
-* Address mode inference to leverage the complex x86 addressing modes.
-
-* Compare/branch fusing based on liveness/last-use analysis.
-
-* Global, linear-scan register allocation.
-
-* Advanced phi lowering after target lowering and global register allocation,
-  via edge splitting, topological sorting of the parallel moves, and final local
-  register allocation.
-
-* Stack slot coalescing to reduce frame size.
-
-* Branch optimization to reduce the number of branches to the following block.
diff --git a/README.rst b/README.rst

new file mode 120000 (symlink)

index 0000000000000000000000000000000000000000..cffceba7c58191d8215f5d98a17575ef3a50bc06
--- /dev/null
+++ b/README.rst
@@ -0,0 +1 @@
+docs/README.rst
+\ No newline at end of file
diff --git a/ALLOCATION.rst b/docs/ALLOCATION.rst

similarity index 100%

rename from ALLOCATION.rst

rename to docs/ALLOCATION.rst
diff --git a/docs/DESIGN.rst b/docs/DESIGN.rst

new file mode 100644 (file)

index 0000000..3d324f6
--- /dev/null
+++ b/docs/DESIGN.rst
@@ -0,0 +1,1494 @@
+Design of the Subzero fast code generator
+=========================================
+
+Introduction
+------------
+
+The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
+compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
+PNaCl toolchain to compile their application to architecture-neutral PNaCl
+bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
+possible.  The ``.pexe`` file is downloaded to the user's browser where the
+PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
+`sandboxed
+<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
+native code.  The translator uses architecture-specific optimizations as much as
+practical to generate good native code.
+
+The native code can be cached by the browser to avoid repeating translation on
+future page loads.  However, first-time user experience is hampered by long
+translation times.  The LLVM-based PNaCl translator is pretty slow, even when
+using ``-O0`` to minimize optimizations, so delays are especially noticeable on
+slow browser platforms such as ARM-based Chromebooks.
+
+Translator slowness can be mitigated or hidden in a number of ways.
+
+- Parallel translation.  However, slow machines where this matters most, e.g.
+  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
+  and are likely to have less memory available for multiple translation threads to
+  use.
+
+- Streaming translation, i.e. start translating as soon as the download starts.
+  This doesn't help much when translation speed is 10× slower than download
+  speed, or the ``.pexe`` file is already cached while the translated binary was
+  flushed from the cache.
+
+- Arrange the web page such that translation is done in parallel with
+  downloading large assets.
+
+- Arrange the web page to distract the user with `cat videos
+  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
+  progress.
+
+Or, improve translator performance to something more reasonable.
+
+This document describes Subzero's attempt to improve translation speed by an
+order of magnitude while rivaling LLVM's code quality.  Subzero does this
+through minimal IR layering, lean data structures and passes, and a careful
+selection of fast optimization passes.  It has two optimization recipes: full
+optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
+following (described in more detail below):
+
++-----------------------------------+-----------------------+
+| O2 recipe                         | Om1 recipe            |
++===================================+=======================+
+| Parse .pexe file                  | Parse .pexe file      |
++-----------------------------------+-----------------------+
+| Loop nest analysis                |                       |
++-----------------------------------+-----------------------+
+| Address mode inference            |                       |
++-----------------------------------+-----------------------+
+| Read-modify-write (RMW) transform |                       |
++-----------------------------------+-----------------------+
+| Basic liveness analysis           |                       |
++-----------------------------------+-----------------------+
+| Load optimization                 |                       |
++-----------------------------------+-----------------------+
+|                                   | Phi lowering (simple) |
++-----------------------------------+-----------------------+
+| Target lowering                   | Target lowering       |
++-----------------------------------+-----------------------+
+| Full liveness analysis            |                       |
++-----------------------------------+-----------------------+
+| Register allocation               |                       |
++-----------------------------------+-----------------------+
+| Phi lowering (advanced)           |                       |
++-----------------------------------+-----------------------+
+| Post-phi register allocation      |                       |
++-----------------------------------+-----------------------+
+| Branch optimization               |                       |
++-----------------------------------+-----------------------+
+| Code emission                     | Code emission         |
++-----------------------------------+-----------------------+
+
+Goals
+=====
+
+Translation speed
+-----------------
+
+We'd like to be able to translate a ``.pexe`` file as fast as download speed.
+Any faster is in a sense wasted effort.  Download speed varies greatly, but
+we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
+slow machine.  We observe a 3× single-thread performance difference between A15
+and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
+``.pexe`` file could be compressed to 50% on the web server using gzip transport
+compression, so we set the translation speed goal to 6 MB/sec on the high-end
+Xeon workstation.
+
+Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
+⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
+
+In other words, Subzero's goal is to improve over LLVM's translation speed by
+10×.
+
+Code quality
+------------
+
+Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
+code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
+average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
+
+It's important to note that the quality of Subzero-generated code depends on
+target-neutral optimizations and simplifications being run beforehand in the
+developer environment.  The ``.pexe`` file reflects these optimizations.  For
+example, Subzero assumes that the basic blocks are ordered topologically where
+possible (which makes liveness analysis converge fastest), and Subzero does not
+do any function inlining because it should already have been done.
+
+Translator size
+---------------
+
+The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
+We think 1 MB is a more reasonable size -- especially for such a component that
+is distributed to a billion Chrome users.  Thus we target a 10× reduction in
+binary size.
+
+For development, Subzero can be built for all target architectures, with all
+debugging and diagnostic options enabled.  For a smaller translator, we restrict
+to a single target architecture, and define a ``MINIMAL`` build where
+unnecessary features are compiled out.
+
+Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
+include directories, which have little impact on translator size.  It also uses
+some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
+with little size impact.  In non-``MINIMAL`` builds, the translator size is much
+larger due to including code for parsing text-format bitcode files and forming
+LLVM IR.
+
+Memory footprint
+----------------
+
+The current LLVM-based translator suffers from an issue in which some
+function-specific data has to be retained in memory until all translation
+completes, and therefore the memory footprint grows without bound.  Large
+``.pexe`` files can lead to the translator process holding hundreds of MB of
+memory by the end.  The translator runs in a separate process, so this memory
+growth doesn't *directly* affect other processes, but it does dirty the physical
+memory and contributes to a perception of bloat and sometimes a reality of
+out-of-memory tab killing, especially noticeable on weaker systems.
+
+Subzero should maintain a stable memory footprint throughout translation.  It's
+not really practical to set a specific limit, because there is not really a
+practical limit on a single function's size, but the footprint should be
+"reasonable" and be proportional to the largest input function size, not the
+total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
+inexorable memory growth.  (We use ASAN builds to test for leaks.)
+
+Multithreaded translation
+-------------------------
+
+It should be practical to translate different functions concurrently and see
+good scalability.  Some locking may be needed, such as accessing output buffers
+or constant pools, but that should be fairly minimal.  In contrast, LLVM was
+only designed for module-level parallelism, and as such, the PNaCl translator
+internally splits a ``.pexe`` file into several modules for concurrent
+translation.  All output needs to be deterministic regardless of the level of
+multithreading, i.e. functions and data should always be output in the same
+order.
+
+Target architectures
+--------------------
+
+Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
+targets include ARM64 and MIPS64, though these targets lack NaCl support
+including a sandbox model or a validator.
+
+The first implementation is for x86-32, because it was expected to be
+particularly challenging, and thus more likely to draw out any design problems
+early:
+
+- There are a number of special cases, asymmetries, and warts in the x86
+  instruction set.
+
+- Complex addressing modes may be leveraged for better code quality.
+
+- 64-bit integer operations have to be lowered into longer sequences of 32-bit
+  operations.
+
+- Paucity of physical registers may reveal code quality issues early in the
+  design.
+
+Detailed design
+===============
+
+Intermediate representation - ICE
+---------------------------------
+
+Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
+IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
+representation of global variables and initializers, and a set of functions.
+Each function contains a list of basic blocks, and each basic block contains a
+list of instructions.  Instructions that operate on stack and register variables
+do so using static single assignment (SSA) form.
+
+The ``.pexe`` file is translated one function at a time (or in parallel by
+multiple translation threads).  The recipe for optimization passes depends on
+the specific target and optimization level, and is described in detail below.
+Global variables (types and initializers) are simply and directly translated to
+object code, without any meaningful attempts at optimization.
+
+A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
+Its key contents include:
+
+- A list of ``CfgNode`` pointers, generally held in topological order.
+
+- A list of ``Variable`` pointers corresponding to local variables used in the
+  function plus compiler-generated temporaries.
+
+A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
+include:
+
+- A linear list of instructions, in the same style as LLVM.  The last
+  instruction of the list is always a terminator instruction: branch, switch,
+  return, unreachable.
+
+- A list of Phi instructions, also in the same style as LLVM.  They are held as
+  a linear list for convenience, though per Phi semantics, they are executed "in
+  parallel" without dependencies on each other.
+
+- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
+  another list for outgoing edges.
+
+- The node's unique, 0-based index into the CFG's node list.
+
+An instruction is represented by the ``Ice::Inst`` class.  Its key contents
+include:
+
+- A list of source operands.
+
+- Its destination variable, if the instruction produces a result in an
+  ``Ice::Variable``.
+
+- A bitvector indicating which variables' live ranges this instruction ends.
+  This is computed during liveness analysis.
+
+Instruction kinds are divided into high-level ICE instructions and low-level
+ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
+instruction kinds.  Each target architecture implementation extends the
+instruction space with its own set of low-level instructions.  Generally,
+low-level instructions correspond to individual machine instructions.  The
+high-level ICE instruction space includes a few additional instruction kinds
+that are not part of LLVM but are generally useful (e.g., an Assignment
+instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
+instructions for sandboxing).
+
+Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
+ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
+<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
+the following:
+
+- Alloca: allocate data on the stack
+
+- Arithmetic: binary operations of the form ``A = B op C``
+
+- Br: conditional or unconditional branch
+
+- Call: function call
+
+- Cast: unary type-conversion operations
+
+- ExtractElement: extract a scalar element from a vector-type value
+
+- Fcmp: floating-point comparison
+
+- Icmp: integer comparison
+
+- IntrinsicCall: call a known intrinsic
+
+- InsertElement: insert a scalar element into a vector-type value
+
+- Load: load a value from memory
+
+- Phi: implement the SSA phi node
+
+- Ret: return from the function
+
+- Select: essentially the C language operation of the form ``X = C ? Y : Z``
+
+- Store: store a value into memory
+
+- Switch: generalized branch to multiple possible locations
+
+- Unreachable: indicate that this portion of the code is unreachable
+
+The additional high-level ICE instructions are the following:
+
+- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
+  instructions to non-SSA assignments, before lowering to machine code.
+
+- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
+  common across all targets and so they are elevated to the high-level
+  instruction set.
+
+- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
+  liveness analysis, elevated to the high-level instruction set because they are
+  used by all targets.  They are described in more detail at the end of this
+  section.
+
+- JumpTable: this represents the result of switch optimization analysis, where
+  some switch instructions may use jump tables instead of cascading
+  compare/branches.
+
+An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
+operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
+include scalar integer constants, scalar floating point constants, Undef (an
+unspecified constant of a particular scalar or vector type), and symbol
+constants (essentially addresses of globals).  Note that the PNaCl ABI does not
+include vector-type constants besides Undef, and as such, Subzero (so far) has
+no reason to represent vector-type constants internally.  A variable represents
+a value allocated on the stack (though not including alloca-derived storage).
+Among other things, a variable holds its unique, 0-based index into the CFG's
+variable list.
+
+Each target can extend the ``Constant`` and ``Variable`` classes for its own
+needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
+x86 ``MemOperand`` that encodes a base register, an index register, an index
+register shift amount, and a constant offset.
+
+Register allocation and liveness analysis are restricted to Variable operands.
+Because of the importance of register allocation to code quality, and the
+translation-time cost of liveness analysis, Variable operands get some special
+treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
+across all the Variables of an instruction.  An instruction holds a list of
+operands, but an operand may contain 0, 1, or more Variables.  As such, the
+``Operand`` class specially holds a list of Variables contained within, for
+quick access.
+
+A Subzero transformation pass may work by deleting an existing instruction and
+replacing it with zero or more new instructions.  Instead of actually deleting
+the existing instruction, we generally mark it as deleted and insert the new
+instructions right after the deleted instruction.  When printing the IR for
+debugging, this is a big help because it makes it much clearer how the
+non-deleted instructions came about.
+
+Subzero has a few special instructions to help with liveness analysis
+consistency.
+
+- The FakeDef instruction gives a fake definition of some variable.  For
+  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
+  but an ICE instruction can represent only one destination variable.  This is
+  similar for multiply instructions, and for function calls that return a 64-bit
+  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
+  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
+
+- The FakeUse instruction registers a use of a variable, typically to prevent an
+  earlier assignment to that variable from being dead-code eliminated.  For
+  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
+  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
+  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
+  dead-code eliminate the first instruction.
+
+- The FakeKill instruction is added after a call instruction, and is a quick way
+  of indicating that caller-save registers are invalidated.
+
+Pexe parsing
+------------
+
+Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
+parses the ``.pexe`` file function by function, ultimately constructing an ICE
+CFG for each function.  After a function is parsed, its CFG is handed off to the
+translation phase.  The bitcode parser also parses global initializer data and
+hands it off to be translated to data sections in the object file.
+
+Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
+be used to parse a module into LLVM IR (though very slowly relative to Subzero
+native parsing).  Then we iterate across the LLVM IR and construct high-level
+ICE, handing off each CFG to the translation phase.
+
+Overview of lowering
+--------------------
+
+In general, translation goes like this:
+
+- Parse the next function from the ``.pexe`` file and construct a CFG consisting
+  of high-level ICE.
+
+- Do analysis passes and transformation passes on the high-level ICE, as
+  desired.
+
+- Lower each high-level ICE instruction into a sequence of zero or more
+  low-level ICE instructions.  Each high-level instruction is generally lowered
+  independently, though the target lowering is allowed to look ahead in the
+  CfgNode's instruction list if desired.
+
+- Do more analysis and transformation passes on the low-level ICE, as desired.
+
+- Assemble the low-level CFG into an ELF object file (alternatively, a textual
+  assembly file that is later assembled by some external tool).
+
+- Repeat for all functions, and also produce object code for data such as global
+  initializers and internal constant pools.
+
+Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
+the intention is to apply all available optimizations to get the best code
+quality (though the initial code quality goal is measured against LLVM's ``O0``
+code quality).  For ``Om1``, the intention is to apply as few optimizations as
+possible and produce code as quickly as possible, accepting poor code quality.
+``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
+"sub-zero".
+
+High-level debuggability of generated code is so far not a design requirement.
+Subzero doesn't really do transformations that would obfuscate debugging; the
+main thing might be that register allocation (including stack slot coalescing
+for stack-allocated variables whose live ranges don't overlap) may render a
+variable's value unobtainable after its live range ends.  This would not be an
+issue for ``Om1`` since it doesn't register-allocate program-level variables,
+nor does it coalesce stack slots.  That said, fully supporting debuggability
+would require a few additions:
+
+- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
+  propagates global symbol names, local variable names, and function-internal
+  label names that are present in the ``.pexe`` file.  This would allow a
+  debugger to map addresses back to symbols in the ``.pexe`` file.
+
+- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
+  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
+  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
+  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
+
+- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
+  the toolchain would need to be modified to preserve it.
+
+Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
+produces code with one third the code quality.  ``Om1`` is good for testing and
+debugging -- during translation, it tends to expose errors in the basic lowering
+that might otherwise have been hidden by the register allocator or other
+optimization passes.  It also helps determine whether a code correctness problem
+is a fundamental problem in the basic lowering, or an error in another
+optimization pass.
+
+The implementation of target lowering also controls the recipe of passes used
+for ``Om1`` and ``O2`` translation.  For example, address mode inference may
+only be relevant for x86.
+
+Lowering strategy
+-----------------
+
+The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
+each high-level instruction down to a sequence of low-level target-specific
+instructions, in a largely context-free setting.  That is, each high-level
+instruction conceptually has a simple template expansion into low-level
+instructions, and lowering can in theory be done in any order.  This may sound
+like a small effort, but quite a large number of templates may be needed because
+of the number of PNaCl types and instruction variants.  Furthermore, there may
+be optimized templates, e.g. to take advantage of operator commutativity (for
+example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
+similar to other template-based approaches in fast code generation or
+interpretation, though some decisions are deferred until after some global
+analysis passes, mostly related to register allocation, stack slot assignment,
+and specific choice of instruction variant and addressing mode.
+
+The key idea for a lowering template is to produce valid low-level instructions
+that are guaranteed to meet address mode and other structural requirements of
+the instruction set.  For example, on x86, the source operand of an integer
+store instruction must be an immediate or a physical register; a shift
+instruction's shift amount must be an immediate or in register ``%cl``; a
+function's integer return value is in ``%eax``; most x86 instructions are
+two-operand, in contrast to corresponding three-operand high-level instructions;
+etc.
+
+Because target lowering runs before register allocation, there is no way to know
+whether a given ``Ice::Variable`` operand lives on the stack or in a physical
+register.  When the low-level instruction calls for a physical register operand,
+the target lowering can create an infinite-weight Variable.  This tells the
+register allocator to assign infinite weight when making decisions, effectively
+guaranteeing some physical register.  Variables can also be pre-colored to a
+specific physical register (``%cl`` in the shift example above), which also gives
+infinite weight.
+
+To illustrate, consider a high-level arithmetic instruction on 32-bit integer
+operands::
+
+    A = B + C
+
+X86 target lowering might produce the following::
+
+    T.inf = B  // mov instruction
+    T.inf += C // add instruction
+    A = T.inf  // mov instruction
+
+Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
+physical register, the three lowered instructions are all encodable regardless
+of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
+whether ``A`` is a physical register or in memory.
+
+In this example, ``A`` must be a Variable and one may be tempted to simplify the
+lowering sequence by setting ``A`` as infinite-weight and using::
+
+        A = B  // mov instruction
+        A += C // add instruction
+
+This has two problems.  First, if the original instruction was actually ``A =
+B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
+register applies throughout ``A``'s entire live range.  This is probably not
+what is intended, and may ultimately lead to a failure to allocate a register
+for an infinite-weight variable.
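+
+The first problem can be checked with a toy interpreter.  The following Python
+sketch is illustrative only: the tuple encoding of instructions and the helper
+names ``lower_add``, ``lower_add_naive``, and ``run`` are assumptions made for
+this example, not Subzero code.
+
```python
def lower_add(dest, src0, src1, temp="T.inf"):
    """Template from the text: copy into a fresh temporary, add, copy out."""
    return [("mov", temp, src0), ("add", temp, src1), ("mov", dest, temp)]


def lower_add_naive(dest, src0, src1):
    """Tempting shortcut that uses the destination as the accumulator."""
    return [("mov", dest, src0), ("add", dest, src1)]


def run(insts, env):
    """Execute (op, dst, src) tuples; src is a variable name or an int."""
    env = dict(env)
    for op, dst, src in insts:
        val = env[src] if isinstance(src, str) else src
        env[dst] = val if op == "mov" else env[dst] + val
    return env


env = {"A": 10, "B": 3}
# The high-level instruction is A = B + A, so the expected result is 13.
good = run(lower_add("A", "B", "A"), env)["A"]
bad = run(lower_add_naive("A", "B", "A"), env)["A"]
```
+
+With the temporary, ``A`` ends up holding ``B + A``.  The shortcut overwrites
+``A`` before it is read as the second operand, so it computes ``B + B`` instead.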
+
+This style of lowering leads to many temporaries being generated, so in ``O2``
+mode, we rely on the register allocator to clean things up.  For example, in the
+example above, if ``B`` ends up getting a physical register and its live range
+ends at this instruction, the register allocator is likely to reuse that
+register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
+copy, which is removed as an emission-time peephole optimization.
+
+O2 lowering
+-----------
+
+Currently, the ``O2`` lowering recipe is the following:
+
+- Loop nest analysis
+
+- Address mode inference
+
+- Read-modify-write (RMW) transformation
+
+- Basic liveness analysis
+
+- Load optimization
+
+- Target lowering
+
+- Full liveness analysis
+
+- Register allocation
+
+- Phi instruction lowering (advanced)
+
+- Post-phi lowering register allocation
+
+- Branch optimization
+
+These passes are described in more detail below.
+
+Om1 lowering
+------------
+
+Currently, the ``Om1`` lowering recipe is the following:
+
+- Phi instruction lowering (simple)
+
+- Target lowering
+
+- Register allocation (infinite-weight and pre-colored only)
+
+Optimization passes
+-------------------
+
+Liveness analysis
+^^^^^^^^^^^^^^^^^
+
+Liveness analysis is a standard dataflow analysis, implemented as follows.
+For each node (basic block), its live-out set is computed as the union of the
+live-in sets of its successor nodes.  Then the node's instructions are processed
+in reverse order, updating the live set, until the beginning of the node is
+reached, and the node's live-in set is recorded.  If this iteration has changed
+the node's live-in set, the node's predecessors are marked for reprocessing.
+This continues until no more nodes need reprocessing.  If nodes are processed in
+reverse topological order, the number of iterations over the CFG is generally
+equal to the maximum loop nest depth.
+
+To implement this, each node records its live-in and live-out sets, initialized
+to the empty set.  Each instruction records which of its Variables' live ranges
+end in that instruction, initialized to the empty set.  A side effect of
+liveness analysis is dead instruction elimination.  Each instruction can be
+marked as tentatively dead, and after the algorithm converges, the tentatively
+dead instructions are permanently deleted.
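+
+As a rough illustration of the worklist iteration and the tentative dead
+marking, here is a minimal Python sketch.  It is not Subzero's implementation:
+instructions are modeled as hypothetical ``(dest, sources)`` tuples, sets stand
+in for bit vectors, and an instruction with a destination is assumed to have no
+side effects.
+
```python
from collections import deque


def liveness(blocks, succs):
    """blocks: {name: [(dest, [srcs]), ...]}, listed in topological order.
    succs: {name: [successor names]}.

    Returns (live_in, live_out, dead); dead holds (block, index) pairs.
    """
    preds = {b: [] for b in blocks}
    for b, ss in succs.items():
        for s in ss:
            preds[s].append(b)
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    dead = set()
    work = deque(reversed(list(blocks)))  # reverse topological order
    while work:
        b = work.popleft()
        live = set()
        for s in succs.get(b, []):
            live |= live_in[s]
        live_out[b] = set(live)
        for i in range(len(blocks[b]) - 1, -1, -1):
            dest, srcs = blocks[b][i]
            if dest is not None and dest not in live:
                dead.add((b, i))  # tentatively dead; its sources stay unused
                continue
            dead.discard((b, i))  # a later iteration may revive it
            live.discard(dest)
            live.update(srcs)
        if live != live_in[b]:
            live_in[b] = live
            for p in preds[b]:
                if p not in work:
                    work.append(p)
    return live_in, live_out, dead


blocks = {
    "entry": [("a", []), ("i", [])],
    "loop": [("t", ["a", "i"]), ("i", ["t"]), ("d", ["t", "t"])],
    "exit": [(None, ["i"])],  # e.g. a return of i
}
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out, dead = liveness(blocks, succs)
```
+
+In this example the ``loop`` block is processed twice because its own live-in
+set feeds its live-out set, and the definition of ``d`` is the only instruction
+left marked dead at convergence.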
+
+Optionally, after this liveness analysis completes, we can do live range
+construction, in which we calculate the live range of each variable in terms of
+instruction numbers.  A live range is represented as a union of segments, where
+the segment endpoints are instruction numbers.  Instruction numbers are required
+to be unique across the CFG, and monotonically increasing within a basic block.
+As a union of segments, live ranges can contain "gaps" and are therefore
+precise.  Because of SSA properties, a variable's live range can start at most
+once in a basic block, and can end at most once in a basic block.  Liveness
+analysis keeps track of which variable/instruction tuples begin live ranges and
+end live ranges, and combined with live-in and live-out sets, we can efficiently
+build up live ranges of all variables across all basic blocks.
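+
+To show why gaps matter, the sketch below tests two segment lists for overlap.
+It assumes sorted, disjoint, half-open ``(start, end)`` segments; that endpoint
+convention and the function name ``overlaps`` are choices made for this
+example and may differ from Subzero's.
+
```python
def overlaps(a, b):
    """a, b: sorted lists of disjoint (start, end) segments, end exclusive."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][1] <= b[j][0]:
            i += 1          # a's segment ends before b's begins
        elif b[j][1] <= a[i][0]:
            j += 1          # b's segment ends before a's begins
        else:
            return True
    return False


x = [(2, 6), (20, 24)]  # live, then a gap, then live again
y = [(8, 18)]           # fits entirely inside x's gap
z = [(5, 9)]            # straddles the end of x's first segment
```
+
+Here ``x`` and ``y`` can share a register, whereas collapsing ``x`` to the
+single range ``(2, 24)`` would report a conflict with ``y``.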
+
+A lot of care is taken to try to make liveness analysis fast and efficient.
+Because of the lowering strategy, the number of variables is generally
+proportional to the number of instructions, leading to an O(N^2) complexity
+algorithm if implemented naively.  To improve things based on sparsity, we note
+that most variables are "local" and referenced in at most one basic block (in
+contrast to the "global" variables with multi-block usage), and therefore cannot
+be live across basic blocks.  Therefore, the live-in and live-out sets,
+typically represented as bit vectors, can be limited to the set of global
+variables, and the intra-block liveness bit vector can be compacted to hold the
+global variables plus the local variables for that block.
+
+Register allocation
+^^^^^^^^^^^^^^^^^^^
+
+Subzero implements a simple linear-scan register allocator, based on the
+allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
+Register Allocation in the Context of SSA Form and Register Constraints
+<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
+several nice features:
+
+- Live ranges are represented as unions of segments, as described above, rather
+  than a single start/end tuple.
+
+- It allows pre-coloring of variables with specific physical registers.
+
+- It applies equally well to pre-lowered Phi instructions.
+
+The paper suggests an approach of aggressively coalescing variables across Phi
+instructions (i.e., trying to force Phi source and destination variables to have
+the same register assignment), but we reject that in favor of the more natural
+preference mechanism described below.
+
+We enhance the algorithm in the paper with the capability of automatic inference
+of register preference, and with the capability of allowing overlapping live
+ranges to safely share the same register in certain circumstances.  If we are
+considering register allocation for variable ``A``, and ``A`` has a single
+defining instruction ``A=B+C``, then the preferred register for ``A``, if
+available, would be the register assigned to ``B`` or ``C``, if any, provided
+that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
+way we infer a good register preference for ``A``.
+
+We allow overlapping live ranges to get the same register in certain cases.
+Suppose a high-level instruction like::
+
+    A = unary_op(B)
+
+has been target-lowered like::
+
+    T.inf = B
+    A = unary_op(T.inf)
+
+Further, assume that ``B``'s live range continues beyond this instruction
+sequence, and that ``B`` has already been assigned some register.  Normally, we
+might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
+turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
+different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
+guaranteed to be in a register, so in theory these overlapping live ranges could
+safely have the same register.  Our implementation allows this overlap as long
+as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
+modified within ``T.inf``'s live range.
+
+Subzero's register allocator can be run in 3 configurations.
+
+- Normal mode.  All Variables are considered for register allocation.  It
+  requires full liveness analysis and live range construction as a prerequisite.
+  This is used by ``O2`` lowering.
+
+- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
+  All other Variables are stack-allocated.  It does not require liveness
+  analysis; instead, it quickly scans the instructions and records first
+  definitions and last uses of all relevant Variables, using that to construct a
+  single-segment live range.  Although this includes most of the Variables, the
+  live ranges are mostly simple, short, and rarely overlapping, which the
+  register allocator handles efficiently.  This is used by ``Om1`` lowering.
+
+- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
+  register allocation, and may result in new infinite-weight Variables that need
+  registers.  One would like to just run something like minimal mode to assign
+  registers to the new Variables while respecting existing register allocation
+  decisions.  However, it sometimes happens that there are no free registers.
+  In this case, some register needs to be forcibly spilled to the stack and
+  temporarily reassigned to the new Variable, and reloaded at the end of the new
+  Variable's live range.  The register must be one that has no explicit
+  references during the Variable's live range.  Since Subzero currently doesn't
+  track def/use chains (though it does record the CfgNode where a Variable is
+  defined), we just do a brute-force search across the CfgNode's instruction
+  list for the instruction numbers of interest.  This situation happens very
+  rarely, so there's little point for now in improving its performance.
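+
+The quick scan used by minimal mode might look roughly like the following
+Python sketch.  The ``(dest, sources)`` instruction encoding and the function
+name are hypothetical, and instruction numbers are simply list indexes.
+
```python
def single_segment_ranges(insts, wanted):
    """insts: list of (dest, [srcs]); wanted: set of variables to track.

    Returns {var: (first_def, last_use)} in instruction numbers.
    """
    ranges = {}
    for num, (dest, srcs) in enumerate(insts):
        for v in srcs:
            if v in ranges:
                ranges[v] = (ranges[v][0], num)  # push the end forward
        if dest in wanted and dest not in ranges:
            ranges[dest] = (num, num)            # first definition
    return ranges


# T = B; T += C; A = T, tracking only the infinite-weight temporary T.
insts = [("T", ["B"]), ("T", ["T", "C"]), ("A", ["T"])]
ranges = single_segment_ranges(insts, {"T"})
```
+
+A single linear pass suffices because each tracked Variable gets exactly one
+segment, with no need for live-in or live-out sets.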
+
+The basic linear-scan algorithm may, as it proceeds, rescind an early register
+allocation decision, leaving that Variable to be stack-allocated.  Some of these
+times, it turns out that the Variable could have been given a different register
+without conflict, but by this time it's too late.  The literature recognizes
+this situation and describes "second-chance bin-packing", which Subzero can do.
+We can rerun the register allocator in a mode that respects existing register
+allocation decisions, and sometimes it finds new non-conflicting opportunities.
+In fact, we can repeatedly run the register allocator until convergence.
+Unfortunately, in the current implementation, these subsequent register
+allocation passes end up being extremely expensive.  This is because of the
+treatment of the "unhandled pre-colored" Variable set, which is normally very
+small but ends up being quite large on subsequent passes.  Its performance can
+probably be made acceptable with a better choice of data structures, but for now
+this second-chance mechanism is disabled.
+
+Future work is to implement LLVM's `Greedy
+<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
+register allocator as a replacement for the basic linear-scan algorithm, given
+LLVM's experience with its improvement in code quality.  (The blog post claims
+that the Greedy allocator also improved maintainability because a lot of hacks
+could be removed, but Subzero is probably not yet to that level of hacks, and is
+less likely to see that particular benefit.)
+
+Basic phi lowering
+^^^^^^^^^^^^^^^^^^
+
+The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
+implements it).  Consider this example::
+
+    L1:
+      ...
+      br L3
+    L2:
+      ...
+      br L3
+    L3:
+      A = phi [B, L1], [C, L2]
+      X = phi [Y, L1], [Z, L2]
+
+For each destination of a phi instruction, we can create a temporary and insert
+the temporary's assignment at the end of the predecessor block::
+
+    L1:
+      ...
+      A' = B
+      X' = Y
+      br L3
+    L2:
+      ...
+      A' = C
+      X' = Z
+      br L3
+    L3:
+      A = A'
+      X = X'
+
+This transformation is very simple and reliable.  It can be done before target
+lowering and register allocation, and it easily avoids the classic lost-copy and
+related problems.  ``Om1`` lowering uses this strategy.
+
+However, it has the disadvantage of initializing temporaries even for branches
+not taken, though that could be mitigated by splitting non-critical edges and
+putting assignments in the edge-split nodes.  Another problem is that without
+extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
+specific ordering even though phi semantics are that the assignments are
+parallel or unordered.  This sometimes imposes false live range overlaps and
+leads to poorer register allocation.
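+
+The basic transformation can be sketched in a few lines of Python.  The block
+and instruction encoding below is an assumption made for illustration (each
+block's last instruction is taken to be its terminator); it is not how Subzero
+represents a CFG.
+
```python
def lower_phis_simple(blocks):
    """blocks: {label: {"phis": [(dest, {pred: src})], "insts": [...]}}.

    Returns {label: instruction list} with phis replaced by assignments.
    """
    out = {label: list(b["insts"]) for label, b in blocks.items()}
    for label, b in blocks.items():
        copies_in = []
        for dest, incoming in b["phis"]:
            tmp = dest + "'"
            for pred, src in incoming.items():
                # Insert just before the predecessor's terminator.
                out[pred].insert(len(out[pred]) - 1, ("assign", tmp, src))
            copies_in.append(("assign", dest, tmp))
        out[label][0:0] = copies_in  # copies go at the top of the phi block
    return out


blocks = {
    "L1": {"phis": [], "insts": [("br", "L3")]},
    "L2": {"phis": [], "insts": [("br", "L3")]},
    "L3": {
        "phis": [("A", {"L1": "B", "L2": "C"}), ("X", {"L1": "Y", "L2": "Z"})],
        "insts": [("ret",)],
    },
}
lowered = lower_phis_simple(blocks)
```
+
+The result reproduces the lowered example shown earlier in this section,
+including the fixed ordering of the ``A'`` and ``X'`` assignments.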
+
+Advanced phi lowering
+^^^^^^^^^^^^^^^^^^^^^
+
+``O2`` lowering defers phi lowering until after register allocation to avoid the
+problem of false live range overlaps.  It works as follows.  We split each
+incoming edge and move the (parallel) phi assignments into the split nodes.  We
+linearize each set of assignments by finding a safe, topological ordering of the
+assignments, respecting register assignments as well.  For example::
+
+    A = B
+    X = Y
+
+Normally these assignments could be executed in either order, but if ``B`` and
+``X`` are assigned the same physical register, we would want to use the above
+ordering.  Dependency cycles are broken by introducing a temporary.  For
+example::
+
+    A = B
+    B = A
+
+Here, a temporary breaks the cycle::
+
+    t = A
+    A = B
+    B = t
+
+Finally, we use the existing target lowering to lower the assignments in this
+basic block, and once that is done for all basic blocks, we run the post-phi
+variant of register allocation on the edge-split basic blocks.
+
+When computing a topological order, we try to first schedule assignments whose
+source has a physical register, and last schedule assignments whose destination
+has a physical register.  This helps reduce register pressure.
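+
+The linearization step can be sketched as follows.  This is a generic
+parallel-copy sequencing routine written for illustration, under the assumption
+that each destination appears at most once; it omits the register-pressure
+heuristic described above, and the temporary name ``t`` is hypothetical.
+
```python
def sequence_copies(copies, temp="t"):
    """copies: list of (dest, src) pairs with parallel semantics.

    Returns an equivalent list of copies that can be executed in order.
    """
    pending = [(d, s) for d, s in copies if d != s]
    ordered = []
    while pending:
        sources = {s for _, s in pending}
        ready = [c for c in pending if c[0] not in sources]
        if ready:
            # Safe: no remaining copy still needs to read this destination.
            ordered.append(ready[0])
            pending.remove(ready[0])
        else:
            # Every destination is still needed as a source: a cycle.
            d, _ = pending[0]
            ordered.append((temp, d))  # save d's old value
            pending = [(dd, temp if ss == d else ss) for dd, ss in pending]
    return ordered


cycle = sequence_copies([("A", "B"), ("B", "A")])
# B and X share a register, modeled here by using the name "eax" for both.
shared = sequence_copies([("eax", "Y"), ("A", "eax")])
```
+
+For the swap, the result matches the three-instruction sequence shown above.
+For the shared-register case, the copy that reads ``eax`` is scheduled before
+the copy that overwrites it.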
+
+X86 address mode inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We try to take advantage of the x86 addressing mode that includes a base
+register, an index register, an index register scale amount, and an immediate
+offset.  We do this through simple pattern matching.  Starting with a load or
+store instruction where the address is a variable, we initialize the base
+register to that variable, and look up the instruction where that variable is
+defined.  If that is an add instruction of two variables and the index register
+hasn't been set, we replace the base and index register with those two
+variables.  If instead it is an add instruction of a variable and a constant, we
+replace the base register with the variable and add the constant to the
+immediate offset.
+
+There are several more patterns that can be matched.  This pattern matching
+continues on the load or store instruction until no more matches are found.
+Because a program typically has few load and store instructions (not to be
+confused with instructions that manipulate stack variables), this address mode
+inference pass is fast.
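+
+A minimal Python sketch of the two patterns described above follows.  It
+assumes each variable has a single defining instruction recorded in a ``defs``
+map, follows only the base register, and leaves out the scale patterns; the
+names are hypothetical.
+
```python
def infer_address(var, defs):
    """defs maps a variable to its defining add: ("add", lhs, rhs).

    Operands are variable names (str) or integer constants.
    Returns (base, index, offset).
    """
    base, index, offset = var, None, 0
    while base in defs:
        _, lhs, rhs = defs[base]
        if isinstance(lhs, str) and isinstance(rhs, int):
            # base = lhs + constant: fold the constant into the offset.
            base, offset = lhs, offset + rhs
        elif isinstance(lhs, str) and isinstance(rhs, str) and index is None:
            # base = lhs + rhs: split into base and index registers.
            base, index = lhs, rhs
        else:
            break
    return base, index, offset


# q = add a, i ; p = add q, 8 ; ... = load p
defs = {"q": ("add", "a", "i"), "p": ("add", "q", 8)}
mode = infer_address("p", defs)
```
+
+The load through ``p`` can then use the operand ``8(a, i)`` directly, and the
+two add instructions become dead if ``p`` and ``q`` have no other uses.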
+
+X86 read-modify-write inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A reasonably common bitcode pattern is a non-atomic update of a memory
+location::
+
+    x = load addr
+    y = add x, 1
+    store y, addr
+
+On x86, with good register allocation, the Subzero passes described above
+generate code with only this quality::
+
+    mov [%ebx], %eax
+    add $1, %eax
+    mov %eax, [%ebx]
+
+However, x86 allows for this kind of code::
+
+    add $1, [%ebx]
+
+which requires fewer instructions, but perhaps more importantly, requires fewer
+physical registers.
+
+It's also important to note that this transformation only makes sense if the
+store instruction ends ``x``'s live range.
+
+Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
+opportunities via simple pattern matching.  The only problem is that it is run
+before liveness analysis, which is needed to determine whether ``x``'s live
+range ends after the RMW.  Since liveness analysis is one of the most expensive
+passes, it's not attractive to run it an extra time just for RMW analysis.
+Instead, we essentially generate both the RMW and the non-RMW versions, and then
+during lowering, the RMW version deletes itself if it finds ``x`` still live.
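+
+The pattern match itself can be sketched as below.  The tuple encoding
+(``("load", dest, addr)``, ``(op, dest, src0, src1)``, ``("store", value,
+addr)``), the restriction to adjacent instructions, and the set of accepted
+operators are all assumptions made for this example.  As in the text, the
+liveness check is deliberately left for later.
+
```python
RMW_OPS = ("add", "sub", "and", "or", "xor")  # assumed operator set


def find_rmw(insts):
    """Return (index, ("rmw", op, addr, operand)) for each matching triple."""
    found = []
    for i in range(len(insts) - 2):
        load, arith, store = insts[i:i + 3]
        if (load[0] == "load" and store[0] == "store"
                and arith[0] in RMW_OPS
                and arith[2] == load[1]     # arithmetic consumes the load
                and store[1] == arith[1]    # store writes the result
                and store[2] == load[2]):   # same address
            found.append((i, ("rmw", arith[0], load[2], arith[3])))
    return found


insts = [("load", "x", "addr"), ("add", "y", "x", 1), ("store", "y", "addr")]
candidates = find_rmw(insts)
```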
+
+X86 compare-branch inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the LLVM instruction set, the compare/branch pattern works like this::
+
+    cond = icmp eq a, b
+    br cond, target
+
+The result of the icmp instruction is a single bit, and a conditional branch
+tests that bit.  By contrast, most target architectures use this pattern::
+
+    cmp a, b  // implicitly sets various bits of FLAGS register
+    br eq, target  // branch on a particular FLAGS bit
+
+A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
+``cond`` and conditionally branches.  Subzero has a pass that identifies
+boolean-based operations like this and folds them into a single
+compare/branch-like operation.  It is set up for more than just cmp/br though.
+Boolean producers include icmp (integer compare), fcmp (floating-point compare),
+and trunc (integer truncation when the destination has bool type).  Boolean
+consumers include branch, select (the ternary operator from the C language), and
+sign-extend and zero-extend when the source has bool type.
+
+Sandboxing
+^^^^^^^^^^
+
+Native Client's sandbox model uses software fault isolation (SFI) to provide
+safety when running untrusted code in a browser or other environment.  Subzero
+implements Native Client's `sandboxing
+<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
+to enable Subzero-translated executables to be run inside Chrome.  Subzero also
+provides a fairly simple framework for investigating alternative sandbox models
+or other restrictions on the sandbox model.
+
+Sandboxing in Subzero is not actually implemented as a separate pass, but is
+integrated into lowering and assembly.
+
+- Indirect branches, including the ret instruction, are masked to a bundle
+  boundary and bundle-locked.
+
+- Call instructions are aligned to the end of the bundle so that the return
+  address is bundle-aligned.
+
+- Indirect branch targets, including function entry and targets in a switch
+  statement jump table, are bundle-aligned.
+
+- The intrinsic for reading the thread pointer is inlined appropriately.
+
+- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
+  base register.  We reduce the aggressiveness of address mode inference to
+  leave room for the sandbox base register during lowering.  There are no memory
+  sandboxing changes for x86-32.
+
+Code emission
+-------------
+
+Subzero's integrated assembler is derived from Dart's `assembler code
+<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
+that iterates through the low-level ICE instructions and invokes the relevant
+assembler functions.  Placeholders are added for later fixup of branch target
+offsets.  (Backward branches use short offsets if possible; forward branches
+generally use long offsets unless it is an intra-block branch of "known" short
+length.)  The assembler emits into a staging buffer.  Once emission into the
+staging buffer for a function is complete, the data is emitted to the output
+file as an ELF object file, and metadata such as relocations, symbol table, and
+string table, are accumulated for emission at the end.  Global data initializers
+are emitted similarly.  A key point is that at this point, the staging buffer
+can be deallocated, and only a minimum of data needs to be held until the end.
+
+As a debugging alternative, Subzero can emit textual assembly code which can
+then be run through an external assembler.  This is of course super slow, but
+quite valuable when bringing up a new target.
+
+As another debugging option, the staging buffer can be emitted as textual
+assembly, primarily in the form of ".byte" lines.  This allows the assembler to
+be tested separately from the ELF-related code.
+
+Memory management
+-----------------
+
+Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
+LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
+allocated from the arena are never actually freed; instead, when the CFG
+translation completes and the CFG is deleted, the entire arena memory is
+reclaimed at once.  This style of allocation works well in an environment like a
+compiler where there are distinct phases with only easily-identifiable objects
+living across phases.  It frees the developer from having to manage object
+deletion, and it amortizes deletion costs across just a single arena deletion at
+the end of the phase.  Furthermore, it helps scalability by allocating entirely
+from thread-local memory pools, and minimizing global locking of the heap.
+
+Instructions are probably the most heavily allocated complex class in Subzero.
+We represent an instruction list as an intrusive doubly linked list, allocate
+all instructions from the ``CfgLocalAllocator``, and we make sure each
+instruction subclass is basically `POD
+<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
+trivial destructor.  This way, when the CFG is finished, we don't need to
+individually deallocate every instruction.  We do the same for Variables,
+probably the second most heavily allocated complex class.
+
+There are some situations where passes need to use some `STL container class
+<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
+``CfgLocalAllocator`` as the container allocator if this is needed.
+
+Multithreaded translation
+-------------------------
+
+Subzero is designed to be able to translate functions in parallel.  With the
+``-threads=N`` command-line option, there is a 3-stage producer-consumer
+pipeline:
+
+- A single thread parses the ``.pexe`` file and produces a sequence of work
+  units.  A work unit can be either a fully constructed CFG, or a set of global
+  initializers.  The work unit includes its sequence number denoting its parse
+  order.  Each work unit is added to the translation queue.
+
+- There are N translation threads that draw work units from the translation
+  queue and lower them into assembler buffers.  Each assembler buffer is added
+  to the emitter queue, tagged with its sequence number.  The CFG and its
+  ``CfgLocalAllocator`` are disposed of at this point.
+
+- A single thread draws assembler buffers from the emitter queue and appends to
+  the output file.  It uses the sequence numbers to reintegrate the assembler
+  buffers according to the original parse order, such that output order is
+  always deterministic.
+
+This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
+for a total of ``N+2`` execution threads, taking the parser and emitter threads
+into account.  For the special case of ``N=0``, execution is entirely sequential
+-- the same thread parses, translates, and emits, one function at a time.  This
+is useful for performance measurements.
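+
+As an illustration only, the pipeline can be sketched in Python.  This is not
+Subzero's C++ implementation: ``translate_all`` and the ``asm(...)`` strings are
+hypothetical stand-ins, and the sequential ``N=0`` case is not modelled::
+
+    import queue
+    import threading
+
+    def translate_all(work_units, num_threads):
+        """Translate with num_threads workers; emit in parse order."""
+        if num_threads < 1:
+            raise ValueError("this sketch needs at least one translator")
+        translation_q = queue.Queue()
+        emitter_q = queue.Queue()
+        output = []
+
+        def translator():
+            while True:
+                item = translation_q.get()
+                if item is None:      # sentinel: no more work
+                    break
+                seq, unit = item
+                # Stand-in for lowering a CFG into an assembler buffer.
+                emitter_q.put((seq, "asm(%s)" % unit))
+
+        def emitter(total):
+            pending = {}              # buffers that arrived out of order
+            next_seq = 0
+            while next_seq < total:
+                seq, buf = emitter_q.get()
+                pending[seq] = buf
+                while next_seq in pending:
+                    output.append(pending.pop(next_seq))
+                    next_seq += 1
+
+        workers = [threading.Thread(target=translator)
+                   for _ in range(num_threads)]
+        emit_thread = threading.Thread(target=emitter,
+                                       args=(len(work_units),))
+        for t in workers + [emit_thread]:
+            t.start()
+        # The calling thread plays the role of the parser.
+        for seq, unit in enumerate(work_units):
+            translation_q.put((seq, unit))
+        for _ in workers:
+            translation_q.put(None)
+        for t in workers + [emit_thread]:
+            t.join()
+        return output
+
+Because the emitter holds back any buffer that arrives early, the output order
+matches the parse order regardless of how the translation threads are scheduled.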
+
+Ideally, we would like to get near-linear scalability as the number of
+translation threads increases.  We expect that ``-threads=1`` should be slightly
+faster than ``-threads=0`` as the small amount of time spent parsing and
+emitting is done largely in parallel with translation.  With perfect
+scalability, we would see ``-threads=N`` translating ``N`` times as fast as
+``-threads=1``, up until the point where parsing or emitting becomes the
+bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
+performance would become a bottleneck and efficiency might peak at, say, 75%.
+
+Currently, parsing takes about 11% of total sequential time.  If translation
+ever becomes so fast and scalable that parsing becomes the bottleneck, it
+should be possible to make parsing multithreaded as well.
+
+Internally, all shared, mutable data is held in the GlobalContext object, and
+access to each field is guarded by a mutex.
+
+Security
+--------
+
+Subzero includes a number of security features in the generated code, as well as
+in the Subzero translator itself, which run on top of the existing Native Client
+sandbox as well as Chrome's OS-level sandbox.
+
+Sandboxed translator
+^^^^^^^^^^^^^^^^^^^^
+
+When running inside the browser, the Subzero translator executes as sandboxed,
+untrusted code that is initially checked by the validator, just like the
+LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
+more or less secure than the translator it replaces, from the point of view of
+the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
+was designed from the start with security in mind, so one expects fewer attacker
+opportunities here.
+
+Code diversification
+^^^^^^^^^^^^^^^^^^^^
+
+`Return-oriented programming
+<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
+now-common technique for starting with e.g. a known buffer overflow situation
+and launching it into a deeper exploit.  The attacker scans the executable
+looking for ROP gadgets, which are short sequences of code that happen to load
+known values into known registers and then return.  An attacker who manages to
+overwrite parts of the stack can overwrite it with carefully chosen return
+addresses such that certain ROP gadgets are effectively chained together to set
+up the register state as desired, finally returning to some code that manages to
+do something nasty based on those register values.
+
+If there is a popular ``.pexe`` with a large install base, the attacker could
+run Subzero on it and scan the executable for suitable ROP gadgets to use as
+part of a potential exploit.  Note that if the trusted validator is working
+correctly, these ROP gadgets are limited to starting at a bundle boundary and
+cannot use the trick of finding a gadget that happens to begin inside another
+instruction.  All the same, gadgets with these constraints still exist and the
+attacker has access to them.  This is the attack model we focus most on --
+protecting the user against misuse of a "trusted" developer's application, as
+opposed to mischief from a malicious ``.pexe`` file.
+
+Subzero can mitigate these attacks to some degree through code diversification.
+Specifically, we can apply some randomness to the code generation that makes ROP
+gadgets less predictable.  This randomness can have some compile-time cost, and
+it can affect the code quality; and some diversifications may be more effective
+than others.  A more detailed treatment of hardening techniques may be found in
+the Matasano report "`Attacking Clientside JIT Compilers
+<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
+
+To evaluate diversification effectiveness, we use a third-party ROP gadget
+finder and limit its results to bundle-aligned addresses.  For a given
+diversification technique, we run it with a number of different random seeds,
+find ROP gadgets for each version, and determine how persistent each ROP gadget
+is across the different versions.  A gadget is persistent if the same gadget is
+found at the same code address.  The best diversifications are ones with low
+gadget persistence rates.
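+
+The exact metric is not spelled out here.  One plausible reading, sketched in
+Python under the assumption that each build is summarized as a set of
+``(address, gadget_bytes)`` pairs and that ``persistence_rate`` (a hypothetical
+name) counts baseline gadgets that survive in every diversified build, is::
+
+    def persistence_rate(baseline, variants):
+        """Fraction of baseline gadgets found unchanged in all variants."""
+        if not baseline:
+            return 0.0
+        survivors = set(baseline)
+        for variant in variants:
+            # Persistent means: same gadget bytes at the same address.
+            survivors &= variant
+        return len(survivors) / len(baseline)
+
+With this definition, a gadget that merely moves to a different address counts
+as eliminated, which matches the notion of persistence described above.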
+
+Subzero implements 7 different diversification techniques.  Below is a
+discussion of each technique, its effectiveness, and its cost.  The discussions
+of cost and effectiveness are for a single diversification technique; the
+translation-time costs for multiple techniques are additive, but the effects of
+multiple techniques on code quality and effectiveness are not yet known.
+
+In Subzero's implementation, each randomization is "repeatable" in a sense.
+Each pass that includes a randomization option gets its own private instance of
+a random number generator (RNG).  The RNG is seeded with a combination of a
+global seed, the pass ID, and the function's sequence number.  The global seed
+is designed to be different across runs (perhaps based on the current time), but
+for debugging, the global seed can be set to a specific value and the results
+will be repeatable.
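+
+A minimal Python sketch of this seeding scheme follows.  It assumes the three
+values are simply joined into one seed string; how Subzero actually combines
+them is not described here, and ``make_pass_rng`` is a hypothetical name::
+
+    import random
+
+    def make_pass_rng(global_seed, pass_id, func_seq):
+        """Return a private RNG for one pass on one function."""
+        # Seeding from a string is deterministic, so fixing global_seed
+        # reproduces every downstream choice.
+        seed = "%d:%s:%d" % (global_seed, pass_id, func_seq)
+        return random.Random(seed)
+
+Two RNGs built from the same triple produce the same stream, so a failure seen
+with one global seed can be replayed for a single pass and function.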
+
+Subzero-generated code is subject to diversification once per translation, and
+then Chrome caches the diversified binary for subsequent executions.  An
+attacker may attempt to run the binary multiple times hoping for
+higher-probability combinations of ROP gadgets.  When the attacker guesses
+wrong, a likely outcome is an application crash.  Chrome throttles creation of
+crashy processes which reduces the likelihood of the attacker eventually gaining
+a foothold.
+
+Constant blinding
+~~~~~~~~~~~~~~~~~
+
+Here, we prevent attackers from controlling large immediates in the text
+(executable) section.  A random cookie is generated for each function, and if
+the constant exceeds a specified threshold, the constant is obfuscated with the
+cookie and equivalent code is generated.  For example, instead of this x86
+instruction::
+
+    mov $0x11223344, <%Reg/Mem>
+
+the following code might be generated::
+
+    mov $(0x11223344+Cookie), %temp
+    lea -Cookie(%temp), %temp
+    mov %temp, <%Reg/Mem>
+
+The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
+prevent unintended effects on the flags register.
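+
+The arithmetic behind the example can be checked with a short Python sketch.
+The helper names ``blind`` and ``unblind`` are hypothetical, and 32-bit
+wraparound is assumed::
+
+    MASK32 = 0xFFFFFFFF
+
+    def blind(constant, cookie):
+        """Immediate that appears in the text section."""
+        return (constant + cookie) & MASK32
+
+    def unblind(blinded, cookie):
+        """Value left in the temporary after subtracting the cookie."""
+        return (blinded - cookie) & MASK32
+
+The round trip holds even when the addition wraps around, so the original
+constant never has to appear as an immediate.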
+
+This transformation has almost no effect on translation time, and about 1%
+impact on code quality, depending on the threshold chosen.  It does little to
+reduce gadget persistence, but it does remove a lot of potential opportunities
+to construct intra-instruction ROP gadgets (which an attacker could use only if
+a validator bug were discovered, since the Native Client sandbox and associated
+validator force returns and other indirect branches to be to bundle-aligned
+addresses).
+
+Constant pooling
+~~~~~~~~~~~~~~~~
+
+This is similar to constant blinding, in that large immediates are removed from
+the text section.  In this case, each unique constant above the threshold is
+stored in a read-only data section and the constant is accessed via a memory
+load.  For the above example, the following code might be generated::
+
+    mov $Label$1, %temp
+    mov %temp, <%Reg/Mem>
+
+This has a similarly small impact on translation time and ROP gadget
+persistence, and a smaller (better) impact on code quality.  This is because it
+uses fewer instructions, and in some cases good register allocation leads to no
+increase in instruction count.  Note that this still gives an attacker some
+limited amount of control over some text section values, unless we randomize the
+constant pool layout.
+
+Static data reordering
+~~~~~~~~~~~~~~~~~~~~~~
+
+This transformation limits the attacker's ability to control bits in global data
+address references.  It simply permutes the order in memory of global variables
+and internal constant pool entries.  For the constant pool, we only permute
+within a type (i.e., emit a randomized list of ints, followed by a randomized
+list of floats, etc.) to maintain good packing in the face of alignment
+constraints.
+
+As might be expected, this has no impact on code quality, translation time, or
+ROP gadget persistence (though as above, it limits opportunities for
+intra-instruction ROP gadgets with a broken validator).
+
+Basic block reordering
+~~~~~~~~~~~~~~~~~~~~~~
+
+Here, we randomize the order of basic blocks within a function, with the
+constraint that we still want to maintain a topological order as much as
+possible, to avoid making the code too branchy.
+
+This has no impact on code quality, and about 1% impact on translation time, due
+to a separate pass to recompute layout.  It ends up having a huge effect on ROP
+gadget persistence, tied for best with nop insertion, reducing ROP gadget
+persistence to less than 5%.
+
+Function reordering
+~~~~~~~~~~~~~~~~~~~
+
+Here, we permute the order that functions are emitted, primarily to shift ROP
+gadgets around to less predictable locations.  It may also change call address
+offsets in case the attacker was trying to control that offset in the code.
+
+To control latency and memory footprint, we don't arbitrarily permute functions.
+Instead, for some relatively small value of N, we queue up N assembler buffers,
+and then emit the N functions in random order, and repeat until all functions
+are emitted.
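+
+As a rough Python sketch of this windowed shuffle (``emit_order`` and its
+parameters are hypothetical names, and the final window may hold fewer than N
+buffers)::
+
+    import random
+
+    def emit_order(buffers, window, rng):
+        """Shuffle buffers within consecutive windows of size `window`."""
+        out = []
+        for start in range(0, len(buffers), window):
+            batch = list(buffers[start:start + window])
+            rng.shuffle(batch)        # random order inside the window only
+            out.extend(batch)
+        return out
+
+A function never moves further than the window it was queued in, which is what
+bounds both the emission latency and the number of buffers held in memory.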
+
+Function reordering has no impact on translation time or code quality.
+Measurements indicate that it reduces ROP gadget persistence to about 15%.
+
+Nop insertion
+~~~~~~~~~~~~~
+
+This diversification randomly adds a nop instruction after each regular
+instruction, with some probability.  Nop instructions of different lengths may
+be selected.  Nop instructions are never added inside a bundle_lock region.
+Note that when sandboxing is enabled, nop instructions are already being added
+for bundle alignment, so the diversification nop instructions may simply be
+taking the place of alignment nop instructions, though distributed differently
+through the bundle.
+
+In Subzero's current implementation, nop insertion adds 3-5% to the
+translation time, but this is probably because it is implemented as a separate
+pass that adds actual nop instructions to the IR.  The overhead would probably
+be a lot less if it were integrated into the assembler pass.  The code quality
+is also reduced by 3-5%, making nop insertion the most expensive of the
+diversification techniques.
+
+Nop insertion is very effective in reducing ROP gadget persistence, at the same
+level as basic block reordering (less than 5%).  But given nop insertion's
+impact on translation time and code quality, one would most likely prefer to use
+basic block reordering instead (though the combined effects of the different
+diversification techniques have not yet been studied).
+
+Register allocation randomization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this diversification, the register allocator tries to make different but
+mostly functionally equivalent choices, while maintaining stable code quality.
+
+A naive approach would be the following.  Whenever the allocator has more than
+one choice for assigning a register, choose randomly among those options.  And
+whenever there are no registers available and there is a tie for the
+lowest-weight variable, randomly select one of the lowest-weight variables to
+evict.  Because of the one-pass nature of the linear-scan algorithm, this
+randomization strategy can have a large impact on which variables are ultimately
+assigned registers, with a corresponding large impact on code quality.
+
+Instead, we choose an approach that tries to keep code quality stable regardless
+of the random seed.  We partition the set of physical registers into equivalence
+classes.  If a register is pre-colored in the function (i.e., referenced
+explicitly by name), it forms its own equivalence class.  The remaining
+registers are partitioned according to their combination of attributes such as
+integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
+callee-saved, etc.  Each equivalence class is randomly permuted, and the
+complete permutation is applied to the final register assignments.
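+
+The partition-and-permute step might look like the following Python sketch.
+The attribute tuples and the name ``randomize_registers`` are illustrative
+assumptions, not Subzero's actual data structures::
+
+    import random
+    from collections import defaultdict
+
+    def randomize_registers(registers, precolored, rng):
+        """registers maps name -> attribute tuple; returns old -> new."""
+        classes = defaultdict(list)
+        for name, attrs in registers.items():
+            # A pre-colored register forms a class of its own.
+            key = (True, name) if name in precolored else (False, attrs)
+            classes[key].append(name)
+        mapping = {}
+        for members in classes.values():
+            shuffled = list(members)
+            rng.shuffle(shuffled)     # permute within the class only
+            mapping.update(zip(members, shuffled))
+        return mapping
+
+Since a register is only ever swapped with one that has identical attributes,
+the allocator's decisions, and therefore code quality, are unaffected by the
+seed.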
+
+Register randomization reduces ROP gadget persistence to about 10% on average,
+though there tends to be fairly high variance across functions and applications.
+This probably has to do with the set of restrictions in the x86-32 instruction
+set and ABI, such as few general-purpose registers, ``%eax`` used for return
+values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
+intended, register randomization has no impact on code quality, and a slight
+(0.5%) impact on translation time due to an extra scan over the variables to
+identify pre-colored registers.
+
+Fuzzing
+^^^^^^^
+
+We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
+combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
+<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
+find and fix cases where Subzero crashes or otherwise ungracefully fails on
+unexpected inputs, and to do so automatically over a large range of unexpected
+inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
+of an attacker exploiting these bugs.
+
+Most of the problems found so far are ones most appropriately handled in the
+parser.  However, there have been a couple that have identified problems in the
+lowering, or otherwise inappropriately triggered assertion failures and fatal
+errors.  We continue to dig into this area.
+
+Future security work
+^^^^^^^^^^^^^^^^^^^^
+
+Subzero is well-positioned to explore other future security enhancements, e.g.:
+
+- Tightening the Native Client sandbox.  ABI changes, such as the previous work
+  on `hiding the sandbox base address
+  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
+  in x86-64, are easy to experiment with in Subzero.
+
+- Making the executable code section read-only.  This would prevent a PNaCl
+  application from inspecting its own binary and trying to find ROP gadgets even
+  after code diversification has been performed.  It may still be susceptible to
+  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
+  security is still improved overall.
+
+- Instruction selection diversification.  It may be possible to lower a given
+  instruction in several largely equivalent ways, which gives more opportunities
+  for code randomization.
+
+Chrome integration
+------------------
+
+Currently Subzero is available in Chrome for the x86-32 architecture, but under
+a flag.  When the flag is enabled, Subzero is used when the `manifest file
+<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
+linking to the ``.pexe`` file specifies the ``O0`` optimization level.
+
+The next step is to remove the flag, i.e. invoke Subzero as the only translator
+for ``O0``-specified manifest files.
+
+Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
+case Subzero could be used for all PNaCl translation.
+
+Command line options
+--------------------
+
+Subzero has a number of command-line options for debugging and diagnostics.
+Among the more interesting are the following.
+
+- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
+  diagnostic output, with various levels of detail after each pass.  Instruction
+  numbers can be printed or suppressed.  Deleted instructions can be printed or
+  suppressed (they are retained in the instruction list, as discussed earlier,
+  because they can help explain how lower-level instructions originated).
+  Liveness information can be printed when available.  Details of register
+  allocation can be printed as register allocator decisions are made.  And more.
+
+- Running Subzero with any level of verbosity produces an enormous amount of
+  output.  When debugging a single function, the ``-verbose-focus`` flag
+  suppresses verbose output except for the specified function.
+
+- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
+  at exit.  Timing markers can be placed in the Subzero source code to demarcate
+  logical operations or passes of interest.  Basic timing information plus
+  call-stack type timing information is printed at the end.
+
+- Along with ``-timing``, the user can instead get a report on the overall
+  translation time for each function, to help focus on timing outliers.  Also,
+  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
+  instead of aggregating pass timing across all functions.
+
+- The ``-szstats`` option reports various statistics on each function, such as
+  stack frame size, static instruction count, etc.  It may be helpful to track
+  these stats over time as Subzero is improved, as an approximate measure of
+  code quality.
+
+- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
+  output, annotates the assembly output with register-focused liveness
+  information.  In particular, each basic block is annotated with which
+  registers are live-in and live-out, and each instruction is annotated with
+  which registers' and stack locations' live ranges end at that instruction.
+  This is really useful when studying the generated code to find opportunities
+  for code quality improvements.
+
+Testing and debugging
+---------------------
+
+LLVM lit tests
+^^^^^^^^^^^^^^
+
+For basic testing, Subzero uses LLVM's `lit
+<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
+have a suite of hundreds of small functions where we test for particular
+assembly code patterns across different target architectures.
+
+Cross tests
+^^^^^^^^^^^
+
+Unfortunately, the lit tests don't do a great job of precisely testing the
+correctness of the output.  Much better are the cross tests, which are execution
+tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
+variety of interesting inputs.  Each cross test consists of a set of C, C++,
+and/or low-level bitcode files.  The C and C++ source files are compiled down to
+bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
+Subzero mangles global symbol names with a special prefix to avoid duplicate
+symbol errors.  A driver program invokes both versions on a large set of
+interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
+differ.  Cross tests turn out to be an excellent way of testing the basic
+lowering patterns, but they are less useful for testing more global things like
+liveness analysis and register allocation.
+
+Bisection debugging
+^^^^^^^^^^^^^^^^^^^
+
+Sometimes with a new application, Subzero will end up producing incorrect code
+that either crashes at runtime or otherwise produces the wrong results.  When
+this happens, we need to narrow it down to a single function (or small set of
+functions) that yields incorrect behavior.  For this, we have a bisection
+debugging framework.  Here, we initially translate the entire application once
+with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
+selectively weaken symbols based on a whitelist or blacklist provided on the
+command line.  The two object files can then be linked together without link
+errors, with the desired version of each method "winning".  Then the binary is
+tested, and bisection proceeds based on whether the binary produces correct
+output.
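+
+The search itself can be sketched in Python, under the simplifying assumption
+that exactly one function is at fault.  Here ``passes_with_subzero`` is a
+hypothetical oracle that builds a binary taking the given functions from
+Subzero and all others from ``pnacl-llc``, runs it, and reports whether the
+output is correct::
+
+    def bisect_failure(functions, passes_with_subzero):
+        """Return the function whose Subzero translation breaks the binary."""
+        candidates = list(functions)
+        while len(candidates) > 1:
+            mid = len(candidates) // 2
+            half = candidates[:mid]
+            if passes_with_subzero(half):
+                # The first half is fine; the culprit is in the rest.
+                candidates = candidates[mid:]
+            else:
+                candidates = half
+        return candidates[0]
+
+Each round halves the candidate set, so the number of link-and-test cycles
+grows only logarithmically with the number of functions.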
+
+When the bisection completes, we are left with a minimal set of
+Subzero-translated functions that cause the failure.  Usually it is a single
+function, though sometimes it might require a combination of several functions
+to cause a failure; this may be due to an incorrect call ABI, for example.
+However, Murphy's Law implies that the single failing function is enormous and
+impractical to debug.  In that case, we can restart the bisection, explicitly
+blacklisting the enormous function, and try to find another candidate to debug.
+(Future work is to automate this to find all minimal sets of functions, so that
+debugging can focus on the simplest example.)
+
+Fuzz testing
+^^^^^^^^^^^^
+
+As described above, we try to find internal Subzero bugs using fuzz testing
+techniques.
+
+Sanitizers
+^^^^^^^^^^
+
+Subzero can be built with `AddressSanitizer
+<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
+<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
+done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
+multithreading has been simple enough that TSan hasn't found any bugs, but ASan
+has found at least one memory leak which was subsequently fixed.
+`UndefinedBehaviorSanitizer
+<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
+(UBSan) support is in progress.  `Control flow integrity sanitization
+<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
+consideration.
+
+Current status
+==============
+
+Target architectures
+--------------------
+
+Subzero is currently more or less complete for the x86-32 target.  It has been
+refactored and extended to handle x86-64 as well, and that is mostly complete at
+this point.
+
+ARM32 work is in progress.  It currently lacks the testing level of x86, at
+least in part because Subzero's register allocator needs modifications to handle
+ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit
+register is actually a gang of two consecutive and aligned 32-bit registers, and
+a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
+ARM64 work has not started; when it does, it will be native-only since the
+Native Client sandbox model, validator, and other tools have never been defined.
+
+An external contributor is adding MIPS support, for the most part by following
+the ARM work.
+
+Translator performance
+----------------------
+
+Single-threaded translation speed is currently about 5× the ``pnacl-llc``
+translation speed.  For a large ``.pexe`` file, the time breaks down as:
+
+- 11% for parsing and initial IR building
+
+- 4% for emitting to /dev/null
+
+- 27% for liveness analysis (two liveness passes plus live range construction)
+
+- 15% for linear-scan register allocation
+
+- 9% for basic lowering
+
+- 10% for advanced phi lowering
+
+- ~11% for other minor analysis
+
+- ~10% measurement overhead to acquire these numbers
+
+Some improvements could undoubtedly be made, but it will be hard to increase the
+speed to 10× of ``pnacl-llc`` while keeping acceptable code quality.  With
+``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
+``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
+
+Code quality
+------------
+
+Measured across 16 components of spec2k, Subzero's code quality is uniformly
+better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
+between ``pnacl-llc`` ``-O0`` and ``-O2``.
+
+Translator size
+---------------
+
+When built in MINIMAL mode, the x86-64 native translator size for the x86-32
+target is about 700 KB, not including the size of functions referenced in
+dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
+MB, and it is statically linked and also includes nop padding for bundling as
+well as indirect branch masking.
+
+Translator memory footprint
+---------------------------
+
+It's hard to draw firm conclusions about memory footprint, since the footprint
+is at least proportional to the input function size, and there is no real limit
+on the size of functions in the ``.pexe`` file.
+
+That said, we looked at the memory footprint over time as Subzero translated
+``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
+disposal.  One of LLVM's libraries that Subzero uses can report the current
+malloc heap usage.  With single-threaded translation, Subzero tends to hover
+around 15 MB of memory usage.  There are a couple of monstrous functions where
+Subzero grows to around 100 MB, but then it drops back down after those
+functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
+larger throughout translation, reaching several hundred MB by the time it
+completes.
+
+It's a bit more interesting when we enable multithreaded translation.  When
+there are N translation threads, Subzero implements a policy that limits the
+size of the translation queue to N entries -- if it is "full" when the parser
+tries to add a new CFG, the parser blocks until one of the translation threads
+removes a CFG.  This means the number of in-memory CFGs can (and generally does)
+reach 2*N+1, and so the memory footprint rises in proportion to the number of
+threads.  Adding to the pressure is the observation that the monstrous functions
+also take proportionally longer time to translate, so there's a good chance many
+of the monstrous functions will be active at the same time with multithreaded
+translation.  As a result, for N=32, Subzero's memory footprint peaks at about
+260 MB, but drops back down as the large functions finish translating.
+
+If this peak memory size becomes a problem, it might be possible for the parser
+to resequence the functions to try to spread out the larger functions, or to
+throttle the translation queue to prevent too many in-flight large functions.
+It may also be possible to throttle based on memory pressure signaling from
+Chrome.
+
+Translator scalability
+----------------------
+
+Currently scalability is "not very good".  Multiple translation threads lead to
+faster translation, but not to the degree desired.  We haven't dug in to
+investigate yet.
+
+There are a few areas to investigate.  First, there may be contention on the
+constant pool, which all threads access, and which requires locked access even
+for reading.  This could be mitigated by keeping a CFG-local cache of the most
+common constants.
+
+Second, there may be contention on memory allocation.  While almost all CFG
+objects are allocated from the CFG-local allocator, some passes use temporary
+STL containers that use the default allocator, which may require global locking.
+This could be mitigated by switching these to the CFG-local allocator.
+
+Third, multithreading may make the default allocator strategy more expensive.
+In a single-threaded environment, a pass will allocate its containers, run the
+pass, and deallocate the containers.  This results in stack-like allocation
+behavior and makes the heap free list easier to manage, with less heap
+fragmentation.  But when multithreading is added, the allocations and
+deallocations become much less stack-like, making allocation and deallocation
+operations individually more expensive.  Again, this could be mitigated by
+switching these to the CFG-local allocator.
diff --git a/LOWERING.rst b/docs/LOWERING.rst

similarity index 100%

rename from LOWERING.rst

rename to docs/LOWERING.rst
diff --git a/docs/README.rst b/docs/README.rst

new file mode 100644 (file)

index 0000000..0257558
--- /dev/null
+++ b/docs/README.rst
@@ -0,0 +1,192 @@
+Subzero - Fast code generator for PNaCl bitcode
+===============================================
+
+Design
+------
+
+See the accompanying DESIGN.rst file for a more detailed technical overview of
+Subzero.
+
+Building
+--------
+
+Subzero is set up to be built within the Native Client tree.  Follow the
+`Developing PNaCl
+<https://sites.google.com/a/chromium.org/dev/nativeclient/pnacl/developing-pnacl>`_
+instructions, in particular the section on building PNaCl sources.  This will
+prepare the necessary external headers and libraries that Subzero needs.
+Checking out the Native Client project also gets the pre-built clang and LLVM
+tools in ``native_client/../third_party/llvm-build/Release+Asserts/bin`` which
+are used for building Subzero.
+
+The Subzero source is in ``native_client/toolchain_build/src/subzero``.  From
+within that directory, ``git checkout master && git pull`` to get the latest
+version of Subzero source code.
+
+The Makefile is designed to be used as part of the higher level LLVM build
+system.  To build manually, use the ``Makefile.standalone``.  There are several
+build configurations from the command line::
+
+    make -f Makefile.standalone
+    make -f Makefile.standalone DEBUG=1
+    make -f Makefile.standalone NOASSERT=1
+    make -f Makefile.standalone DEBUG=1 NOASSERT=1
+    make -f Makefile.standalone MINIMAL=1
+    make -f Makefile.standalone ASAN=1
+    make -f Makefile.standalone TSAN=1
+
+``DEBUG=1`` builds without optimizations and is good when running the translator
+inside a debugger.  ``NOASSERT=1`` disables assertions and is the preferred
+configuration for performance testing the translator.  ``MINIMAL=1`` attempts to
+minimize the size of the translator by compiling out everything unnecessary.
+``ASAN=1`` enables AddressSanitizer, and ``TSAN=1`` enables ThreadSanitizer.
+
+The result of the ``make`` command is the target ``pnacl-sz`` in the current
+directory.
+
+``pnacl-sz``
+------------
+
+The ``pnacl-sz`` program parses a pexe or an LLVM bitcode file and translates it
+into ICE (Subzero's intermediate representation).  It then invokes the ICE
+translate method to lower it to target-specific machine code, optionally dumping
+the intermediate representation at various stages of the translation.
+
+The program can be run as follows::
+
+    ../pnacl-sz ./path/to/<file>.pexe
+    ../pnacl-sz ./tests_lit/pnacl-sz_tests/<file>.ll
+
+At this time, ``pnacl-sz`` accepts a number of arguments, including the
+following:
+
+    ``-help`` -- Show available arguments and possible values.  (Note: this
+    unfortunately also pulls in some LLVM-specific options that are reported but
+    that Subzero doesn't use.)
+
+    ``-notranslate`` -- Suppress the ICE translation phase, which is useful if
+    ICE is missing some support.
+
+    ``-target=<TARGET>`` -- Set the target architecture.  The default is x8632.
+    Future targets include x8664, arm32, and arm64.
+
+    ``-filetype=obj|asm|iasm`` -- Select the output file type.  ``obj`` is a
+    native ELF file, ``asm`` is a textual assembly file, and ``iasm`` is a
+    low-level textual assembly file demonstrating the integrated assembler.
+
+    ``-O<LEVEL>`` -- Set the optimization level.  Valid levels are ``2``, ``1``,
+    ``0``, ``-1``, and ``m1``.  Levels ``-1`` and ``m1`` are synonyms, and
+    represent the minimum optimization and worst code quality, but fastest code
+    generation.
+
+    ``-verbose=<list>`` -- Set verbosity flags.  This argument allows a
+    comma-separated list of values.  The default is ``none``, and the value
+    ``inst,pred`` will roughly match the .ll bitcode file.  Of particular use
+    are ``all``, ``most``, and ``none``.
+
+    ``-o <FILE>`` -- Set the assembly output file name.  Default is stdout.
+
+    ``-log <FILE>`` -- Set the file name for diagnostic output (whose level is
+    controlled by ``-verbose``).  Default is stdout.
+
+    ``-timing`` -- Dump some pass timing information after translating the input
+    file.
+
+Running the test suite
+----------------------
+
+Subzero uses the LLVM ``lit`` testing tool for part of its test suite, which
+lives in ``tests_lit``. To execute the test suite, first build Subzero, and then
+run::
+
+    make -f Makefile.standalone check-lit
+
+There is also a suite of cross tests in the ``crosstest`` directory.  A cross
+test takes a test bitcode file implementing some unit tests, and translates it
+twice, once with Subzero and once with LLVM's known-good ``llc`` translator.
+The Subzero-translated symbols are specially mangled to avoid multiple
+definition errors from the linker.  Both translated versions are linked together
+with a driver program that calls each version of each unit test with a variety
+of interesting inputs and compares the results for equality.  The cross tests
+are currently invoked by running::
+
+    make -f Makefile.standalone check-xtest
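As an illustration outside the patch itself, the cross-test idea described above (run a known-good version and the version under test on the same inputs, then compare results) can be sketched in a few lines of Python. This is only a conceptual sketch: the names `cross_test`, `add32_reference`, `add32_buggy`, and `INTERESTING` are hypothetical, and the real cross tests link two translated object files with a driver program rather than calling Python functions.

```python
def cross_test(reference, candidate, inputs):
    """Return every input on which the two implementations disagree."""
    mismatches = []
    for args in inputs:
        expected = reference(*args)
        actual = candidate(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


def add32_reference(a, b):
    # Stand-in for the known-good translation: 32-bit wrapping add.
    return (a + b) & 0xFFFFFFFF


def add32_buggy(a, b):
    # Stand-in for a faulty translation: forgets to wrap to 32 bits.
    return a + b


# Boundary values tend to expose differences that typical values hide.
INTERESTING = [(0, 0), (1, 0xFFFFFFFF), (0x7FFFFFFF, 1), (0xFFFFFFFF, 0xFFFFFFFF)]

if __name__ == "__main__":
    for args, expected, actual in cross_test(add32_reference, add32_buggy, INTERESTING):
        print(f"mismatch for {args}: expected {expected:#x}, got {actual:#x}")
```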
+
+Similarly, there is a suite of unit tests::
+
+    make -f Makefile.standalone check-unit
+
+A convenient way to run the lit, cross, and unit tests is::
+
+    make -f Makefile.standalone check
+
+Assembling ``pnacl-sz`` output as needed
+----------------------------------------
+
+``pnacl-sz`` can now produce a native ELF binary using ``-filetype=obj``.
+
+``pnacl-sz`` can also produce textual assembly code in a structure suitable for
+input to ``llvm-mc``, using ``-filetype=asm`` or ``-filetype=iasm``.  An object
+file can then be produced by feeding that assembly to ``llvm-mc``, which reads
+standard input when no input file is named::
+
+    llvm-mc -triple=i686 -filetype=obj -o=MyObj.o
+
+Building a translated binary
+----------------------------
+
+There is a helper script, ``pydir/szbuild.py``, that translates a finalized pexe
+into a fully linked executable.  Run it with ``-help`` for extensive
+documentation.
+
+By default, ``szbuild.py`` builds an executable using only Subzero translation,
+but it can also be used to produce hybrid Subzero/``llc`` binaries (``llc`` is
+the name of the LLVM translator) for bisection-based debugging.  In bisection
+debugging mode, the pexe is translated using both Subzero and ``llc``, and the
+resulting object files are combined into a single executable using symbol
+weakening and other linker tricks to control which Subzero symbols and which
+``llc`` symbols take precedence.  This is controlled by the ``-include`` and
+``-exclude`` arguments.  These can be used to rapidly find a single function
+that Subzero translates incorrectly, leading to incorrect output.
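As an illustration outside the patch itself, the search that bisection debugging performs can be sketched as a binary search over function names. This sketch assumes exactly one function is mistranslated. The callback `is_broken` is hypothetical: it stands for "build a hybrid binary in which only the given functions come from Subzero, run it, and report whether the output is wrong". How that build is driven through ``szbuild.py`` is not shown here.

```python
def find_bad_function(functions, is_broken):
    """Binary-search for the single function whose Subzero translation fails.

    `is_broken(subset)` is a hypothetical oracle: True when a binary built
    with only `subset` translated by Subzero (the rest by llc) misbehaves.
    """
    candidates = list(functions)
    if not candidates:
        raise ValueError("need at least one function to bisect")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if is_broken(half) else candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    names = [f"f{i}" for i in range(7)]
    # Pretend f5 is the mistranslated function.
    print(find_bad_function(names, lambda subset: "f5" in subset))
```

Each round halves the candidate set, so the number of hybrid builds grows with the logarithm of the function count rather than linearly.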
+
+There is another helper script, ``pydir/szbuild_spec2k.py``, that runs
+``szbuild.py`` on one or more components of the Spec2K suite.  This assumes that
+Spec2K is set up in the usual place in the Native Client tree, and the finalized
+pexe files have been built.  (Note: for working with Spec2K and other pexes,
+it's helpful to finalize the pexe using ``--no-strip-syms``, to preserve the
+original function and global variable names.)
+
+Status
+------
+
+Subzero currently fully supports the x86-32 architecture, for both native and
+Native Client sandboxing modes.  The x86-64 architecture is also supported in
+native mode only, and only for the x32 flavor, because pointers and 32-bit
+integers are indistinguishable in PNaCl bitcode.  Sandboxing support for
+x86-64 is in progress.  ARM and MIPS support is in progress.  Two optimization
+levels, ``-Om1`` and ``-O2``, are implemented.
+
+The ``-Om1`` configuration is designed to be the simplest and fastest possible,
+with a minimal set of passes and transformations.
+
+* Simple Phi lowering before target lowering, by generating temporaries and
+  adding assignments to the end of predecessor blocks.
+
+* Simple register allocation limited to pre-colored or infinite-weight
+  Variables.
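As an illustration outside the patch itself, the simple Phi lowering described in the first bullet can be sketched on a toy IR. This is not Subzero's code: the `Block` class, the string instructions, and the `_phi` naming scheme are assumptions made for the example. Each phi gets a fresh temporary, every predecessor assigns its incoming value to that temporary at the end of its body (just before the terminator), and the phi's block copies the temporary into the destination.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    phis: list = field(default_factory=list)  # (dest, {pred_name: source})
    body: list = field(default_factory=list)  # non-terminator instructions
    terminator: str = "ret"


def lower_phis(blocks):
    """Replace each phi with a temporary assigned at the end of each predecessor."""
    by_name = {b.name: b for b in blocks}
    counter = 0
    for block in blocks:
        copies = []
        for dest, incoming in block.phis:
            temp = f"{dest}_phi{counter}"
            counter += 1
            for pred_name, source in incoming.items():
                # Appending to the body places the assignment before the terminator.
                by_name[pred_name].body.append(f"{temp} = {source}")
            copies.append(f"{dest} = {temp}")
        block.body = copies + block.body
        block.phis = []
    return blocks


if __name__ == "__main__":
    cfg = [
        Block("left", body=["a = 1"], terminator="br join"),
        Block("right", body=["b = 2"], terminator="br join"),
        Block("join", phis=[("x", {"left": "a", "right": "b"})], body=["use x"]),
    ]
    for blk in lower_phis(cfg):
        print(blk.name, blk.body, blk.terminator)
```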
+
+The ``-O2`` configuration is designed to use all optimizations available and
+produce the best code.
+
+* Address mode inference to leverage the complex x86 addressing modes.
+
+* Compare/branch fusing based on liveness/last-use analysis.
+
+* Global, linear-scan register allocation.
+
+* Advanced phi lowering after target lowering and global register allocation,
+  via edge splitting, topological sorting of the parallel moves, and final local
+  register allocation.
+
+* Stack slot coalescing to reduce frame size.
+
+* Branch optimization to reduce the number of branches to the following block.
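As an illustration outside the patch itself, the "topological sorting of the parallel moves" step in the advanced phi lowering bullet can be sketched as follows. Phi assignments on an edge happen conceptually in parallel, so a sequential order must never overwrite a location that another pending move still needs to read. This sketch is a generic version of that idea, not Subzero's implementation. The function name and the spare location `tmp` are assumptions for the example.

```python
def sequence_parallel_moves(moves, temp="tmp"):
    """Order parallel moves {dest: src} so they can be executed one at a time.

    Returns a list of (dest, src) pairs. `temp` is an assumed free location
    that is used only to break cycles such as a swap.
    """
    pending = {d: s for d, s in moves.items() if d != s}  # drop no-op moves
    ordered = []
    while pending:
        # A move is safe once no other pending move still reads its destination.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            for d in ready:
                ordered.append((d, pending.pop(d)))
        else:
            # Only cycles remain: save one destination in the temporary and
            # make its readers read the temporary instead.
            d = next(iter(pending))
            ordered.append((temp, d))
            pending = {k: (temp if v == d else v) for k, v in pending.items()}
    return ordered


if __name__ == "__main__":
    print(sequence_parallel_moves({"a": "b", "b": "c"}))  # simple chain
    print(sequence_parallel_moves({"a": "b", "b": "a"}))  # swap needs the temporary
```

A chain needs no temporary because the moves can be ordered so each source is read before it is overwritten. A cycle has no such order, which is why one value is parked in the spare location first.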
.gitignore
DESIGN.rst	[changed from file to symlink]
Makefile.standalone
README.rst	[changed from file to symlink]
docs/ALLOCATION.rst	[moved from ALLOCATION.rst with 100% similarity]
docs/DESIGN.rst	[new file with mode: 0644]
docs/LOWERING.rst	[moved from LOWERING.rst with 100% similarity]
docs/README.rst	[new file with mode: 0644]