Subzero: Add a new document describing the register allocator.

author Jim Stichnoth <stichnot@chromium.org>

Wed, 1 Feb 2017 17:03:02 +0000 (09:03 -0800)

committer Jim Stichnoth <stichnot@chromium.org>

Wed, 1 Feb 2017 17:03:02 +0000 (09:03 -0800)
diff --git a/docs/REGALLOC.rst b/docs/REGALLOC.rst

new file mode 100644

index 0000000..99c6939
--- /dev/null
+++ b/docs/REGALLOC.rst
@@ -0,0 +1,411 @@
+Register allocation in Subzero
+==============================
+
+Introduction
+------------
+
+`Subzero
+<https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_
+is a fast code generator that translates architecture-independent `PNaCl bitcode
+<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into
+architecture-specific machine code.  PNaCl bitcode is LLVM bitcode that has been
+simplified (e.g. weird-width primitive types like 57-bit integers are not
+allowed) and has had architecture-independent optimizations already applied.
+Subzero aims to generate high-quality code as fast as practical, and as such
+Subzero needs to make tradeoffs between compilation speed and output code
+quality.
+
+In Subzero, we have found register allocation to be one of the most important
+optimizations toward achieving the best code quality, which is in tension with
+register allocation's reputation for being complex and expensive.  Linear-scan
+register allocation is a modern favorite for getting fairly good register
+allocation at relatively low cost.  Subzero uses linear-scan for its core
+register allocator, with a few internal modifications to improve its
+effectiveness (specifically: register preference, register overlap, and register
+aliases).  Subzero also does a few high-level things on top of its core register
+allocator to improve overall effectiveness (specifically: repeat until
+convergence, delayed phi lowering, and local live range splitting).
+
+What we describe here are techniques that have worked well for Subzero, in the
+context of its particular intermediate representation (IR) and compilation
+strategy.  Some of these techniques may not be applicable to another compiler,
+depending on its particular IR and compilation strategy.  Some key concepts in
+Subzero are the following:
+
+- Subzero's ``Variable`` operand is an operand that resides either on the stack
+  or in a physical register.  A Variable can be tagged as *must-have-register*
+  or *must-not-have-register*, but its default is *may-have-register*.  All uses
+  of the Variable map to the same physical register or stack location.
+
+- Basic lowering is done before register allocation.  Lowering is the process of
+  translating PNaCl bitcode instructions into native instructions.  Sometimes a
+  native instruction, like the x86 ``add`` instruction, allows one of its
+  Variable operands to be either in a physical register or on the stack, in
+  which case the lowering is relatively simple.  But if the lowered instruction
+  requires the operand to be in a physical register, we generate code that
+  copies the Variable into a *must-have-register* temporary, and then use that
+  temporary in the lowered instruction.
+
+- Instructions within a basic block appear in a linearized order (as opposed to
+  something like a directed acyclic graph of dependency edges).
+
+- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually
+  small) number of *source* Variables.  Assuming SSA form, the instruction
+  begins the live range of the dest Variable, and may end the live range of one
+  or more of the source Variables.
+
+Executive summary
+-----------------
+
+- Liveness analysis and live range construction are prerequisites for register
+  allocation.  Without careful attention, they can become very expensive,
+  especially when the number of variables and basic blocks gets very large.
+  Subzero exploits the sparsity of the data (most variables are live in only a
+  single basic block) to keep performance stable as function size scales up:
+  the proportion of compilation time spent on liveness analysis stays roughly
+  the same.
+
+- The core of Subzero's register allocator is taken from `Mössenböck and
+  Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on
+  linear-scan register allocation.
+
+- We enhance the core algorithm with a good automatic preference mechanism when
+  more than one register is available, to try to minimize register shuffling.
+
+- We also enhance it to allow overlapping live ranges to share the same
+  register, when one variable is recognized as a read-only copy of the other.
+
+- We deal with register aliasing in a clean and general fashion.  Register
+  aliasing is when e.g. 16-bit architectural registers share some bits with
+  their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit
+  registers.
+
+- We improve register allocation by rerunning the algorithm on likely candidates
+  that didn't get registers during the previous iteration, without imposing much
+  additional cost.
+
+- The main register allocation is done before phi lowering, because phi lowering
+  imposes early and unnecessary ordering constraints on the resulting
+  assignments, which create spurious interferences in live ranges.
+
+- Within each basic block, we aggressively split each variable's live range at
+  every use, so that portions of the live range can get registers even if the
+  whole live range can't.  Doing this separately for each basic block avoids
+  merge complications, and keeps liveness analysis and register allocation fast
+  by fitting well into the sparsity optimizations of their data structures.
+
+Liveness analysis
+-----------------
+
+With respect to register allocation, the main purpose of liveness analysis is to
+calculate the live range of each variable.  The live range is represented as a
+set of instruction number ranges.  Instruction numbers within a basic block must
+be monotonically increasing, and the instruction ranges of two different basic
+blocks must not overlap.
+
+Basic liveness
+^^^^^^^^^^^^^^
+
+Liveness analysis is a straightforward dataflow algorithm.  For each basic
+block, we keep track of the live-in and live-out set, i.e. the set of variables
+that are live coming into or going out of the basic block.  Processing of a
+basic block starts by initializing a temporary set as the union of the live-in
+sets of the basic block's successor blocks.  (This basic block's live-out set is
+captured as the initial value of the temporary set.)  Then each instruction of
+the basic block is processed in reverse order.  All the source variables of the
+instruction are marked as live, by adding them to the temporary set, and the
+dest variable of the instruction (if any) is marked as not-live, by removing it
+from the temporary set.  When we finish processing all of the block's
+instructions, we add/union the temporary set into the basic block's live-in set.
+If this changes the basic block's live-in set, then we mark all of this basic
+block's predecessor blocks to be reprocessed.  Then we repeat for other basic
+blocks until convergence, i.e. no more basic blocks are marked to be
+reprocessed.  If basic blocks are processed in reverse topological order, then
+the number of times each basic block needs to be reprocessed is generally its
+loop nest depth.
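+
+As a rough illustration of the dataflow loop described above, here is a minimal
+Python sketch.  It is not Subzero's implementation: the function name
+``liveness``, the ``(dest, sources)`` instruction tuples, the block names, and
+the use of Python sets in place of bit vectors are all assumptions made for the
+example.
+
```python
from collections import deque


def liveness(blocks, succs):
    """Backward liveness over a CFG (illustrative sketch only).

    blocks: {name: [(dest_or_None, [source, ...]), ...]} in instruction order.
    succs:  {name: [successor name, ...]}.
    Returns (live_in, live_out), each a {name: set_of_variables}.
    """
    preds = {b: [] for b in blocks}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)

    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}

    # Assumption: the caller lists blocks in topological order, so reversing
    # the listing approximates reverse topological order.
    worklist = deque(reversed(list(blocks)))
    queued = set(blocks)

    while worklist:
        b = worklist.popleft()
        queued.discard(b)

        # The temporary set starts as the union of the successors' live-in
        # sets, which is also this block's live-out set.
        live = set()
        for s in succs.get(b, []):
            live |= live_in[s]
        live_out[b] = set(live)

        # Walk instructions backward: kill the dest, then add the sources.
        for dest, sources in reversed(blocks[b]):
            if dest is not None:
                live.discard(dest)
            live.update(sources)

        # If live-in grows, the predecessors must be reprocessed.
        if not live <= live_in[b]:
            live_in[b] |= live
            for p in preds[b]:
                if p not in queued:
                    queued.add(p)
                    worklist.append(p)

    return live_in, live_out


# Toy CFG: entry -> loop -> (loop | exit)
blocks = {
    "entry": [("a", []), ("i", [])],
    "loop": [("t", ["a", "i"]), (None, ["t"])],
    "exit": [(None, ["a"])],
}
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succs)
```
+
+In this toy CFG the ``loop`` block is reprocessed once because of its own back
+edge, which is consistent with the loop-nest-depth observation above.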
+
+The result of this pass is the live-in and live-out set for each basic block.
+
+With so many set operations, choice of data structure is crucial for
+performance.  We tried a few options, and found that a simple dense bit vector
+works best.  This keeps the per-instruction cost very low.  However, we found
+that when the function gets very large, merging and copying bit vectors at basic
+block boundaries dominates the cost.  This is due to the number of variables
+(hence the bit vector size) and the number of basic blocks getting large.
+
+A simple enhancement brought this under control in Subzero.  It turns out that
+the vast majority of variables are referenced, and therefore live, only in a
+single basic block.  This is largely due to the SSA form of PNaCl bitcode.  To
+take advantage of this, we partition variables by single-block versus
+multi-block liveness.  Multi-block variables get lower-numbered bit vector
+indexes, and single-block variables get higher-numbered indexes.  Single-block bit
+vector indexes are reused across different basic blocks.  As such, the size of
+live-in and live-out bit vectors is limited to the number of multi-block
+variables, and the temporary set's size can be limited to that plus the largest
+number of single-block variables across all basic blocks.
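+
+The index partitioning can be sketched as follows.  This is only a sketch under
+stated assumptions: the function name ``assign_liveness_indexes`` and the
+``var_blocks`` input (a map from each variable to the set of blocks that
+reference it) are hypothetical, and variables referenced by no block are simply
+ignored.
+
```python
def assign_liveness_indexes(var_blocks):
    """Assign bit vector indexes, multi-block variables first.

    var_blocks: {variable: set of block names that reference it}.
    Returns (index, num_multi, temp_size) where
      index      maps each variable to its bit vector index,
      num_multi  is the size needed for live-in/live-out vectors,
      temp_size  is the size needed for the per-block temporary set.
    """
    multi = sorted(v for v, bs in var_blocks.items() if len(bs) > 1)
    index = {v: i for i, v in enumerate(multi)}
    num_multi = len(multi)

    # Single-block variables reuse the same index range in every block,
    # starting right after the multi-block indexes.
    next_local = {}
    for v in sorted(var_blocks):
        bs = var_blocks[v]
        if len(bs) == 1:
            (b,) = bs
            n = next_local.get(b, num_multi)
            index[v] = n
            next_local[b] = n + 1

    temp_size = max(next_local.values(), default=num_multi)
    return index, num_multi, temp_size


var_blocks = {
    "a": {"B0", "B1"},   # multi-block
    "t0": {"B0"},        # single-block, in B0
    "t1": {"B0"},        # single-block, in B0
    "u0": {"B1"},        # single-block, in B1
}
index, num_multi, temp_size = assign_liveness_indexes(var_blocks)
```
+
+Here ``t0`` and ``u0`` share index 1 because they live in different blocks, so
+the live-in and live-out vectors need only one bit while the temporary set
+needs three.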
+
+For the purpose of live range construction, we also need to track definitions
+(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the
+basic block.  These are easy to detect while processing the instructions; data
+structure choices are described below.
+
+Live range construction
+^^^^^^^^^^^^^^^^^^^^^^^
+
+After the live-in and live-out sets are calculated, we construct each variable's
+live range (as an ordered list of instruction ranges, described above).  We do
+this by considering the live-in and live-out sets, combined with LiveBegin and
+LiveEnd information.  This is done separately for each basic block.
+
+As before, we need to take advantage of sparsity of variable uses across basic
+blocks, to avoid overly copying/merging data structures.  The following is what
+worked well for Subzero (after trying several other options).
+
+The basic liveness pass, described above, keeps track of when a variable's live
+range begins or ends within the block.  LiveBegin and LiveEnd are unordered
+vectors where each element is a pair of the variable and the instruction number,
+representing that the particular variable's live range begins or ends at the
+particular instruction.  When the liveness pass finds a variable whose live
+range begins or ends, it appends an entry to LiveBegin or LiveEnd.
+
+During live range construction, the LiveBegin and LiveEnd vectors are sorted by
+variable number.  Then we iterate across both vectors in parallel.  If a
+variable appears in both LiveBegin and LiveEnd, then its live range is entirely
+within this block.  If it appears in only LiveBegin, then its live range starts
+here and extends through the end of the block.  If it appears in only LiveEnd,
+then its live range starts at the beginning of the block and ends here.  (Note
+that this only covers the live range within this block, and this process is
+repeated across all blocks.)
+
+It is also possible that a variable is live within this block but its live range
+does not begin or end in this block.  These variables are identified simply by
+taking the intersection of the live-in and live-out sets.
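+
+The per-block construction described above might look roughly like the
+following sketch.  The function name ``block_live_ranges``, the use of integer
+variable numbers, and the convention that a range is an inclusive
+``(start, end)`` pair of instruction numbers are assumptions for illustration,
+not Subzero's actual representation.
+
```python
def block_live_ranges(first, last, live_in, live_out, live_begin, live_end):
    """Build each variable's live range segment within one block.

    first, last: first and last instruction numbers of the block.
    live_in, live_out: sets of variable numbers.
    live_begin, live_end: unordered lists of (variable, instruction number).
    Returns {variable: (start, end)}.
    """
    begins = sorted(live_begin)   # sort by variable number
    ends = sorted(live_end)
    ranges = {}
    i = j = 0
    # Walk both sorted vectors in parallel, like a merge step.
    while i < len(begins) or j < len(ends):
        vb = begins[i][0] if i < len(begins) else None
        ve = ends[j][0] if j < len(ends) else None
        if ve is None or (vb is not None and vb < ve):
            # Only in LiveBegin: starts here, extends to the end of the block.
            ranges[vb] = (begins[i][1], last)
            i += 1
        elif vb is None or ve < vb:
            # Only in LiveEnd: starts at the beginning of the block.
            ranges[ve] = (first, ends[j][1])
            j += 1
        else:
            # In both: the range is entirely within this block.
            ranges[vb] = (begins[i][1], ends[j][1])
            i += 1
            j += 1

    # Live through the whole block without beginning or ending in it.
    for v in live_in & live_out:
        ranges.setdefault(v, (first, last))
    return ranges


ranges = block_live_ranges(
    first=10, last=20,
    live_in={1, 4}, live_out={2, 4},
    live_begin=[(3, 12), (2, 14)],
    live_end=[(1, 16), (3, 18)],
)
```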
+
+As a result of these data structures, performance of liveness analysis and live
+range construction tends to be stable across small, medium, and large functions,
+as measured by a fairly consistent proportion of total compilation time spent on
+the liveness passes.
+
+Linear-scan register allocation
+-------------------------------
+
+The basis of Subzero's register allocator is the allocator described by
+Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in
+the Context of SSA Form and Register Constraints
+<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  It allows live ranges to
+be a union of intervals rather than a single conservative interval, and it
+allows pre-coloring of variables with specific physical registers.
+
+The paper suggests an approach of aggressively coalescing variables across Phi
+instructions (i.e., trying to force Phi source and dest variables to have the
+same register assignment), but we omit that in favor of the more natural
+preference mechanism described below.
+
+We found the paper quite remarkable in that a straightforward implementation of
+its pseudo-code led to an entirely correct register allocator.  The only thing
+we found in the specification that was even close to a mistake is that it was
+too aggressive in evicting inactive ranges in the "else" clause of the
+AssignMemLoc routine.  An inactive range only needs to be evicted if it actually
+overlaps the current range being considered, whereas the paper evicts it
+unconditionally.  (Search for "original paper" in Subzero's register allocator
+source code.)
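+
+The refined eviction condition relies on an overlap test between two live
+ranges, each a union of intervals.  A minimal sketch follows; the names
+``overlaps`` and ``inactive_to_evict``, the half-open ``(start, end)`` interval
+convention, and the ``inactive`` map (assumed to hold only the inactive ranges
+occupying the register in question) are assumptions for the example rather than
+the paper's or Subzero's code.
+
```python
def overlaps(a, b):
    """Return True if two live ranges share any instruction number.

    a, b: sorted lists of non-overlapping half-open (start, end) intervals.
    """
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][1] <= b[j][0]:
            i += 1          # a's interval ends before b's begins
        elif b[j][1] <= a[i][0]:
            j += 1          # b's interval ends before a's begins
        else:
            return True
    return False


def inactive_to_evict(current, inactive):
    """Pick only the inactive ranges that actually overlap the current one."""
    return [name for name, rng in inactive.items() if overlaps(current, rng)]


current = [(8, 12)]
inactive = {
    "x": [(0, 4), (20, 24)],   # has a hole covering the current range
    "y": [(2, 6), (10, 14)],   # really overlaps the current range
}
evicted = inactive_to_evict(current, inactive)
```
+
+Evicting unconditionally would also evict ``x`` here, even though the current
+range fits entirely inside a hole in ``x``'s live range.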
+
+Register preference
+-------------------
+
+The linear-scan algorithm from the paper talks about choosing an available
+register, but doesn't specify how to choose among several available registers.
+The simplest approach is to just choose the first available register, e.g. the
+lowest-numbered register.  Often a better choice is possible.
+
+Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)``
+with source variables ``S1,S2,...``, and that instruction ends the live range of
+one of those source variables ``Sn``, and ``Sn`` was assigned a register, then
+``Sn``'s register is usually a good choice for ``V``.  This is especially true
+when the instruction is a simple assignment, because an assignment where the
+dest and source variables end up with the same register can be trivially elided,
+reducing the amount of register-shuffling code.
+
+This requires being able to find and inspect a variable's defining instruction,
+which is not an assumption in the original paper but is easily tracked in
+practice.
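+
+A minimal sketch of this preference rule is shown below.  The function name
+``choose_register`` and its parameters are hypothetical, and the sketch assumes
+that a source whose live range ends at the defining instruction has already
+released its register by the time the dest is considered.
+
```python
def choose_register(free_regs, defining_inst, assigned, ends_at_def):
    """Pick a register for the dest of defining_inst.

    free_regs:     ordered list of currently available registers.
    defining_inst: (dest, [source, ...]).
    assigned:      {variable: register} for variables that got a register.
    ends_at_def:   set of sources whose live range ends at this instruction.
    Returns a register name, or None if no register is available.
    """
    _dest, sources = defining_inst
    for s in sources:
        reg = assigned.get(s)
        # Prefer the register of a source that dies at this instruction.
        if s in ends_at_def and reg in free_regs:
            return reg
    # Fall back to the simplest policy: the first available register.
    return free_regs[0] if free_regs else None


free = ["eax", "ecx", "edx"]
inst = ("v", ["s1", "s2"])
assigned = {"s1": "ebx", "s2": "edx"}
preferred = choose_register(free, inst, assigned, ends_at_def={"s2"})
fallback = choose_register(free, inst, assigned, ends_at_def=set())
```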
+
+Register overlap
+----------------
+
+Because Subzero does register allocation after basic lowering, the lowering has
+to be prepared for the possibility of any given program variable not getting a
+physical register.  It does this by introducing *must-have-register* temporary
+variables, and copies the program variable into the temporary to ensure that
+register requirements in the target instruction are met.
+
+In many cases, the program variable and temporary variable have overlapping live
+ranges, and would be forced to have different registers even if the temporary
+variable is effectively a read-only copy of the program variable.  We recognize
+this when the program variable has no definitions within the temporary
+variable's live range, and the temporary variable has no definitions within the
+program variable's live range with the exception of the copy assignment.
+
+This analysis is done as part of register preference detection.
+
+The main impact on the linear-scan implementation is that instead of
+setting/resetting a boolean flag to indicate whether a register is free or in
+use, we increment/decrement a number-of-uses counter.
+
+Register aliases
+----------------
+
+Sometimes registers of different register classes partially overlap.  For
+example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't
+alias each other), and all three alias ``eax`` and ``rax``.  And in ARM,
+registers ``s0`` and ``s1`` (which are single-precision floating-point
+registers) alias ``d0`` (double-precision floating-point), and registers ``d0``
+and ``d1`` alias ``q0`` (128-bit vector register).  The linear-scan paper
+doesn't address this issue; it assumes that all registers are independent.  A
+simple solution is to essentially avoid aliasing by disallowing a subset of the
+registers, but there is obviously a reduction in code quality when e.g. half of
+the registers are taken away.
+
+Subzero handles this more elegantly.  For each register, we keep a bitmask
+indicating which registers alias/conflict with it.  For example, in x86,
+``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and
+in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``.  Whenever we want to
+mark a register as being used or released, we do the same for all of its
+aliases.
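+
+The following sketch combines the alias bitmasks with the number-of-uses
+counters from the register overlap section.  The ``RegUse`` class, its method
+names, and the tiny four-register table (``rax`` is omitted for brevity) are
+assumptions for illustration only.
+
```python
REGS = ["al", "ah", "ax", "eax"]
BIT = {r: 1 << i for i, r in enumerate(REGS)}


def mask(*names):
    """Build a bitmask with one bit set per named register."""
    m = 0
    for n in names:
        m |= BIT[n]
    return m


# Each register's alias set includes itself; al and ah do not alias each other.
ALIASES = {
    "al": mask("al", "ax", "eax"),
    "ah": mask("ah", "ax", "eax"),
    "ax": mask("al", "ah", "ax", "eax"),
    "eax": mask("al", "ah", "ax", "eax"),
}


class RegUse:
    """Tracks a use counter per register, updated across all aliases."""

    def __init__(self):
        self.uses = [0] * len(REGS)

    def _adjust(self, reg, delta):
        m = ALIASES[reg]
        for i in range(len(REGS)):
            if (m >> i) & 1:
                self.uses[i] += delta

    def acquire(self, reg):
        self._adjust(reg, +1)

    def release(self, reg):
        self._adjust(reg, -1)

    def is_free(self, reg):
        return self.uses[REGS.index(reg)] == 0


usage = RegUse()
usage.acquire("ah")
ah_blocks_eax = not usage.is_free("eax")
al_still_free = usage.is_free("al")
```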
+
+Before implementing this, we were a bit apprehensive about the compile-time
+performance impact.  Instead of setting one bit in a bit vector or decrementing
+one counter, this generally needs to be done in a loop that iterates over all
+aliases.  Fortunately, this seemed to make very little difference in
+performance, as the bulk of the cost ends up being in live range overlap
+computations, which are not affected by register aliasing.
+
+Repeat until convergence
+------------------------
+
+Sometimes the linear-scan algorithm makes a register assignment only to later
+revoke it in favor of a higher-priority variable, but it turns out that a
+different initial register choice would not have been revoked.  For relatively
+low compile-time cost, we can give those variables another chance.
+
+During register allocation, we keep track of the revoked variables and then do
+another round of register allocation targeted only to that set.  We repeat until
+no new register assignments are made, which is usually just a handful of
+successively cheaper iterations.
+
+Another approach would be to repeat register allocation for *all* variables that
+haven't had a register assigned (rather than variables that got a register that
+was subsequently revoked), but our experience is that this greatly increases
+compile-time cost, with little or no code quality gain.
+
+Delayed Phi lowering
+--------------------
+
+The linear-scan algorithm works for phi instructions as well as regular
+instructions, but it is tempting to lower phi instructions into non-SSA
+assignments before register allocation, so that register allocation need only
+happen once.
+
+Unfortunately, simple phi lowering imposes an arbitrary ordering on the
+resulting assignments that can cause artificial overlap/interference between
+lowered assignments, and can lead to worse register allocation decisions.  As a
+simple example, consider these two phi instructions which are semantically
+unordered::
+
+  A = phi(B) // B's live range ends here
+  C = phi(D) // D's live range ends here
+
+A straightforward lowering might yield::
+
+  A = B // end of B's live range
+  C = D // end of D's live range
+
+The potential problem here is that A and D's live ranges overlap, and so they
+are prevented from having the same register.  Swapping the order of lowered
+assignments fixes that (but then B and C would overlap), but we can't really
+know which is better until after register allocation.
+
+Subzero deals with this by doing the main register allocation before phi
+lowering, followed by phi lowering, and finally a special register allocation
+pass limited to the new lowered assignments.
+
+Phi lowering considers the phi operands separately for each predecessor edge,
+and starts by finding a topological sort of the Phi instructions, such that
+assignments can be executed in that order without violating dependencies on
+registers or stack locations.  If a topological sort is not possible due to a
+cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can
+become ``T=A;A=B;B=C;C=T``.  The topological order is tuned to favor freeing up
+registers early to reduce register pressure.
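+
+Ordering the assignments for one predecessor edge can be sketched as below.
+This is a simplified sketch, not Subzero's code: the function name
+``sequentialize`` and the temporary name ``T`` are hypothetical, each dest is
+assumed to appear only once, and the tuning that favors freeing registers early
+is not modeled.
+
```python
def sequentialize(copies, temp="T"):
    """Order parallel assignments so that no value is clobbered too early.

    copies: list of (dest, src) pairs that are semantically simultaneous.
    Returns an equivalent list of (dest, src) pairs to execute in order.
    """
    pending = [(d, s) for d, s in copies if d != s]
    out = []
    while pending:
        for k, (d, s) in enumerate(pending):
            # Safe to emit if no other pending assignment still reads d.
            if all(s2 != d for j, (_, s2) in enumerate(pending) if j != k):
                out.append((d, s))
                del pending[k]
                break
        else:
            # Every remaining dest is still read by someone: a cycle.
            # Save one dest in the temporary and redirect its readers.
            d, _ = pending[0]
            out.append((temp, d))
            pending = [(d2, temp if s2 == d else s2) for d2, s2 in pending]
    return out


cycle = sequentialize([("A", "B"), ("B", "C"), ("C", "A")])
chain = sequentialize([("B", "C"), ("A", "B")])
```
+
+For the cycle, the sketch yields ``T=A;A=B;B=C;C=T``, matching the example
+above; for the chain, it emits ``A=B`` before ``B=C`` so that ``B`` is read
+before it is overwritten.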
+
+It then lowers the linearized assignments into machine instructions (which may
+require extra physical registers e.g. to copy from one stack location to
+another), and finally runs the register allocator limited to those instructions.
+
+In rare cases, the register allocator may fail on some *must-have-register*
+variable, because register pressure is too high to satisfy requirements arising
+from cycle-breaking temporaries and registers required for stack-to-stack
+copies.  In this case, we have to find a register with no active uses within the
+variable's live range, and actively spill/restore that register around the live
+range.  This makes the code quality suffer and may be slow to implement
+depending on compiler data structures, but in practice we find the situation to
+be vanishingly rare and so not really worth optimizing.
+
+Local live range splitting
+--------------------------
+
+The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets
+a register for its entire live range, or not at all.  This is unfortunate when a
+variable has many uses close together, but ultimately a large enough live range
+to prevent register assignment.  Another bad example is on x86 where all vector
+and floating-point registers (the ``xmm`` registers) are killed by call
+instructions, per the x86 call ABI, so such variables are completely prevented
+from having a register when their live ranges contain a call instruction.
+
+The general solution here is some policy for splitting live ranges.  A variable
+can be split into multiple copies and each can be register-allocated separately.
+The complication comes in finding a sane policy for where and when to split
+variables such that complexity doesn't explode, and how to join the different
+values at merge points.
+
+Subzero implements aggressive block-local splitting of variables.  Each basic
+block is handled separately and independently.  Within the block, we maintain a
+table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with
+every variable ``V``'s initial state set (implicitly) as ``T[V]=V``.  For each
+instruction in the block, and for each *may-have-register* variable ``V`` in the
+instruction, we do the following:
+
+- Replace all uses of ``V`` in the instruction with ``T[V]``.
+
+- Create a new split variable ``V'``.
+
+- Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately
+  before or immediately after) the current instruction.
+
+- Update ``T[V]`` to be ``V'``.
+
+This leads to a chain of copies of ``V`` through the block, linked by assignment
+instructions.  The live ranges of these copies are usually much smaller, and
+more likely to get registers.  In fact, because of the preference mechanism
+described above, they are likely to get the same register whenever possible.
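+
+The splitting steps above can be sketched as follows.  The function name
+``split_block``, the ``(dest, sources)`` instruction tuples, and the ``V'n``
+naming of split variables are assumptions for the example.  The sketch also
+simplifies: it treats every variable as *may-have-register*, splits only source
+variables, always places the new assignment immediately after the instruction,
+and assumes each variable is defined at most once in the block.
+
```python
def split_block(insts):
    """Apply block-local splitting to a list of (dest, [source, ...])."""
    current = {}   # the table T; a missing entry means T[V] = V
    counter = {}   # number of split versions created so far per variable
    out = []
    for dest, sources in insts:
        # Replace all uses of V in the instruction with T[V].
        out.append((dest, [current.get(s, s) for s in sources]))
        for v in dict.fromkeys(sources):   # each distinct variable once
            # Create a new split variable V'.
            counter[v] = counter.get(v, 0) + 1
            split = f"{v}'{counter[v]}"
            # Add the assignment V' = T[V] right after the instruction.
            out.append((split, [current.get(v, v)]))
            # Update T[V] to be V'.
            current[v] = split
    return out


insts = [
    ("x", ["a", "b"]),
    ("y", ["a", "x"]),
]
result = split_block(insts)
```
+
+In ``result``, the second instruction reads ``a'1`` instead of ``a``, and the
+copies ``a'1``, ``a'2`` form the chain described above.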
+
+One obvious question comes up: won't this proliferation of new variables cause
+an explosion in the running time of liveness analysis and register allocation?
+As it turns out, not really.
+
+First, for register allocation, the cost tends to be dominated by live range
+overlap computations, whose cost is roughly proportional to the size of the live
+range.  All the new variable copies' live ranges sum up to the original
+variable's live range, so the cost isn't vastly greater.
+
+Second, for liveness analysis, the cost is dominated by merging bit vectors
+corresponding to the set of variables that have multi-block liveness.  All the
+new copies are guaranteed to be single-block, so the main additional cost is
+that of processing the new assignments.
+
+There's one other key issue here.  The original variable and all of its copies
+need to be "linked", in the sense that all of these variables that get a stack
+slot (because they didn't get a register) are guaranteed to have the same stack
+slot.  This way, we can avoid generating any code related to ``V'=V`` when
+neither gets a register.  In addition, we can elide instructions that write a
+value to a stack slot that is linked to another stack slot, because that is
+guaranteed to be just rewriting the same value to the stack.