From: Jim Stichnoth Date: Wed, 1 Feb 2017 17:03:02 +0000 (-0800) Subject: Subzero: Add a new document describing the register allocator. X-Git-Tag: android-x86-7.1-r1~148^2~7 X-Git-Url: http://git.osdn.net/view?a=commitdiff_plain;h=59b0dad77008c3f053c2675bde5d98fc9f825ea8;p=android-x86%2Fexternal-swiftshader.git Subzero: Add a new document describing the register allocator. There's some good work in Subzero's register allocation, which deserves to be captured in a single place. BUG= none R=kschimpf@google.com Review URL: https://codereview.chromium.org/2277493003 . --- diff --git a/docs/REGALLOC.rst b/docs/REGALLOC.rst new file mode 100644 index 000000000..99c6939f7 --- /dev/null +++ b/docs/REGALLOC.rst @@ -0,0 +1,411 @@ +Register allocation in Subzero +============================== + +Introduction +------------ + +`Subzero +`_ +is a fast code generator that translates architecture-independent `PNaCl bitcode +`_ into +architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been +simplified (e.g. weird-width primitive types like 57-bit integers are not +allowed) and has had architecture-independent optimizations already applied. +Subzero aims to generate high-quality code as fast as practical, and as such +Subzero needs to make tradeoffs between compilation speed and output code +quality. + +In Subzero, we have found register allocation to be one of the most important +optimizations toward achieving the best code quality, which is in tension with +register allocation's reputation for being complex and expensive. Linear-scan +register allocation is a modern favorite for getting fairly good register +allocation at relatively low cost. Subzero uses linear-scan for its core +register allocator, with a few internal modifications to improve its +effectiveness (specifically: register preference, register overlap, and register +aliases). 
Subzero also does a few high-level things on top of its core register +allocator to improve overall effectiveness (specifically: repeat until +convergence, delayed phi lowering, and local live range splitting). + +What we describe here are techniques that have worked well for Subzero, in the +context of its particular intermediate representation (IR) and compilation +strategy. Some of these techniques may not be applicable to another compiler, +depending on its particular IR and compilation strategy. Some key concepts in +Subzero are the following: + +- Subzero's ``Variable`` operand is an operand that resides either on the stack + or in a physical register. A Variable can be tagged as *must-have-register* + or *must-not-have-register*, but its default is *may-have-register*. All uses + of the Variable map to the same physical register or stack location. + +- Basic lowering is done before register allocation. Lowering is the process of + translating PNaCl bitcode instructions into native instructions. Sometimes a + native instruction, like the x86 ``add`` instruction, allows one of its + Variable operands to be either in a physical register or on the stack, in + which case the lowering is relatively simple. But if the lowered instruction + requires the operand to be in a physical register, we generate code that + copies the Variable into a *must-have-register* temporary, and then use that + temporary in the lowered instruction. + +- Instructions within a basic block appear in a linearized order (as opposed to + something like a directed acyclic graph of dependency edges). + +- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually + small) number of *source* Variables. Assuming SSA form, the instruction + begins the live range of the dest Variable, and may end the live range of one + or more of the source Variables. 
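To make the lowering concept concrete, here is a small Python sketch of the copy-into-a-*must-have-register*-temporary pattern described in the bullets above. All names here (``Variable``, ``RegPolicy``, ``lower_with_temp``) are hypothetical illustrations, not Subzero's actual C++ API, and the tuple encoding of instructions is invented:

```python
from dataclasses import dataclass
from enum import Enum

class RegPolicy(Enum):
    MAY_HAVE = 0       # default: allocator may assign a register
    MUST_HAVE = 1      # allocator must assign a physical register
    MUST_NOT_HAVE = 2  # always lives on the stack

@dataclass
class Variable:
    name: str
    policy: RegPolicy = RegPolicy.MAY_HAVE

def lower_with_temp(dest, src):
    """Lower an instruction that requires `src` in a register: copy
    `src` into a must-have-register temporary, then use the temporary
    in the lowered instruction (hypothetical helper)."""
    tmp = Variable(src.name + ".tmp", RegPolicy.MUST_HAVE)
    return [("mov", tmp, src),   # copy into the temporary
            ("op", dest, tmp)]   # lowered instruction uses the temporary
```

If the program variable later fails to get a register, only the ``mov`` touches the stack; the target instruction's register requirement is still met by the temporary.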
+ 

Executive summary
-----------------

- Liveness analysis and live range construction are prerequisites for register
  allocation. Without careful attention, they can potentially be very
  expensive, especially when the number of variables and basic blocks gets very
  large. Subzero uses some clever data structures to take advantage of the
  sparsity of the data, resulting in stable performance as function size scales
  up. This means that the proportion of compilation time spent on liveness
  analysis stays roughly the same.

- The core of Subzero's register allocator is taken from `Mössenböck and
  Pfeiffer's paper `_ on
  linear-scan register allocation.

- We enhance the core algorithm with a good automatic preference mechanism when
  more than one register is available, to try to minimize register shuffling.

- We also enhance it to allow overlapping live ranges to share the same
  register, when one variable is recognized as a read-only copy of the other.

- We deal with register aliasing in a clean and general fashion. Register
  aliasing is when e.g. 16-bit architectural registers share some bits with
  their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit
  registers.

- We improve register allocation by rerunning the algorithm on likely candidates
  that didn't get registers during the previous iteration, without imposing much
  additional cost.

- The main register allocation is done before phi lowering, because phi lowering
  imposes early and unnecessary ordering constraints on the resulting
  assignments, which create spurious interferences in live ranges.

- Within each basic block, we aggressively split each variable's live range at
  every use, so that portions of the live range can get registers even if the
  whole live range can't.
  Doing this separately for each basic block avoids
  merge complications, and keeps liveness analysis and register allocation fast
  by fitting well into the sparsity optimizations of their data structures.

Liveness analysis
-----------------

With respect to register allocation, the main purpose of liveness analysis is to
calculate the live range of each variable. The live range is represented as a
set of instruction number ranges. Instruction numbers within a basic block must
be monotonically increasing, and the instruction ranges of two different basic
blocks must not overlap.

Basic liveness
^^^^^^^^^^^^^^

Liveness analysis is a straightforward dataflow algorithm. For each basic
block, we keep track of the live-in and live-out set, i.e. the set of variables
that are live coming into or going out of the basic block. Processing of a
basic block starts by initializing a temporary set as the union of the live-in
sets of the basic block's successor blocks. (This basic block's live-out set is
captured as the initial value of the temporary set.) Then each instruction of
the basic block is processed in reverse order. All the source variables of the
instruction are marked as live, by adding them to the temporary set, and the
dest variable of the instruction (if any) is marked as not-live, by removing it
from the temporary set. When we finish processing all of the block's
instructions, we add/union the temporary set into the basic block's live-in set.
If this changes the basic block's live-in set, then we mark all of this basic
block's predecessor blocks to be reprocessed. Then we repeat for other basic
blocks until convergence, i.e. no more basic blocks are marked to be
reprocessed. If basic blocks are processed in reverse topological order, then
the number of times each basic block needs to be reprocessed is generally its
loop nest depth.

The result of this pass is the live-in and live-out set for each basic block.
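The dataflow loop just described can be sketched in a few lines of Python. This is a hedged illustration only: the block encoding (each block as a list of ``(dest, sources)`` pairs) is invented for the example, and real Subzero operates on bit vectors and C++ data structures rather than hash sets:

```python
def compute_liveness(blocks, succs, preds):
    """Backward liveness dataflow (sketch, not Subzero's implementation).

    blocks: dict block_id -> list of (dest_or_None, [source_vars]).
    succs/preds: dict block_id -> list of block_ids.
    Returns (live_in, live_out) as dicts of variable sets.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    work = list(blocks)  # worklist; ideally in reverse topological order
    while work:
        b = work.pop()
        # Temporary set starts as the union of successors' live-in sets;
        # this is also the block's live-out set.
        tmp = set()
        for s in succs.get(b, []):
            tmp |= live_in[s]
        live_out[b] = set(tmp)
        # Walk instructions in reverse: kill the dest, then gen the sources.
        for dest, srcs in reversed(blocks[b]):
            if dest is not None:
                tmp.discard(dest)
            tmp.update(srcs)
        # Union into live-in; if it changed, reprocess the predecessors.
        if not tmp <= live_in[b]:
            live_in[b] |= tmp
            for p in preds.get(b, []):
                if p not in work:
                    work.append(p)
    return live_in, live_out
```

For a two-block function where ``b0`` defines ``x`` and ``b1`` uses it, the pass reports ``x`` as live-out of ``b0`` and live-in to ``b1``, but not live-in to ``b0``.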
+ 

With so many set operations, choice of data structure is crucial for
performance. We tried a few options, and found that a simple dense bit vector
works best. This keeps the per-instruction cost very low. However, we found
that when the function gets very large, merging and copying bit vectors at basic
block boundaries dominates the cost. This is due to both the number of variables
(hence the bit vector size) and the number of basic blocks growing large.

A simple enhancement brought this under control in Subzero. It turns out that
the vast majority of variables are referenced, and therefore live, only in a
single basic block. This is largely due to the SSA form of PNaCl bitcode. To
take advantage of this, we partition variables by single-block versus
multi-block liveness. Multi-block variables get lower-numbered bit vector
indexes, and single-block variables get higher-numbered indexes. Single-block
bit vector indexes are reused across different basic blocks. As such, the size
of live-in and live-out bit vectors is limited to the number of multi-block
variables, and the temporary set's size can be limited to that plus the largest
number of single-block variables across all basic blocks.

For the purpose of live range construction, we also need to track definitions
(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the
basic block. These are easy to detect while processing the instructions; data
structure choices are described below.

Live range construction
^^^^^^^^^^^^^^^^^^^^^^^

After the live-in and live-out sets are calculated, we construct each variable's
live range (as an ordered list of instruction ranges, described above). We do
this by considering the live-in and live-out sets, combined with LiveBegin and
LiveEnd information. This is done separately for each basic block.
+ 

As before, we need to take advantage of the sparsity of variable uses across
basic blocks, to avoid excessive copying and merging of data structures. The
following is what worked well for Subzero (after trying several other options).

The basic liveness pass, described above, keeps track of when a variable's live
range begins or ends within the block. LiveBegin and LiveEnd are unordered
vectors where each element is a pair of the variable and the instruction number,
representing that the particular variable's live range begins or ends at the
particular instruction. When the liveness pass finds a variable whose live
range begins or ends, it appends an entry to LiveBegin or LiveEnd.

During live range construction, the LiveBegin and LiveEnd vectors are sorted by
variable number. Then we iterate across both vectors in parallel. If a
variable appears in both LiveBegin and LiveEnd, then its live range is entirely
within this block. If it appears in only LiveBegin, then its live range starts
here and extends through the end of the block. If it appears in only LiveEnd,
then its live range starts at the beginning of the block and ends here. (Note
that this only covers the live range within this block, and this process is
repeated across all blocks.)

It is also possible that a variable is live within this block but its live range
does not begin or end in this block. These variables are identified simply by
taking the intersection of the live-in and live-out sets.

As a result of these data structures, the performance of liveness analysis and
live range construction tends to be stable across small, medium, and large
functions, as measured by a fairly consistent proportion of total compilation
time spent on the liveness passes.
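The per-block merge step can be sketched as follows, with the same caveats as before: the ``build_block_ranges`` name and its argument encoding are invented for illustration, and Subzero iterates the two sorted vectors in parallel rather than materializing dictionaries:

```python
def build_block_ranges(first, last, live_begin, live_end, live_in, live_out):
    """Construct one block's contribution to each variable's live range
    (sketch). first/last: the block's first and last instruction numbers.
    live_begin/live_end: lists of (var, instr_num) pairs appended during
    the liveness pass. live_in/live_out: sets of variables live across
    the block boundary. Returns dict var -> (begin, end)."""
    begins = dict(sorted(live_begin))  # sort by variable, then index
    ends = dict(sorted(live_end))
    ranges = {}
    for v in set(begins) | set(ends):
        # Absent from LiveBegin: live-in, so the range starts at the top.
        # Absent from LiveEnd: live-out, so it extends to the block's end.
        ranges[v] = (begins.get(v, first), ends.get(v, last))
    # Live-through variables: live-in and live-out, with no begin or end
    # inside this block; found by intersecting the two boundary sets.
    for v in (live_in & live_out) - set(ranges):
        ranges[v] = (first, last)
    return ranges
```

Repeating this for every block, and appending each block's ``(begin, end)`` pair to the variable's ordered list of instruction ranges, yields the full live ranges used by the allocator.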
+ +Linear-scan register allocation +------------------------------- + +The basis of Subzero's register allocator is the allocator described by +Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in +the Context of SSA Form and Register Constraints +`_. It allows live ranges to +be a union of intervals rather than a single conservative interval, and it +allows pre-coloring of variables with specific physical registers. + +The paper suggests an approach of aggressively coalescing variables across Phi +instructions (i.e., trying to force Phi source and dest variables to have the +same register assignment), but we omit that in favor of the more natural +preference mechanism described below. + +We found the paper quite remarkable in that a straightforward implementation of +its pseudo-code led to an entirely correct register allocator. The only thing +we found in the specification that was even close to a mistake is that it was +too aggressive in evicting inactive ranges in the "else" clause of the +AssignMemLoc routine. An inactive range only needs to be evicted if it actually +overlaps the current range being considered, whereas the paper evicts it +unconditionally. (Search for "original paper" in Subzero's register allocator +source code.) + +Register preference +------------------- + +The linear-scan algorithm from the paper talks about choosing an available +register, but isn't specific on how to choose among several available registers. +The simplest approach is to just choose the first available register, e.g. the +lowest-numbered register. Often a better choice is possible. + +Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` +with source variables ``S1,S2,...``, and that instruction ends the live range of +one of those source variables ``Sn``, and ``Sn`` was assigned a register, then +``Sn``'s register is usually a good choice for ``V``. 
This is especially true +when the instruction is a simple assignment, because an assignment where the +dest and source variables end up with the same register can be trivially elided, +reducing the amount of register-shuffling code. + +This requires being able to find and inspect a variable's defining instruction, +which is not an assumption in the original paper but is easily tracked in +practice. + +Register overlap +---------------- + +Because Subzero does register allocation after basic lowering, the lowering has +to be prepared for the possibility of any given program variable not getting a +physical register. It does this by introducing *must-have-register* temporary +variables, and copies the program variable into the temporary to ensure that +register requirements in the target instruction are met. + +In many cases, the program variable and temporary variable have overlapping live +ranges, and would be forced to have different registers even if the temporary +variable is effectively a read-only copy of the program variable. We recognize +this when the program variable has no definitions within the temporary +variable's live range, and the temporary variable has no definitions within the +program variable's live range with the exception of the copy assignment. + +This analysis is done as part of register preference detection. + +The main impact on the linear-scan implementation is that instead of +setting/resetting a boolean flag to indicate whether a register is free or in +use, we increment/decrement a number-of-uses counter. + +Register aliases +---------------- + +Sometimes registers of different register classes partially overlap. For +example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't +alias each other), and all three alias ``eax`` and ``rax``. 
And in ARM, +registers ``s0`` and ``s1`` (which are single-precision floating-point +registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` +and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper +doesn't address this issue; it assumes that all registers are independent. A +simple solution is to essentially avoid aliasing by disallowing a subset of the +registers, but there is obviously a reduction in code quality when e.g. half of +the registers are taken away. + +Subzero handles this more elegantly. For each register, we keep a bitmask +indicating which registers alias/conflict with it. For example, in x86, +``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and +in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to +mark a register as being used or released, we do the same for all of its +aliases. + +Before implementing this, we were a bit apprehensive about the compile-time +performance impact. Instead of setting one bit in a bit vector or decrementing +one counter, this generally needs to be done in a loop that iterates over all +aliases. Fortunately, this seemed to make very little difference in +performance, as the bulk of the cost ends up being in live range overlap +computations, which are not affected by register aliasing. + +Repeat until convergence +------------------------ + +Sometimes the linear-scan algorithm makes a register assignment only to later +revoke it in favor of a higher-priority variable, but it turns out that a +different initial register choice would not have been revoked. For relatively +low compile-time cost, we can give those variables another chance. + +During register allocation, we keep track of the revoked variables and then do +another round of register allocation targeted only to that set. We repeat until +no new register assignments are made, which is usually just a handful of +successively cheaper iterations. 
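A minimal sketch of this repeat-until-convergence driver, assuming a hypothetical ``run_allocation`` round that reports which candidates newly received registers and which had their assignments revoked (neither name is from Subzero's source):

```python
def allocate_until_convergence(run_allocation, all_vars):
    """Rerun register allocation on revoked variables until no new
    assignments are made. `run_allocation(candidates)` is assumed to
    return (assigned, revoked) sets for one round. Returns the number
    of rounds executed."""
    candidates = set(all_vars)
    rounds = 0
    while candidates:
        assigned, revoked = run_allocation(candidates)
        rounds += 1
        if not assigned:      # no new register assignments: converged
            break
        candidates = revoked  # retry only the variables that lost a register
    return rounds
```

Each successive round considers only the (shrinking) revoked set, which is why the extra iterations are successively cheaper.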
+ +Another approach would be to repeat register allocation for *all* variables that +haven't had a register assigned (rather than variables that got a register that +was subsequently revoked), but our experience is that this greatly increases +compile-time cost, with little or no code quality gain. + +Delayed Phi lowering +-------------------- + +The linear-scan algorithm works for phi instructions as well as regular +instructions, but it is tempting to lower phi instructions into non-SSA +assignments before register allocation, so that register allocation need only +happen once. + +Unfortunately, simple phi lowering imposes an arbitrary ordering on the +resulting assignments that can cause artificial overlap/interference between +lowered assignments, and can lead to worse register allocation decisions. As a +simple example, consider these two phi instructions which are semantically +unordered:: + + A = phi(B) // B's live range ends here + C = phi(D) // D's live range ends here + +A straightforward lowering might yield:: + + A = B // end of B's live range + C = D // end of D's live range + +The potential problem here is that A and D's live ranges overlap, and so they +are prevented from having the same register. Swapping the order of lowered +assignments fixes that (but then B and C would overlap), but we can't really +know which is better until after register allocation. + +Subzero deals with this by doing the main register allocation before phi +lowering, followed by phi lowering, and finally a special register allocation +pass limited to the new lowered assignments. + +Phi lowering considers the phi operands separately for each predecessor edge, +and starts by finding a topological sort of the Phi instructions, such that +assignments can be executed in that order without violating dependencies on +registers or stack locations. If a topological sort is not possible due to a +cycle, the cycle is broken by introducing a temporary, e.g. 
``A=B;B=C;C=A`` can +become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up +registers early to reduce register pressure. + +It then lowers the linearized assignments into machine instructions (which may +require extra physical registers e.g. to copy from one stack location to +another), and finally runs the register allocator limited to those instructions. + +In rare cases, the register allocator may fail on some *must-have-register* +variable, because register pressure is too high to satisfy requirements arising +from cycle-breaking temporaries and registers required for stack-to-stack +copies. In this case, we have to find a register with no active uses within the +variable's live range, and actively spill/restore that register around the live +range. This makes the code quality suffer and may be slow to implement +depending on compiler data structures, but in practice we find the situation to +be vanishingly rare and so not really worth optimizing. + +Local live range splitting +-------------------------- + +The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets +a register for its entire live range, or not at all. This is unfortunate when a +variable has many uses close together, but ultimately a large enough live range +to prevent register assignment. Another bad example is on x86 where all vector +and floating-point registers (the ``xmm`` registers) are killed by call +instructions, per the x86 call ABI, so such variables are completely prevented +from having a register when their live ranges contain a call instruction. + +The general solution here is some policy for splitting live ranges. A variable +can be split into multiple copies and each can be register-allocated separately. +The complication comes in finding a sane policy for where and when to split +variables such that complexity doesn't explode, and how to join the different +values at merge points. 
+ +Subzero implements aggressive block-local splitting of variables. Each basic +block is handled separately and independently. Within the block, we maintain a +table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with +every variable ``V``'s initial state set (implicitly) as ``T[V]=V``. For each +instruction in the block, and for each *may-have-register* variable ``V`` in the +instruction, we do the following: + +- Replace all uses of ``V`` in the instruction with ``T[V]``. + +- Create a new split variable ``V'``. + +- Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately + before or immediately after) the current instruction. + +- Update ``T[V]`` to be ``V'``. + +This leads to a chain of copies of ``V`` through the block, linked by assignment +instructions. The live ranges of these copies are usually much smaller, and +more likely to get registers. In fact, because of the preference mechanism +described above, they are likely to get the same register whenever possible. + +One obvious question comes up: won't this proliferation of new variables cause +an explosion in the running time of liveness analysis and register allocation? +As it turns out, not really. + +First, for register allocation, the cost tends to be dominated by live range +overlap computations, whose cost is roughly proportional to the size of the live +range. All the new variable copies' live ranges sum up to the original +variable's live range, so the cost isn't vastly greater. + +Second, for liveness analysis, the cost is dominated by merging bit vectors +corresponding to the set of variables that have multi-block liveness. All the +new copies are guaranteed to be single-block, so the main additional cost is +that of processing the new assignments. + +There's one other key issue here. 
The original variable and all of its copies +need to be "linked", in the sense that all of these variables that get a stack +slot (because they didn't get a register) are guaranteed to have the same stack +slot. This way, we can avoid generating any code related to ``V'=V`` when +neither gets a register. In addition, we can elide instructions that write a +value to a stack slot that is linked to another stack slot, because that is +guaranteed to be just rewriting the same value to the stack.
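As an illustration of the splitting transformation (not Subzero's actual implementation), the following Python sketch splits source uses only, using an invented ``(dest, sources)`` instruction encoding and ``V.1``, ``V.2`` naming for the split copies; it omits the stack-slot linking just described:

```python
def split_local_live_ranges(block):
    """Aggressive block-local live range splitting (sketch).

    block: list of (dest_or_None, [source_var_names]) instructions.
    Returns a new instruction list where every use of a variable goes
    through its latest split copy, with explicit copy assignments
    (encoded as ordinary (dest, [src]) instructions) linking the chain.
    """
    current = {}  # the table T: variable -> latest split copy (T[V]=V initially)
    counter = {}  # per-variable counter for naming the copies
    out = []
    for dest, srcs in block:
        # Replace each use of V in the instruction with T[V].
        out.append((dest, [current.get(v, v) for v in srcs]))
        for v in srcs:
            # Create a fresh split variable V' and emit V' = T[V]
            # adjacent to the current instruction, then set T[V] = V'.
            counter[v] = counter.get(v, 0) + 1
            split = f"{v}.{counter[v]}"
            out.append((split, [current.get(v, v)]))
            current[v] = split
    return out
```

Each source variable becomes a chain of short-lived copies linked by assignments; the preference mechanism then tends to keep the whole chain in one register when possible, and the stack-slot linking lets the copies collapse to nothing when none of them gets a register.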