Unstructured Branches
=====================

A major challenge in writing an OpenCL compiler is certainly handling any kind
of branch. Indeed, LLVM does not make any distinction between structured and
unstructured branches. See [here](http://llvm.org/docs/LangRef.html) for a
complete description of the LLVM assembly specification.

C branching code is simply lowered down to the following instructions:

- `ret`, to return from the current function.
- `br` which, when conditional, jumps to one of two destinations (one for the
  taken branch and one for the other).
- `switch`, which implements the C switch/case construct.
- `indirectbr`, which implements a jump table.
- `invoke` and `resume`, mostly used to handle exceptions.

Exceptions and jump tables are not supported in OpenCL. Switch cases can be
lowered down to a sequence of if/else statements (using a divide-and-conquer
approach, a switch/case can be dispatched in log(n) complexity, where n is the
number of targets).

This leaves us with the `br` and `ret` instructions to implement properly.

Solution 1 - Using Gen structured branches
------------------------------------------

Gen structured branches are the following instructions:

`if` `else` `endif` `break` `continue` `while` `brd` `brc`

Transforming the LLVM IR into structured code basically amounts to
reverse-engineering the LLVM code back into the original C code.
Unfortunately, there are several key problems:

- OpenCL supports the `goto` keyword, which may jump to an arbitrary location.
- LLVM can transform the control flow graph into any kind of form.
- Worse, a reducible control flow graph can be turned into an irreducible one
  by the optimizer.

This can lead to complicated code transformations and basic block duplication.
The specification allows the compiler to abort if an irreducible control flow
is detected but, as an implementor, it is quite awkward to abort the
compilation just because the optimizer turned a reducible CFG into an
irreducible one. Using structured branches opens the door to many corner cases.

However, there exists a pretty elegant solution that Gen can support almost
seamlessly. This is the solution we retained.

Solution 2 - Linearizing the control flow graph
-----------------------------------------------

The general problem is to map a general control flow graph to a SIMD machine.
The problem is fairly well understood today. A recent research paper dedicated
to OpenCL-like languages, which use the "SPMD" (single program, multiple data)
programming model, presents interesting insights about how to map such
languages to SIMD architectures (see
[here](http://www.cdl.uni-saarland.de/papers/karrenberg_opencl.pdf)).

### Core idea

- Linearizing the CFG initially consists in removing all forward branches and
  "replacing" them with predication. Indeed, the program will still be correct
  if you predicate instructions instead of using forward jumps. This is
  basically a control-flow to data-flow conversion.

- Of course, removing all forward branches is inefficient. To improve that, we
  simply introduce "if conditions" at the head of each basic block to know if
  we run the basic block. If no lane is going to be activated in the basic
  block, we jump to another basic block where _potentially_ some lanes are
  going to be reactivated (see the sketch below).
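
To make this more concrete, here is a small scalar C++ sketch of the
linearization idea. It is only an illustration, not Beignet code: the SIMD
width, the mask type and the helper names are made up for the example.

    #include <array>
    #include <cstdio>

    // Hypothetical SIMD width and per-lane mask type (illustration only).
    constexpr int kSimdWidth = 16;
    using Mask = std::array<bool, kSimdWidth>;

    // True when at least one lane would execute the guarded block.
    static bool anyActive(const Mask &m) {
      for (bool lane : m)
        if (lane) return true;
      return false;
    }

    // Scalar source for one lane:  if (x > 0) y = 2 * x; else y = -x;
    // Linearized form: both blocks are laid out in sequence, every instruction
    // is predicated by a per-lane mask, and a block is skipped only when no
    // lane needs it (the "if condition" at the head of the block).
    static void ifElseLinearized(const std::array<int, kSimdWidth> &x,
                                 std::array<int, kSimdWidth> &y) {
      Mask thenMask, elseMask;
      for (int i = 0; i < kSimdWidth; ++i) {
        thenMask[i] = x[i] > 0;       // branch condition evaluated per lane
        elseMask[i] = !thenMask[i];
      }
      if (anyActive(thenMask))                // head-of-block check
        for (int i = 0; i < kSimdWidth; ++i)
          if (thenMask[i]) y[i] = 2 * x[i];   // predicated "then" block
      if (anyActive(elseMask))                // head-of-block check
        for (int i = 0; i < kSimdWidth; ++i)
          if (elseMask[i]) y[i] = -x[i];      // predicated "else" block
    }

    int main() {
      std::array<int, kSimdWidth> x{}, y{};
      for (int i = 0; i < kSimdWidth; ++i) x[i] = i - 8;  // mix of signs
      ifElseLinearized(x, y);
      for (int v : y) std::printf("%d ", v);
      std::printf("\n");
      return 0;
    }

The two `anyActive` tests play the role of the per-block "if conditions":
without them every lane would still compute a correct result, but both blocks
would always be executed.
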
Consider the following CFG:

<pre>
o-------o
|       |
|   1   |---->-----o
|       |          |
o-------o          |
    |              |
    |              |
o-------o          |
|       |          |
|   2   |---->-----------o
|       |          |     |
o-------o          |     |
    |              |     |
    |              |     |
    | o------o     |     |
    | |      |     |     |
    | v      |     |     |
o-------o    |     |     |
|       |    |     |     |
|   3   |    |     |     |
|       |    |     |     |
o-------o    |     |     |
    | |      |     |     |
    | o------o     |     |
    |              |     |
o-------o          |     |
|       |          |     |
|   4   |<---------o     |
|       |                |
o-------o                |
    |                    |
    |                    |
o-------o                |
|       |                |
|   5   |<---------------o
|       |
o-------o
</pre>

Mapping it to a SIMD machine may seem challenging. Actually, it is not too
complicated. The problem lies with the 2->5 jump: we have to be sure that we
are not missing any computation done in block 4.

To do so:

- Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.
- We implement a `JOIN` point on top of block 4. We check if any lane is going
  to be reactivated for block 4. If not, we jump to block 5.

This leads to the following linearized CFG:

<pre>
o-------o
|       |
|   1   |---->-----o
|       |          |
o-------o          |
    |              |
    |              |
o-------o          |
|       |          |
|   2   |---->-----------o
|       |          |     |
o-------o          |     |
    |              |     |
    |              |     |
    | o--<---o     |     |
    | |      |     |     |
    | v      |     |     |
o-------o    |     |     |
|       |    |     |     |
|   3   | ^  |     |     |
|       |    |     |     |
o-------o    |     |     |
    | |      |     |     |
    | o-->---o     |     |
    |              |     |
o-------o          |     |
|       |==========|=====|====O
|   4   |<---------|-----o    |
|       |<---------o          |
o-------o                     |
    |                         |
    |                         |
o-------o                     |
|       |                     |
|   5   |<====================O
|       |
o-------o
</pre>

There is a new jump from block 4 to block 5.

### Implementation on Gen

When using structured branches, Gen supports auto-masking, i.e. based on the
branches that are taken, the control flow is properly handled and masks are
automatically applied to all instructions.

However, there is no similar support for unstructured branches. We therefore
decided to mask instructions manually and use a single program flow. This is
actually quite easy to do since Gen is able to predicate any branch.

Now, how do we evaluate the if conditions in an efficient way?

The choice we made is to use *per-lane block IPs*: we store a short (16 bits)
for each SIMD lane in a regular 256-bit GPR (general purpose register). This
"blockIP" register is used in the following way:

At the beginning of each block, we compare the blockIP register with the ID of
the block. The lane is going to be _activated_ if its blockIP is _smaller_ than
the ID of the block. Otherwise, the lane is deactivated.

Therefore, we build a flag register at the entry of each basic block with a
single 16-wide uint16_t compare. If no lane is activated, a jump is performed
to the next block where some lanes are going to be activated.

Since these are regular jumps, we just use the `jmpi` instruction. With the
help of predication, we can express all the different possibilities (a small
sketch of these rules follows the list):

- Backward branches are taken if _any_ of the lanes in the predicate is true.
  We just use `<+f0.0.anyh>` predication.
- Forward branches are *not* taken if some of the lanes are going to be
  activated in the next block. We therefore compare the blockIP with the ID of
  the _next_ block. If all of them are strictly greater than the ID of the next
  block, we jump. We therefore use the `<+f0.0.allh>` predicate in that case.
- `JOIN` points are even simpler. We simply jump if none of the lanes is
  activated. We therefore use the `<-f0.0.anyh>` predicate.
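
As an illustration of these three rules, here is a scalar C++ sketch, not the
actual Gen encoding: the lane count, block IDs and helper names are invented,
and lane activity is modeled here as `blockIP <= block ID`, so that the "all
strictly greater" forward-branch rule above is exactly its complement.

    #include <algorithm>
    #include <array>
    #include <cstdint>

    // Hypothetical lane count and per-lane blockIP vector (illustration only).
    constexpr int kSimdWidth = 16;
    using BlockIPs = std::array<std::uint16_t, kSimdWidth>;

    // A lane is modeled as active in block `id` when its blockIP is <= id.
    static bool anyLaneActive(const BlockIPs &ip, std::uint16_t id) {
      return std::any_of(ip.begin(), ip.end(),
                         [id](std::uint16_t v) { return v <= id; });
    }

    // Backward branch: taken when *any* lane still wants to run the loop
    // header (the <+f0.0.anyh> case).
    static bool takeBackwardJump(const BlockIPs &ip, std::uint16_t loopHeaderId) {
      return anyLaneActive(ip, loopHeaderId);
    }

    // Forward branch: taken only when *all* blockIPs are strictly greater
    // than the ID of the next block (the <+f0.0.allh> case).
    static bool takeForwardJump(const BlockIPs &ip, std::uint16_t nextBlockId) {
      return std::all_of(ip.begin(), ip.end(),
                         [nextBlockId](std::uint16_t v) { return v > nextBlockId; });
    }

    // JOIN point: jump away when no lane at all is activated for this block
    // (the <-f0.0.anyh> case).
    static bool takeJoinJump(const BlockIPs &ip, std::uint16_t blockId) {
      return !anyLaneActive(ip, blockId);
    }

    int main() {
      BlockIPs ip;
      ip.fill(4);   // every lane has already jumped past blocks 0..3...
      ip[3] = 2;    // ...except lane 3, which still needs block 2
      // Lane 3 keeps block 2 alive, so the JOIN jump is not taken.
      return takeJoinJump(ip, /*blockId=*/2) ? 1 : 0;
    }

On Gen, the three decisions above are of course a single predicated `jmpi`
each; the sketch only spells out which lanes drive each predicate.
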
The complete encoding is done in `src/backend/gen_insn_selection.cpp`. Forward
branches are handled by `SimpleSelection::emitForwardBranch`. Backward branches
are handled by `SimpleSelection::emitBackwardBranch`. Finally, since `JOIN`
points are at the top of each basic block, they are handled by
`SimpleSelection::emitLabelInstruction`.

### Computing `JOIN` points

The last problem is to compute the `JOIN` points, i.e. to know, at the
beginning of each block, whether we need to jump and, if we do, what the target
of the branch is. The code is relatively straightforward and can be found in
`src/backend/context.cpp`. The function is `Context::buildJIPs`.

Actually, the current implementation is not that elegant. A colleague, Thomas
Raoux, has a simpler and better idea to handle it.

### Advantages and drawbacks of the method

- The method has one decisive advantage: it is simple and extremely robust. It
  can handle any kind of CFG (reducible or not) and does not require any
  transformation. The use of shorts is also not random: a 16-wide short compare
  is issued in 2 cycles (so it is twice as fast as a 16-wide 32-bit compare).
- The main drawback is performance. Even if it is not so bad, we still need
  more instructions than if we used structured branches. Mostly:
  * one or two instructions for `JOIN` points
  * three instructions for backward and forward jumps (two more than structured
    branches, which just require the branch instruction itself)

Note that all extra instructions are 16-bit instructions (i.e. they use
shorts), so they only cost 2 cycles anyway.

The last point is that the Gen encoding restricts the conditional modifier and
the predicate to be the same within an instruction. This requires copying or
recomputing the flag register for compares and selects, so one extra
instruction is required for these two operations. Once again, this only costs
2 cycles.

Remarks on `ret` instructions
-----------------------------

Since we can handle any kind of CFG, handling return statements is relatively
straightforward. We first create one return block at the end of the program.
Then we replace all other returns by an unconditional jump to this block; the
CFG linearization takes care of the rest. We then simply encode the (now
unique) return instruction as an End-Of-Thread (EOT) message. A small sketch of
this idea is given at the end of this page.

Code examples
-------------

Some tests were written to assert the correctness of the CFG linearization and
the code generation. They can be found in the _run-time_ code base here:

- `utest/compiler_if_else.cpp`
- `utest/compiler_lower_return0.cpp`
- `utest/compiler_lower_return1.cpp`
- `utest/compiler_lower_return2.cpp`
- `utest/compiler_short_scatter.cpp`
- `utest/compiler_unstructured_branch0.cpp`
- `utest/compiler_unstructured_branch1.cpp`
- `utest/compiler_unstructured_branch2.cpp`
- `utest/compiler_unstructured_branch3.cpp`
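
Finally, as a companion to the `ret` handling described above, here is a
minimal LLVM-style sketch of the "single return block" idea. It is only an
illustration under simplifying assumptions (void kernels, well-formed blocks),
not the Beignet implementation; LLVM itself also ships a pass with a similar
purpose (`UnifyFunctionExitNodes`).

    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/Support/Casting.h"

    using namespace llvm;

    // Illustration only: create one return block at the end of the function
    // and replace every other `ret` by an unconditional branch to it. It is
    // restricted to void functions to keep the sketch short.
    static void unifyReturns(Function &F) {
      BasicBlock *RetBlock = BasicBlock::Create(F.getContext(), "unified.ret", &F);
      IRBuilder<> Builder(RetBlock);
      Builder.CreateRetVoid();               // the single exit point

      for (BasicBlock &BB : F) {
        if (&BB == RetBlock)
          continue;
        Instruction *Term = BB.getTerminator();
        if (auto *Ret = dyn_cast_or_null<ReturnInst>(Term)) {
          Ret->eraseFromParent();            // drop the original return...
          BranchInst::Create(RetBlock, &BB); // ...and jump to the unique return block
        }
      }
    }

After such a transformation, the CFG linearization described above only ever
sees a single `ret`, which is then encoded as the EOT message.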