author | Zhigang Gong <zhigang.gong@linux.intel.com> | 2013-06-25 18:04:53 +0800
committer | Zhigang Gong <zhigang.gong@linux.intel.com> | 2013-06-25 18:55:08 +0800
commit | 97c3a9b076ccd24af09fb6ce3f2be7b8e3f93a4f (patch)
tree | 1650dbb208d4ceae5a8ff7dd37a91c1a6e13941e /docs
parent | 7508682d00101231230b6dbbb20b51c5d67deb17 (diff)
Docs: Rearrange documents according to wiki page on fd.o.
We use the fd.o wiki to host our documents. To make maintenance easier,
I changed the directory layout and some links in the markdown files to make
them the same as fd.o's.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Diffstat (limited to 'docs')
-rw-r--r-- | docs/Beignet.mdwn | 138
-rw-r--r-- | docs/Beignet/Backend.mdwn | 58
-rw-r--r-- | docs/Beignet/Backend/TODO.mdwn | 109
-rw-r--r-- | docs/Beignet/Backend/compiler_backend.mdwn | 110
-rw-r--r-- | docs/Beignet/Backend/flat_address_space.mdwn | 98
-rw-r--r-- | docs/Beignet/Backend/gen_ir.mdwn | 254
-rw-r--r-- | docs/Beignet/Backend/unstructured_branches.mdwn | 271
7 files changed, 1038 insertions, 0 deletions
diff --git a/docs/Beignet.mdwn b/docs/Beignet.mdwn new file mode 100644 index 00000000..e31cce1d --- /dev/null +++ b/docs/Beignet.mdwn @@ -0,0 +1,138 @@

Beignet
=======

Beignet is an open source implementation of the OpenCL specification - a generic
compute oriented API. This code base contains the code to run OpenCL programs on
Intel GPUs: it basically defines and implements the OpenCL host functions
required to initialize the device, create the command queues, the kernels and
the programs, and run them on the GPU. The code base also contains the compiler
part of the stack, which is included in `backend/`. For more specific information
about the compiler, please refer to `backend/README.md`.

How to build
------------

The project uses CMake with three profiles:

1. Debug (-g)
2. RelWithDebInfo (-g with optimizations)
3. Release (only optimizations)

Basically, from the root directory of the project:

`> mkdir build`

`> cd build`

`> cmake ../ # to configure`

Choose whichever profile you want for the build. If you use the interactive
`ccmake`, press 'c' to configure and 'g' to generate.

`> make`

The project depends on several external libraries:

- Several X components (XLib, Xfixes, Xext)
- libdrm libraries (libdrm and libdrm\_intel)
- Various LLVM components
- The compiler backend itself (libgbe)
- Mesa git master version built with gbm enabled, to support the cl\_khr\_gl\_sharing extension.

CMake will check the dependencies and will complain if it does not find them.

CMake will also build the backend project. Please refer to
[[OpenCL Gen Backend|Beignet/Backend]] for more dependencies.

Once built, the run-time produces a shared object libcl.so which basically
directly implements the OpenCL API. A set of tests is also produced. They may
be found in `utests/`.

Note that the compiler depends on LLVM (Low-Level Virtual Machine project).
Right now, the code has been compiled with LLVM 3.1/3.2. It will not compile
with anything older.
[http://llvm.org/releases/](http://llvm.org/releases/)

LLVM 3.1, 3.2 and 3.3 are supported.

Also note that the code was compiled with GCC 4.6 and GCC 4.7. Since the code uses
fairly recent C++11 features, you may expect problems with older compilers. Last
time I tried, the code broke ICC 12 and Clang with internal compiler errors
while compiling anonymous nested lambda functions.

How to run
----------

Apart from the OpenCL library itself that can be used by any OpenCL application,
this code also produces various tests to ensure the consistency of the compiler
and the run-time. This small test framework uses a simple C++ registration system
to register all the unit tests.

You need to set the variable `OCL_KERNEL_PATH` to locate the OCL kernels. They
ship with the run-time in `./kernels`.

Then in `utests/`:

`> ./utest_run`

will run all the unit tests one after the other.

`> ./utest_run some_unit_test0 some_unit_test1`

will only run the `some_unit_test0` and `some_unit_test1` tests.

Supported Hardware
------------------

The code was tested on IVB GT2 with the Ubuntu and Fedora distributions.
Only IVB is supported right now; actually, the code was only run on IVB GT2, so
you may expect some issues with IVB GT1.

TODO
----

The run-time is far from being complete. Most of the pieces have been put
together to test and develop the OpenCL compiler. A partial list of things to
do:

- Complete cl\_khr\_gl\_sharing support. We still lack the implementation of some
  APIs such as clCreateFromGLBuffer, clCreateFromGLRenderbuffer, clGetGLObjectInfo...
  Currently, the working APIs are clCreateFromGLTexture and clCreateFromGLTexture2D.

- Support for events.

- Check that NDRangeKernels can be pushed into _different_ queues from several
  threads.

- Support for non-blocking Enqueue\*Buffer calls. For now we only use the map
  extension to implement those Enqueue\*Buffer functions.

- No state tracking at all.
One batch buffer is created at each "draw call"
  (i.e. for each NDRangeKernel). This is really inefficient since some
  expensive pipe controls are issued for each batch buffer.

- Valgrind reports some leaks in libdrm. It sounds like a false positive but it
  has to be checked; the same goes for LLVM, where there is one leak to check.

More generally, everything in the run-time that triggers the "FATAL" macro means
that something that must be supported is not implemented properly (either it
does not comply with the standard or it is just missing).

Project repository
------------------
Right now, we host our project on fd.o at: git://anongit.freedesktop.org/beignet.

The team
--------
This project was created by Ben Segovia when he was working for Intel. Now a
team in the China OTC graphics department continues to work on this project.
The official contact for this project is: Zou Nanhai (<nanhai.zou@intel.com>).

How to contribute
-----------------
You are always welcome to contribute to this project; you just need to subscribe
to the Beignet mailing list and send patches to it for review.
The official mailing list is:
http://lists.freedesktop.org/mailman/listinfo/beignet

diff --git a/docs/Beignet/Backend.mdwn b/docs/Beignet/Backend.mdwn new file mode 100644 index 00000000..99d678e7 --- /dev/null +++ b/docs/Beignet/Backend.mdwn @@ -0,0 +1,58 @@

Beignet Compiler
================

This code base contains the compiler part of the Beignet OpenCL stack. The
compiler is responsible for taking an OpenCL source string and compiling it into
a binary that can be executed on Intel integrated GPUs.

Limitations
-----------

Today, the compiler is far from complete. See [[here|Backend/TODO]] for an
(incomplete) list of things to do.
Interface with the run-time
---------------------------

Even if the compiler makes very liberal use of C++ (templates, variadic
templates, macros), we tried hard to expose a very simple interface to the
run-time. The interface is therefore a pure C99 interface and it is defined
in `src/backend/program.h`.

The goal is to hide the complexity of the inner data structures and to enable
a simple run-time implementation using straightforward C99.

Note that the data structures are fully opaque: this allows us to use either the
C++ simulator or the real Gen program in a relatively non-intrusive way.

Various environment variables
-----------------------------

Environment variables are used all over the code. The most important ones are:

- `OCL_SIMD_WIDTH` `(8 or 16)`. Change the number of lanes per hardware thread

- `OCL_OUTPUT_GEN_IR` `(0 or 1)`. Output Gen IR (scalar intermediate
  representation) code

- `OCL_OUTPUT_LLVM` `(0 or 1)`. Output LLVM code after the lowering passes

- `OCL_OUTPUT_LLVM_BEFORE_EXTRA_PASS` `(0 or 1)`. Output LLVM code before the
  lowering passes

- `OCL_OUTPUT_ASM` `(0 or 1)`. Output Gen ISA

- `OCL_OUTPUT_REG_ALLOC` `(0 or 1)`. Output Gen register allocations

Implementation details
----------------------

Several key decisions may use the hardware in an unusual way. See the following
documents for the technical details about the compiler implementation:

- [[Flat address space|flat_address_space]]
- [[Unstructured branches|unstructured_branches]]
- [[Scalar intermediate representation|gen_ir]]
- [[Clean backend implementation|compiler_backend]]

Ben Segovia.

diff --git a/docs/Beignet/Backend/TODO.mdwn b/docs/Beignet/Backend/TODO.mdwn new file mode 100644 index 00000000..3f1ccb4c --- /dev/null +++ b/docs/Beignet/Backend/TODO.mdwn @@ -0,0 +1,109 @@

TODO
====

The compiler is far from complete.
Even if the skeleton is now done and should
be solid, there are a _lot_ of things to do, from trivial to complex.

OpenCL standard library
-----------------------

Today we define the OpenCL API in the header file `src/ocl_stdlib.h`. This file
is far from being complete.

By the way, one question remains: do we want to implement
the high-precision functions as _inline_ functions or as external functions to
call? Indeed, inlining all functions may lead to severe code bloat while
calling functions will require implementing a proper ABI. We certainly want to
do both actually.

LLVM front-end
--------------

The code is defined in `src/llvm`. We use the PTX ABI and the OpenCL profile
to compile the code. Therefore, a good part of the job is already done. However,
many things must be implemented:

- Lowering down of various intrinsics like `llvm.memcpy`

- Conformance tests for all OpenCL built-ins (`native_cos`, `native_sin`,
  `mad`, atomic operations, barriers...).

- Lowering down of int16 / int8 / float16 / char16 / char8 / char4 loads and
  stores into the supported loads and stores

- Support for local declarations of local arrays (the OpenCL profile will properly
  declare them as global arrays)

- Support for doubles

- Support for atomic extensions.

- Better resolution of the PHI functions. Today, we always generate MOV
  instructions at the end of each basic block. They can be easily optimized.

Gen IR
------

The code is defined in `src/ir`. The main things to do are:

- Bringing support for doubles

- Adding support for atomic extensions.

- Finishing the handling of function arguments (see the [[IR
  description|gen_ir]] for more details)

- Adding support for linking IR units together. OpenCL indeed allows creating
  programs from several sources

- Uniform analysis. This is a major performance improvement.
A "uniform" value
  is basically a value that, regardless of the control flow, is identical across
  all the activated lanes. Trivial examples are immediate values and function
  arguments. Also, operations on uniform values produce uniform values, and so
  on...

- Merging of independent uniform loads (and samples). This is a major
  performance improvement once the uniform analysis is done. Basically, several
  uniform loads may be collapsed into one load if no write happens in between.
  This will obviously impact both instruction selection and register
  allocation.

Backend
-------

The code is defined in `src/backend`. The main things to do are:

- Implementing support for doubles

- Implementing atomic extensions.

- Implementing register spilling (see the [[compiler backend
  description|compiler_backend]] for more details)

- Implementing proper instruction selection. A "simple" tree matching algorithm
  should provide good results for Gen

- Improving the instruction scheduling pass

General plumbing
----------------

I tried to keep the code clean, well, as far as C++ can really be clean. There
are some header cleaning steps required though, in particular in the backend
code.

The context used in the IR code generation (see `src/ir/context.*pp`) should be
split up and cleaned up too.

I also purely and simply copied and pasted the Gen ISA disassembler from Mesa.
This leads to code duplication. Also, some messages used by OpenCL (untyped reads
and writes) are not properly decoded yet.

There are also some quick and dirty hacks, like the use of the `system` function
call. This should be cleanly replaced by popen and the like. I also directly
call the LLVM compiler executable instead of using the Clang library. All of this
should be improved and cleaned up. Track the "XXX" comments in the code.

Parts of the code leak memory when exceptions are used. There are some pointers
to track and replace with std::unique_ptr.
Note that we also added a custom memory
debugger that nicely complements Valgrind (i.e. it is fast).

diff --git a/docs/Beignet/Backend/compiler_backend.mdwn b/docs/Beignet/Backend/compiler_backend.mdwn new file mode 100644 index 00000000..32028b6c --- /dev/null +++ b/docs/Beignet/Backend/compiler_backend.mdwn @@ -0,0 +1,110 @@

Compiler Back End
=================

Well, the complete code base is somehow a compiler backend for LLVM. Here, we
really speak about the final code generation passes that you may find in
`src/backend`.

As explained in [[the scalar IR presentation|gen_ir]], we bet on a very
simple scalar IR to make it easy to parse and modify. The idea is to fix the
(very Gen specific) problems where we can, i.e. when the code is generated.

The code generation in the compiler backend is classically divided into four
steps:

- Instruction selection (defined in `src/backend/gen_insn_selection.*pp`). We
  expose an interface for the instruction selection engine. We implemented a
  very simple selection (called `SimpleSelection`) that does a quick and dirty
  one-to-many instruction generation.

- Register allocation (defined in `src/backend/gen_reg_allocation.*pp`). The
  code implements a linear scan allocator on the code selected in the previous
  pass. See below for more details about register vector allocations.

- Instruction scheduling. This one is not done yet. We just output the same
  instruction order as the program order. Note that we plan to implement an
  adaptive scheduling between register allocation and instruction selection (to
  avoid spilling as much as possible)

- Instruction encoding. This is the final step that encodes the program into Gen
  ISA.

Instruction selection
---------------------

Usually, instruction selection consists in mapping `p` IR instructions to `q`
ISA instructions under a cost driven model. Each basic block is therefore _tiled_
Each basic block is therefore _tiled_ +into some numbers of groups of ISA instructions such that the final cost is +minimized. + +The literature is particularly dense on the subject. Compilers usually use today +either tree matching methods or selection DAG techniques (as LLVM backends do) + +The instruction selection is still a work in progress in our compiler and we +only implement the most stupid (and inefficient) technique: we simply generate +as many instructions as we need for each _individual_ IR instructions. Since we +do not support immediate sources, this in particular leads to really ugly +looking code such as `mov (16) r2:f 1.f`. It is still a work in progress. + +Other than that, the instruction selection is really a book keeping structure. +We basically output `SelectionInstruction` objects which are the 1-to-1 mapping +of Gen ISA encoding functions defined in `src/backend/gen_encoder.*pp`. + +However, the `SelectionInstruction` still use unallocated virtual registers and +do *not* use vectors but simply tuples of virtual registers. + +Register allocation +------------------- + +The register allocation actually consists in two steps: + +1. Handling the vector for all the instructions that require them + +2. Performing the register allocation itself + +Step 1 consists in scanning all the vectors required by sends. Obviously, the +same register may be used in different vectors and that may lead to +interferences. We simply sort the vectors from the largest to the smallest and +allocate them in that order. As an optimization we also identify sub-vectors +i.e. vectors included in larger ones and no not allocate them. + +The code may be largely improved in particular if we take into account liveness +interferences as well. Basically, a register may be part of several vectors if the +registers that are not in both vectors at the same location are not alive at the +same time. + +This is still a work in progress. 
The code is right now handled by the method
`GenRegAllocator::allocateVector`.

Step 2 performs the register allocation itself, i.e. it associates each virtual
register to one (or several) physical registers. The first thing is that the Gen
register file is very flexible, i.e. it can (almost) be freely partitioned. To
handle this peculiarity, we simply implemented a free-list based generic memory
allocator, as done with `RegisterFilePartitioner` in `src/backend/context.cpp`.

We then simply implemented a linear scan allocator (see
`gen_reg_allocation.cpp`). Spilling is not implemented and is still a work
in progress. The thing is that spilling must be specifically handled with Gen.
Indeed:

1. Bad point. Spilling is expensive and requires assembling messages for it

2. Good point. Gen is able to spill up to 256 _contiguous_ bytes in one message.
This must be used for high performance spilling, and this may require properly
reordering the registers to spill.

Instruction scheduling
----------------------

Intra-basic-block instruction scheduling is relatively simple. It is not
implemented yet.

Instruction encoding
--------------------

This is mostly done in `src/backend/gen_context.cpp` and
`src/backend/gen_encoder.*pp`. This is mostly glue code and it is pretty
straightforward. We just forward the selection code using the physically
allocated registers. There is nothing special here. Just boilerplate.

diff --git a/docs/Beignet/Backend/flat_address_space.mdwn b/docs/Beignet/Backend/flat_address_space.mdwn new file mode 100644 index 00000000..3018a295 --- /dev/null +++ b/docs/Beignet/Backend/flat_address_space.mdwn @@ -0,0 +1,98 @@

Flat Address Space
==================

Segmented address space...
--------------------------

The first challenge with OpenCL is its very liberal use of pointers. The memory
is segmented into several address spaces:

- private. This is the memory for each work item

- global.
These are buffers in memory shared by all work items and work groups

- constant. These are constant buffers in memory shared by all work items and
work groups as well

- local. This is memory shared by all work items in the *same* work group

... But with no restriction inside each address space
-----------------------------------------------------

The challenge is that there is no restriction in OpenCL inside each address
space, i.e. the full C semantics apply, in particular regarding pointer
arithmetic.

Therefore the following code is valid:

<code>
\_\_kernel void example(\_\_global int *dst, \_\_global int *src0, \_\_global int *src1)<br/>
{<br/>
  \_\_global int *from;<br/>
  if (get\_global\_id(0) % 2)<br/>
    from = src0;<br/>
  else<br/>
    from = src1;<br/>
  dst[get\_global\_id(0)] = from[get\_global\_id(0)];<br/>
}
</code>

As one may see, the load done in the last line actually mixes pointers from both
sources src0 and src1. This typically makes the use of binding table indices
pretty hard. If we use binding table 0 for dst, 1 for src0 and 2 for src1 (for
example), we are not able to express the load in the last line with one send
only.

No support for stateless in required messages
---------------------------------------------

Furthermore, on IVB, we are going to use four types of messages to implement the
loads and the stores:

- Byte scattered reads. They are used to read bytes/shorts/integers that are not
aligned on 4 bytes. This is a gather message, i.e. the user provides up to 16
addresses

- Byte scattered writes. They are used to write bytes/shorts/integers that are not
aligned on 4 bytes. This is a scatter message, i.e. the user provides up to 16
addresses

- Untyped reads. They allow reading from 1 to 4 double words (i.e. 4 bytes each)
per lane. This is also a gather message, i.e. up to 16 addresses are provided per
message.

- Untyped writes.
They are the counterpart of the untyped reads

The problem is that IVB does not support stateless accesses for these messages,
so surfaces are required. Secondly, stateless messages are not that interesting
since all of them require a header which is still slow to assemble.

Implemented solution
--------------------

The solution is actually quite simple. Even with no stateless support, it is
actually possible to simulate it with a surface. As one may see in the run-time
code in `intel/intel_gpgpu.c`, we simply create a surface:

- 2GB big

- Which starts at offset 0

Surprisingly, this surface can actually map the complete GTT address space, which
is 2GB big. One may look at the `flat_address_space` unit test in the run-time
code that creates and copies buffers in such a way that the complete GTT address
space is traversed.

This solution leads to a pretty simple implementation on the compiler side.
Basically, there is nothing to do when translating from LLVM to Gen ISA. A
pointer to `__global` or `__constant` memory is simply a 32-bit offset in that
surface.

Related problems
----------------

There is one drawback to this approach. Since we use a 2GB surface that maps
the complete GTT space, there is no protection at all. Each write can therefore
potentially modify any buffer, including the command buffer, the frame buffer or
the kernel code. There is *no* protection at all in the hardware to prevent
that.

diff --git a/docs/Beignet/Backend/gen_ir.mdwn b/docs/Beignet/Backend/gen_ir.mdwn new file mode 100644 index 00000000..ae24729c --- /dev/null +++ b/docs/Beignet/Backend/gen_ir.mdwn @@ -0,0 +1,254 @@

Scalar Intermediate Representation
==================================

The IR code is included in `src/ir/` of the compiler code base.
The IR as designed in this compiler is the fruit of a long reflection I mostly
had with Thomas Raoux. Note that I usually call it "Gen IR".
Scalar vs vector IR
-------------------

This is actually the major question: do we need a vector IR or a scalar IR? On
the LLVM side, we have both. LLVM IR can manipulate vectors and scalars (and
even generalized values, but we can ignore them for now).

For that reason, the Clang front-end generates both scalar and vector code.
Typically, a `uint4` variable will output a vector of 4 integers. Arithmetic
computations will be directly done on vector variables.

On the HW side, the situation is completely different:

- We are going to use the parallel mode (align1), i.e. the struct-of-array mode
  for the EU. This is a SIMD scalar mode.

- The only source of vectors we are going to have is the send instructions
  (and marginally some other instructions like the div_rem math instruction)

One may therefore argue that we need vector instructions to handle the sends.
Sends will indeed require both vector destinations and sources. This may be a
strong argument *for* vectors in the IR. However, the situation is not that
good.

Indeed, if we look carefully at the send instructions, we see that they will
require vectors that are *not* vectors in LLVM IR. This code for example:

<code>
\_\_global uint4 *src, *dst;<br/>
uint4 x = src[get\_global\_id(0)];<br/>
dst[get\_global\_id(0)] = x;<br/>
</code>

will translate the store in the last line into an untyped write in the Gen ISA.
Unfortunately, the address and the values to write are in the *same* vector
(the message payload). However, LLVM IR will output a store like:

`store(%addr, %value)`

which basically uses one scalar (the address) and one value (the vector to
write). Therefore, even if we handled vectors in the IR, that would not directly
solve the problem we have at the end for the send instructions.
We therefore decided to go in the other direction:

- We have a purely scalar IR

- To replace vectors, we simply use multiple sources and destinations

- Real vectors required by the send instructions are handled at the very bottom
of the stack, in the register allocation passes.

This leads to a very simple intermediate representation which is mostly a pure
scalar RISC machine.

Very limited IR
---------------

The other major question, in particular when you look at similar stacks like
NVidia PTX, is:

do we need to encode register modifiers (abs, negate...) and immediate operands
(like in add.f x y 1.0) in the IR?

Contrary to other IRs (PTX, and even LLVM, which both support immediates), we
chose to have a very simple IR, much simpler than the final ISA, and to merge
back what we need at the instruction selection pass. Since we need instruction
selection anyway, let us keep the IR simple.

Also, there are a lot of major issues that cannot be covered in the IR and
need to be specifically handled at the very end of the code generation:

- send vectors (see the previous section)

- send headers (value and register allocation), which are also part of the vector
problem

- SIMD8 mode in SIMD16 code. Some send messages do not support SIMD16 encoding
and require SIMD8. Typical examples are typed writes, i.e. scatters to textures.
This cannot be encoded in any way in a regular scalar IR.

For these reasons, most of the problems directly related to Gen naturally find
their solutions in either the instruction selection or the register allocator.

This leads to the following strategy:

- Keep the IR very simple and limited

- Use all the analysis tools you need in the IR before the final code generation
to build any information you need. This is pure "book-keeping".
- Use any previous analysis and finish the job at the very end

This classical approach limits the complexity in the IR while forcing us
to write the proper tools for the final stages.

Why not use LLVM IR directly?
-----------------------------

We hesitated a long time between writing a dedicated IR (as we did) and just
using LLVM IR. Indeed, LLVM comes with a large set of tools that are part of
the "LLVM backends". LLVM provides a lot of tools to perform instruction
selection (`SelectionDAG`) and register allocation. Two things however
prevented us from choosing this path:

- We only have limited experience with LLVM and no experience at all with
LLVM backends

- LLVM register allocators do not handle the peculiarities of Gen at all:

  * flexible register file. Gen registers are more like memory than registers
    and can be freely allocated and aliased. LLVM register allocators only
    support partial aliasing like x86 machines do (rax -> eax -> ax)

  * no proper tools to handle vectors in the register allocator as we need for
    sends

Since we would need to do some significant work anyway, this led us to choose the
more hard-coded path with an in-house IR. Note that this will not prevent us from
later implementing an LLVM backend "by the book", as NVidia does today with PTX
(using an LLVM backend to do the LLVM IR -> PTX conversion).


SSA or no SSA
-------------

Since we have a purely scalar IR, implementing an SSA transformation on the IR
might be convenient. However, most of the literature about compiler back-ends
uses non-SSA representations of the code. Since the primary goal is to write a
compiler _back-end_ (instruction selection, register allocation and instruction
scheduling), we keep the code in non-SSA form, leaving the higher-level
optimizations to LLVM.
Types, registers, instructions, functions and units
---------------------------------------------------

The IR is organized as follows:

- Types (defined in `src/ir/type.*pp`). These are scalar types only. Since the
  code is completely lowered down, there are no more references to structures,
  pointers or vectors. Everything is scalar values, and when "vectors" or
  "structures" would be needed, we use multiple scalar sources or
  destinations instead.

- Registers (defined in `src/ir/register.*pp`). They are untyped (since Gen
  registers are untyped) and we have 65,535 of them per function

- Instructions (defined in `src/ir/instruction.*pp`). They are typed (to
  distinguish integer and FP adds, for example) and possibly support multiple
  destinations and sources. We also provide a convenient framework to introspect
  the instructions in a simple (and memory efficient) way

- Functions (defined in `src/ir/function.*pp`). They are basically the
  counterpart of LLVM functions or OpenCL kernels. Note that function arguments
  are a problem. We actually use the PTX ABI. Everything smaller than the machine
  word size (i.e. 32 bits for Gen) is passed by value in a register. Everything
  bigger than that is passed by pointer with a ByVal attribute.
  Note that this requires some special treatment in the IR (see below) to make
  the code faster by replacing function argument loads by "pushed constants". We
  also define one "register file" per function, i.e. the registers are defined
  relative to the function that uses them. Each function is made of basic
  blocks, i.e. sequences of instructions that are executed linearly.

- Units (defined in `src/ir/unit.*pp`). Units are just a collection of
  functions and constants (not supported yet).

Function arguments and pushed constants
---------------------------------------

Gen can push values into the register file, i.e. some registers are preset when
the kernel starts to run.
As detailed previously, the PTX ABI is convenient
since every argument is either one register or one pointer to load from or to
store to.

However, when a pointer is used for an argument, loads are issued which may be
avoided by using constant pushes.

Once again, OCL makes the task a bit harder than expected. Indeed, the C
semantics once again apply to function arguments as well.

Look at these three examples:

### Case 1. Direct loads -> constant push can be used

<code>
struct foo { int x; int y; };<br/>
\_\_kernel void case1(\_\_global int *dst, struct foo bar)<br/>
{<br/>
  dst[get\_global\_id(0)] = bar.x + bar.y;<br/>
}
</code>

We use a _direct_ _load_ for `bar` with `bar.x` and `bar.y`. The values can be
pushed into registers and we can replace the loads by register reads.

### Case 2. Indirect loads -> we need to load the values from memory

<code>
struct foo { int x[16]; };<br/>
\_\_kernel void case2(\_\_global int *dst, struct foo bar)<br/>
{<br/>
  dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
}
</code>

We use an indirect load with `bar.x[get\_local\_id(0)]`. Here we need to issue a
load from memory (well, actually, we could do a gather from registers, but it is
not supported yet).

### Case 3. Writes to arguments -> we need to spill the values to memory first

<code>
struct foo { int x[16]; };<br/>
\_\_kernel void case3(\_\_global int *dst, struct foo bar)<br/>
{<br/>
  bar.x[0] = get\_global\_id(1);<br/>
  dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
}
</code>

Here the values are written before being read. This causes some trouble since
we are running in SIMD mode. Indeed, we only have *one* instance of the function
arguments in memory, while *many* SIMD lanes and actually *many* hardware
threads are running at the same time. This means that we cannot write the data
back to the shared location: we need to allocate a private area for each SIMD
lane.
In that case, we need to spill the function arguments back into memory. We spill
once per SIMD lane. Then, we read from this private area rather than from the
function arguments directly.

This analysis is partially done today in `src/ir/lowering.*pp`. We identify all
the cases but only the case with constant pushing is fully implemented.
Actually, the last two cases are easy to implement but this requires one or two
days of work.

Value and liveness analysis tools
---------------------------------

You may also notice that we provide a complete framework for value analysis
(i.e. to figure out when a value or instruction destination is used and where the
instruction sources come from). The code is in `src/ir/value.*pp`. Well, today,
this code burns a crazy amount of memory (use of std::set all over the
place) but it at least provides the analysis required by many other passes.
Compacting the data structures and using O(n) algorithms instead of the O(ln(n))
ones are on the TODO list for sure :-)

Finally, we also provide a liveness analysis tool which simply figures out which
registers are alive at the end of each block (classically, the "live out" sets).

diff --git a/docs/Beignet/Backend/unstructured_branches.mdwn b/docs/Beignet/Backend/unstructured_branches.mdwn new file mode 100644 index 00000000..37a294c5 --- /dev/null +++ b/docs/Beignet/Backend/unstructured_branches.mdwn @@ -0,0 +1,271 @@

Unstructured Branches
=====================

A major challenge in making an OpenCL compiler is certainly to handle any kind of
branches. Indeed, LLVM does not make any distinction between structured and
unstructured branches. See [here](http://llvm.org/docs/LangRef.html) for a
complete description of the LLVM assembly specification.

The C branching code is simply lowered down into the following instructions:

- `ret` to return from the current function
- `br` that, if predicated, possibly jumps to one of two destinations (one for
  the taken branch and one for the other).
- `switch` that implements the C switch/case construct.
- `indirectbr` that implements a jump table.
- `invoke` and `resume`, mostly used to handle exceptions.

Exceptions and jump tables are not supported in OpenCL. Switch cases can be
lowered down to a sequence of if/else statements (using a divide and conquer
approach, a switch/case can be dispatched in log(n) complexity where n is the
number of targets).

This leads us to properly implement the `br` and `ret` instructions.

Solution 1 - Using Gen structured branches
------------------------------------------

Gen structured branches are the following instructions:

`if` `else` `endif` `break` `continue` `while` `brd` `brc`

Transforming the LLVM IR code into structured code basically amounts to
reverse-engineering the LLVM code back into the original C code.
Unfortunately, there are several key problems:

- OpenCL supports the `goto` keyword which may jump to an arbitrary location.
- LLVM may transform the control flow graph into any kind of form.
- Worse, a reducible control flow graph can be turned into an irreducible one
by the optimizer.

This can lead to complicated code transforms and basic block duplication. The
specification allows the compiler to abort if an irreducible control flow is
detected, but as an implementor, it is quite awkward to abort the compilation
just because the optimizer turned a reducible CFG into an irreducible one.
Using structured branches is an open door to many corner cases.

The thing is, there exists a pretty elegant solution that is almost seamlessly
supported by Gen. This is the solution we retained.

Solution 2 - Linearizing the control flow graph
-----------------------------------------------

The general problem is to map a general control flow graph to a SIMD machine.
The problem is fairly well understood today.
A recent research paper dedicated to OpenCL-like languages that use the "SPMD"
(single program, multiple data) programming model presents interesting
insights about how to map such languages to SIMD architectures (see
[here](http://www.cdl.uni-saarland.de/papers/karrenberg_opencl.pdf)).

### Core idea

- Linearizing the CFG initially consists in removing all forward branches and
"replacing" them by predication. Indeed, the program will still be correct if
we predicate instructions instead of taking forward jumps. This is basically a
control-flow to data-flow conversion.

- Of course, removing all forward branches is inefficient. To improve that, we
simply introduce "if conditions" at the head of basic blocks to know if we run
the basic block. If no lane is going to be activated in the basic block, we
jump to another basic block where _potentially_ some lanes are going to be
reactivated.

Consider the following CFG:

<pre>
o-------o
|       |
|   1   |---->-----o
|       |          |
o-------o          |
    |              |
    |              |
o-------o          |
|       |          |
|   2   |---->-----------o
|       |          |     |
o-------o          |     |
    |              |     |
    |              |     |
    |   o------o   |     |
    |   |      |   |     |
    |   v      |   |     |
o-------o      |   |     |
|       |      |   |     |
|   3   |      |   |     |
|       |      |   |     |
o-------o      |   |     |
    |   |      |   |     |
    |   o------o   |     |
    |              |     |
o-------o          |     |
|       |          |     |
|   4   |<---------o     |
|       |                |
o-------o                |
    |                    |
    |                    |
o-------o                |
|       |                |
|   5   |<---------------o
|       |
o-------o
</pre>

Mapping it to a SIMD machine may seem challenging. Actually, it is not too
complicated. The problem is the 2->5 jump. Indeed, we have to be sure that we
are not missing any computation done in block 4.

To do so:

- Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.
- We implement a `JOIN` point on top of block 4. We check if any lane is going
to be reactivated for block 4. If not, we jump to block 5.
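The `JOIN` check can be pictured with a scalar C emulation (a hypothetical sketch, not the actual backend code; the lane count and all names are invented): lanes leaving block 2 now fall into block 4, whose entry tests whether any lane actually wants to run it before skipping ahead to block 5:

```c
#include <assert.h>

#define W 4   /* assumed SIMD width for this sketch */

/* Emulated JOIN point on top of block 4: run the block body only for
 * lanes whose next block is 4; if no lane targets block 4, the backend
 * would jump straight to block 5. Returns whether any lane was active. */
static int join_block4(const int target[W], int ran_block4[W]) {
    int any_active = 0;
    for (int i = 0; i < W; ++i)
        if (target[i] == 4) any_active = 1;            /* the JOIN test */
    for (int i = 0; i < W; ++i)
        ran_block4[i] = any_active && target[i] == 4;  /* predicated body */
    return any_active;
}
```

On real hardware this test is a single predicated jump, not a loop; the loop above only emulates the per-lane semantics.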
This leads to the following linearized CFG:

<pre>
o-------o
|       |
|   1   |---->-----o
|       |          |
o-------o          |
    |              |
    |              |
o-------o          |
|       |          |
|   2   |---->-----------o
|       |          |     |
o-------o          |     |
    |              |     |
    |              |     |
    |   o--<---o   |     |
    |   |      |   |     |
    |   v      |   |     |
o-------o      |   |     |
|       |      |   |     |
|   3   |      ^   |     |
|       |      |   |     |
o-------o      |   |     |
    |   |      |   |     |
    |   o-->---o   |     |
    |              |     |
o-------o          |     |
|       |==========|=====|====O
|   4   |<---------|-----o    |
|       |<---------o          |
o-------o                     |
    |                         |
    |                         |
o-------o                     |
|       |                     |
|   5   |<====================O
|       |
o-------o
</pre>

There is a new jump from block 4 to block 5.

### Implementation on Gen

When using structured branches, Gen supports auto-masking, i.e. based on which
branches are taken, the control flow is properly handled and masks are
automatically applied on all instructions.

However, there is no similar support for unstructured branches. We therefore
decided to mask instructions manually and use a single program flow. This is
actually quite easy to do since Gen is able to predicate any branch.

Now, how do we evaluate the if conditions in an efficient way?

The choice we made is to use *per-lane block IPs*: we store a short (16 bits)
for each SIMD lane in a regular 256-bit GPR (general purpose register). This
"blockIP" register is used in the following way:

At the beginning of each block, we compare the blockIP register with the ID of
the block. A lane is going to be _activated_ if its blockIP is _smaller_ than
the ID of the block. Otherwise, the lane is deactivated.

Therefore, we build a flag register at the entry of each basic block with a
single 16-wide uint16_t compare. If no lane is activated, a jump is performed
to the next block where some lanes are going to be activated.

Since these are regular jumps, we just use the `jmpi` instruction. With the
help of predication, we can express all the different possibilities:

- backward branches are always taken if _any_ of the lanes in the predicate is
true.
We just use `<+f0.0.anyh>` predication.
- forward branches are *not* taken if some of the lanes are going to be
activated in the next block. We therefore compare the blockIP with the ID of
the _next_ block. If all of them are strictly greater than the ID of the next
block, we jump. We therefore use the `<+f0.0.allh>` predicate in that case.
- `JOIN` points are even simpler. We simply jump if none of the lanes is
activated. We therefore use the `<-f0.0.anyh>` predicate.

The complete encoding is done in `src/backend/gen_insn_selection.cpp`. Forward
branches are handled by `SimpleSelection::emitForwardBranch`. Backward
branches are handled by `SimpleSelection::emitBackwardBranch`. Finally, since
`JOIN` points are at the top of each basic block, they are handled by
`SimpleSelection::emitLabelInstruction`.

### Computing `JOIN` points

The last problem is to compute the `JOIN` points, i.e. we need to know, at the
beginning of each block, whether we need to jump and, if we do, what the
target of the branch is. The code is relatively straightforward and can be
found in `src/backend/context.cpp`. The function is `Context::buildJIPs`.

Actually, the current implementation is not that elegant. A colleague, Thomas
Raoux, has a simpler and better idea to handle it.

### Advantages and drawbacks of the method

- The method has one decisive advantage: it is simple and extremely robust. It
can handle any kind of CFG (reducible or not) and does not require any
transformation. The use of shorts is also not random: a 16-wide short compare
is issued in 2 cycles (so it is twice as fast as a 16-wide 32-bit compare).
- The main drawback is performance. Even if it is not so bad, we still need
more instructions than if we used structured branches.
Mostly:

  * one or two instructions for `JOIN` points
  * three instructions for backward and forward jumps (two more than
    structured branches, which just require the branch instruction itself)

Note that all these extra instructions are 16-bit instructions (i.e. they use
shorts), so they will only cost 2 cycles anyway.

The last point is that the Gen encoding requires the conditional modifier and
the predicate to be the same in an instruction. This requires copying or
recomputing the flag register for compares and selects. So one more
instruction is required for these two instructions. Once again, this only
costs 2 cycles.

Remarks on `ret` instructions
-----------------------------

Since we can handle any kind of CFG, handling return statements is relatively
straightforward. We first create one return block at the end of the program.
Then we replace all other returns by an unconditional jump to this block. The
CFG linearization takes care of the rest. We then simply encode the (single)
return instruction as an End-Of-Thread message (EOT).

Code examples
-------------

Some tests were written to assert the correctness of the CFG linearization and
the code generation. They can be found in the _run-time_ code base here:

`utest/compiler_if_else.cpp`

`utest/compiler_lower_return0.cpp`

`utest/compiler_lower_return1.cpp`

`utest/compiler_lower_return2.cpp`

`utest/compiler_short_scatter.cpp`

`utest/compiler_unstructured_branch0.cpp`

`utest/compiler_unstructured_branch1.cpp`

`utest/compiler_unstructured_branch2.cpp`

`utest/compiler_unstructured_branch3.cpp`
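As a closing illustration, the per-lane blockIP activation test described above can be emulated in plain C (a sketch under stated assumptions: the names are invented, the loop stands in for what is a single 16-wide compare on Gen, and the exact comparison operator is an implementation detail, so `<=` here is an assumption):

```c
#include <assert.h>
#include <stdint.h>

#define W 16   /* SIMD16: one short per lane, fitting a 256-bit GPR */

/* Emulated block-entry test: build the per-lane flag from the blockIP
 * register and report whether any lane is active in this block. If no
 * lane is, the backend would issue a predicated jmpi to the next block
 * where some lanes reactivate. */
static int block_entry_mask(const uint16_t blockIP[W], uint16_t blockID,
                            uint16_t mask[W]) {
    int any = 0;
    for (int i = 0; i < W; ++i) {
        mask[i] = (blockIP[i] <= blockID);   /* lane active in this block? */
        any |= mask[i];
    }
    return any;
}
```

The flag register built this way is then reused to predicate every instruction of the block, which is what replaces the hardware auto-masking of structured branches.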