author    Zhigang Gong <zhigang.gong@linux.intel.com> 2013-06-25 18:04:53 +0800
committer Zhigang Gong <zhigang.gong@linux.intel.com> 2013-06-25 18:55:08 +0800
commit    97c3a9b076ccd24af09fb6ce3f2be7b8e3f93a4f (patch)
tree      1650dbb208d4ceae5a8ff7dd37a91c1a6e13941e /docs
parent    7508682d00101231230b6dbbb20b51c5d67deb17 (diff)
Docs: Rearrange documents according to wiki page on fd.o.
We use the fd.o wiki to host our documents. To make the maintenance easier, I changed the directory layout and some links in the markdown files, and made them the same as fd.o's.

Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Diffstat (limited to 'docs')
-rw-r--r--  docs/Beignet.mdwn                               138
-rw-r--r--  docs/Beignet/Backend.mdwn                        58
-rw-r--r--  docs/Beignet/Backend/TODO.mdwn                  109
-rw-r--r--  docs/Beignet/Backend/compiler_backend.mdwn      110
-rw-r--r--  docs/Beignet/Backend/flat_address_space.mdwn     98
-rw-r--r--  docs/Beignet/Backend/gen_ir.mdwn                254
-rw-r--r--  docs/Beignet/Backend/unstructured_branches.mdwn 271
7 files changed, 1038 insertions, 0 deletions
diff --git a/docs/Beignet.mdwn b/docs/Beignet.mdwn
new file mode 100644
index 00000000..e31cce1d
--- /dev/null
+++ b/docs/Beignet.mdwn
@@ -0,0 +1,138 @@
+Beignet
+=======
+
+Beignet is an open source implementation of the OpenCL specification - a generic
+compute oriented API. This code base contains the code to run OpenCL programs on
+Intel GPUs. It basically defines and implements the OpenCL host functions
+required to initialize the device, create the command queues, the kernels and
+the programs, and run them on the GPU. The code base also contains the compiler
+part of the stack, which is included in `backend/`. For more specific information
+about the compiler, please refer to `backend/README.md`.
+
+How to build
+------------
+
+The project uses CMake with three profiles:
+
+1. Debug (-g)
+2. RelWithDebInfo (-g with optimizations)
+3. Release (only optimizations)
+
+Basically, from the root directory of the project
+
+`> mkdir build`
+
+`> cd build`
+
+`> cmake ../ # to configure`
+
+Choose whatever profile you want for the build.
+
+If you use the interactive `ccmake` interface, press 'c' to configure and 'g'
+to generate the build files.
+
+`> make`
+
+The project depends on several external libraries:
+
+- Several X components (XLib, Xfixes, Xext)
+- libdrm libraries (libdrm and libdrm\_intel)
+- Various LLVM components
+- The compiler backend itself (libgbe)
+- Mesa git master version built with gbm enabled to support extension cl\_khr\_gl\_sharing.
+
+CMake will check the dependencies and will complain if it does not find them.
+
+CMake will also build the backend project. Please refer to
+[[OpenCL Gen Backend|Beignet/Backend]] for more dependencies.
+
+Once built, the run-time produces a shared object libcl.so which basically
+directly implements the OpenCL API. A set of tests are also produced. They may
+be found in `utests/`.
+
+Note that the compiler depends on LLVM (Low-Level Virtual Machine project).
+LLVM 3.1, 3.2 and 3.3 are supported; the code will not compile with anything
+older. Releases are available at:
+
+[http://llvm.org/releases/](http://llvm.org/releases/)
+
+Also note that the code was compiled with GCC 4.6 and GCC 4.7. Since the code
+uses fairly recent C++11 features, you may expect problems with older compilers.
+Last time I tried, the code broke ICC 12 and Clang with internal compiler errors
+while compiling anonymous nested lambda functions.
+
+How to run
+----------
+
+Apart from the OpenCL library itself that can be used by any OpenCL application,
+this code also produces various tests to ensure the consistency of the compiler
+and the run-time. This small test framework uses a simple C++ registration
+system to register all the unit tests.
+
+You need to set the variable `OCL_KERNEL_PATH` to locate the OCL kernels. They
+are shipped with the run-time in `./kernels`.
+
+Then in `utests/`:
+
+`> ./utest_run`
+
+will run all the unit tests one after the other
+
+`> ./utest_run some_unit_test0 some_unit_test1`
+
+will only run the `some_unit_test0` and `some_unit_test1` tests
+
+Supported Hardware
+------------------
+
+The code was tested on IVB GT2 with the Ubuntu and Fedora Core distributions.
+Only IVB is supported right now. Actually, the code was only run on IVB GT2; you
+may expect some issues with IVB GT1.
+
+TODO
+----
+
+The run-time is far from being complete. Most of the pieces have been put
+together to test and develop the OpenCL compiler. A partial list of things to
+do:
+
+- Complete cl\_khr\_gl\_sharing support. We lack the implementation of some APIs
+  such as clCreateFromGLBuffer, clCreateFromGLRenderbuffer, clGetGLObjectInfo...
+  Currently, the working APIs are clCreateFromGLTexture and clCreateFromGLTexture2D.
+
+- Support for events.
+
+- Check that NDRangeKernels can be pushed into _different_ queues from several
+ threads.
+
+- Support for non-blocking Enqueue\*Buffer calls. For now we only use the map
+  extension to implement those Enqueue\*Buffer functions.
+
+- No state tracking at all. One batch buffer is created at each "draw call"
+  (i.e. for each NDRangeKernels). This is really inefficient since some
+  expensive pipe controls are issued for each batch buffer.
+
+- Valgrind reports some leaks in libdrm. They look like false positives but
+  this has to be checked. The same goes for LLVM, where there is one leak to
+  check.
+
+More generally, everything in the run-time that triggers the "FATAL" macro means
+that something that must be supported is not implemented properly (either it
+does not comply with the standard or it is just missing).
+
+Project repository
+------------------
+Right now, we host our project on fdo at: git://anongit.freedesktop.org/beignet.
+
+The team
+--------
+This project was created by Ben Segovia when he was working for Intel. Now a
+team in the China OTC graphics department continues to work on this project.
+The official contact for this project is Zou Nanhai (<nanhai.zou@intel.com>).
+
+How to contribute
+-----------------
+You are always welcome to contribute to this project: just subscribe to the
+beignet mailing list and send patches to it for review.
+The official mailing list is:
+http://lists.freedesktop.org/mailman/listinfo/beignet
diff --git a/docs/Beignet/Backend.mdwn b/docs/Beignet/Backend.mdwn
new file mode 100644
index 00000000..99d678e7
--- /dev/null
+++ b/docs/Beignet/Backend.mdwn
@@ -0,0 +1,58 @@
+Beignet Compiler
+================
+
+This code base contains the compiler part of the Beignet OpenCL stack. The
+compiler is responsible for taking an OpenCL language string and compiling it
+into a binary that can be executed on Intel integrated GPUs.
+
+Limitations
+-----------
+
+Today, the compiler is far from complete. See [[here|Backend/TODO]] for an
+(incomplete) list of things to do.
+
+Interface with the run-time
+---------------------------
+
+Even if the compiler makes a very liberal use of C++ (templates, variadic
+templates, macros), we really tried hard to make a very simple interface with
+the run-time. The interface is therefore a pure C99 interface and it is defined
+in `src/backend/program.h`.
+
+The goal is to hide the complexity of the inner data structures and to enable
+simple run-time implementation using straightforward C99.
+
+Note that the data structures are fully opaque: this allows us to use either
+the C++ simulator or the real Gen program in a relatively non-intrusive way.
+
+Various environment variables
+-----------------------------
+
+Environment variables are used all over the code. Most important ones are:
+
+- `OCL_SIMD_WIDTH` `(8 or 16)`. Change the number of lanes per hardware thread
+
+- `OCL_OUTPUT_GEN_IR` `(0 or 1)`. Output Gen IR (scalar intermediate
+ representation) code
+
+- `OCL_OUTPUT_LLVM` `(0 or 1)`. Output LLVM code after the lowering passes
+
+- `OCL_OUTPUT_LLVM_BEFORE_EXTRA_PASS` `(0 or 1)`. Output LLVM code before the
+ lowering passes
+
+- `OCL_OUTPUT_ASM` `(0 or 1)`. Output Gen ISA
+
+- `OCL_OUTPUT_REG_ALLOC` `(0 or 1)`. Output Gen register allocations
+
+Implementation details
+----------------------
+
+Several key decisions may use the hardware in an unusual way. See the following
+documents for the technical details about the compiler implementation:
+
+- [[Flat address space|flat_address_space]]
+- [[Unstructured branches|unstructured_branches]]
+- [[Scalar intermediate representation|gen_ir]]
+- [[Clean backend implementation|compiler_backend]]
+
+Ben Segovia.
diff --git a/docs/Beignet/Backend/TODO.mdwn b/docs/Beignet/Backend/TODO.mdwn
new file mode 100644
index 00000000..3f1ccb4c
--- /dev/null
+++ b/docs/Beignet/Backend/TODO.mdwn
@@ -0,0 +1,109 @@
+TODO
+====
+
+The compiler is far from complete. Even if the skeleton is now done and should
+be solid, there are a _lot_ of things to do, from trivial to complex.
+
+OpenCL standard library
+-----------------------
+
+Today we define the OpenCL API in the header file `src/ocl_stdlib.h`. This file
+is far from being complete.
+
+By the way, one question remains: do we want to implement
+the high-precision functions as _inline_ functions or as external functions to
+call? Indeed, inlining all functions may lead to severe code bloat while
+calling functions will require implementing a proper ABI. We certainly want to
+do both actually.
+
+LLVM front-end
+--------------
+
+The code is defined in `src/llvm`. We used the PTX ABI and the OpenCL profile
+to compile the code. Therefore, a good part of the job is already done. However,
+many things must be implemented:
+
+- Lowering down of various intrinsics like `llvm.memcpy`
+
+- Conformance test for all OpenCL built-ins (`native_cos`, `native_sin`,
+ `mad`, atomic operations, barriers...).
+
+- Lowering down of int16 / int8 / float16 / char16 / char8 / char4 loads and
+ stores into the supported loads and stores
+
+- Support for local declaration of local arrays (the OpenCL profile will
+  properly declare them as global arrays)
+
+- Support for doubles
+
+- Support atomic extensions.
+
+- Better resolving of the PHI functions. Today, we always generate MOV
+  instructions at the end of each basic block. They can be easily optimized.
+
+Gen IR
+------
+
+The code is defined in `src/ir`. Main things to do are:
+
+- Bringing support for doubles
+
+- Adding support for atomic extensions.
+
+- Finishing the handling of function arguments (see the [[IR
+ description|gen_ir]] for more details)
+
+- Adding support for linking IR units together. OpenCL indeed allows creating
+  programs from several sources
+
+- Uniform analysis. This is a major performance improvement. A "uniform" value
+  is basically a value which, regardless of the control flow, is identical in
+  all the activated lanes. Trivial examples are immediate values and function
+  arguments. Also, operations on uniforms will produce uniform values, and so
+  on...
+
+- Merging of independent uniform loads (and samples). This is a major
+ performance improvement once the uniform analysis is done. Basically, several
+  uniform loads may be collapsed into one load if no write happens in-between.
+ This will obviously impact both instruction selection and the register
+ allocation.
+
+Backend
+-------
+
+The code is defined in `src/backend`. Main things to do are:
+
+- Implementing support for doubles
+
+- Implementing atomic extensions.
+
+- Implementing register spilling (see the [[compiler backend
+ description|compiler_backend]] for more details)
+
+- Implementing proper instruction selection. A "simple" tree matching algorithm
+ should provide good results for Gen
+
+- Improving the instruction scheduling pass
+
+General plumbing
+----------------
+
+I tried to keep the code clean, well, as far as C++ can be really clean. There
+are some header cleaning steps required though, in particular in the backend
+code.
+
+The context used in the IR code generation (see `src/ir/context.*pp`) should be
+split up and cleaned up too.
+
+I also purely and simply copied and pasted the Gen ISA disassembler from Mesa.
+This leads to code duplication. Also some messages used by OpenCL (untyped reads
+and writes) are not properly decoded yet.
+
+There are also some quick and dirty hacks, like the use of the function call
+`system(...)`. This should be cleanly replaced by `popen` and the like. I also
+directly called the LLVM compiler executable instead of using the Clang
+library. All of this should be improved and cleaned up. Track "XXX" comments in
+the code.
+
+Parts of the code leak memory when exceptions are used. There are some pointers
+to track and replace with std::unique_ptr. Note that we also added a custom
+memory debugger that nicely complements Valgrind (i.e. it is fast).
diff --git a/docs/Beignet/Backend/compiler_backend.mdwn b/docs/Beignet/Backend/compiler_backend.mdwn
new file mode 100644
index 00000000..32028b6c
--- /dev/null
+++ b/docs/Beignet/Backend/compiler_backend.mdwn
@@ -0,0 +1,110 @@
+Compiler Back End
+=================
+
+Well, the complete code base is somehow a compiler backend for LLVM. Here, we
+really speak about the final code generation passes that you may find in
+`src/backend`.
+
+As explained in [[the scalar IR presentation|gen_ir]], we bet on a very
+simple scalar IR to make it easy to parse and modify. The idea is to fix the
+unrelated (and very Gen specific) problems where we can, i.e. when the final
+code is generated.
+
+The code generation in the compiler backend is classically divided into four
+steps:
+
+- Instruction selection (defined in `src/backend/gen_insn_selection.*pp`). We
+ expose an interface for the instruction selection engine. We implemented a
+ very simple selection (called `SimpleSelection`) that does a quick and dirty
+ one-to-many instruction generation.
+
+- Register allocation (defined in `src/backend/gen_reg_allocation.*pp`). The
+ code implements a linear scan allocator on the code selected in the previous
+ pass. See below for more details about register vector allocations.
+
+- Instruction scheduling. This one is not done yet. We just output the same
+  instruction order as the program order. Note that we plan to implement an
+  adaptive scheduling between register allocation and instruction selection
+  (to avoid spilling as much as possible).
+
+- Instruction encoding. This is the final step that encodes the program into Gen
+ ISA.
+
+Instruction selection
+---------------------
+
+Usually, the instruction selection consists in mapping `p` instructions to `q`
+ISA instructions under a cost driven model. Each basic block is therefore _tiled_
+into some numbers of groups of ISA instructions such that the final cost is
+minimized.
+
+The literature is particularly dense on the subject. Today, compilers usually
+use either tree matching methods or selection DAG techniques (as LLVM backends
+do).
+
+The instruction selection is still a work in progress in our compiler, and we
+only implement the most stupid (and inefficient) technique: we simply generate
+as many instructions as we need for each _individual_ IR instruction. Since we
+do not support immediate sources yet, this in particular leads to really ugly
+looking code such as `mov (16) r2:f 1.f`. It is still a work in progress.
+
+Other than that, the instruction selection is really a bookkeeping structure.
+We basically output `SelectionInstruction` objects which are the 1-to-1 mapping
+of the Gen ISA encoding functions defined in `src/backend/gen_encoder.*pp`.
+
+However, the `SelectionInstruction` objects still use unallocated virtual
+registers and do *not* use vectors but simply tuples of virtual registers.
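
The one-to-many expansion described above can be sketched as follows. This is a
toy illustration with made-up mnemonics and a hypothetical IR tuple shape, not
the actual `SimpleSelection` code from `src/backend/gen_insn_selection.*pp`:

```python
# Toy one-to-many selection: since immediate sources are not supported,
# an IR "add dst, src, imm" expands into a MOV of the immediate into a
# fresh virtual register followed by a register-register ADD.
def select(ir_insn, new_vreg):
    dst, op, srcs = ir_insn
    out = []
    regs = []
    for s in srcs:
        if isinstance(s, (int, float)):     # immediate source
            tmp = new_vreg()
            out.append(("MOV", tmp, s))     # materialize it in a register
            regs.append(tmp)
        else:
            regs.append(s)
    out.append((op.upper(), dst, *regs))
    return out

counter = iter(range(100, 200))
new_vreg = lambda: f"v{next(counter)}"
print(select(("r2", "add", ["r0", 1.0]), new_vreg))
# one IR instruction became two selection instructions
```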
+
+Register allocation
+-------------------
+
+The register allocation actually consists in two steps:
+
+1. Handling the vector for all the instructions that require them
+
+2. Performing the register allocation itself
+
+Step 1 consists in scanning all the vectors required by send instructions.
+Obviously, the same register may be used in different vectors and that may lead
+to interferences. We simply sort the vectors from the largest to the smallest
+and allocate them in that order. As an optimization we also identify
+sub-vectors, i.e. vectors included in larger ones, and do not allocate them.
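
The sort-and-subsume step above can be sketched like this. This is a
hypothetical helper for illustration only, not the actual
`GenRegAllocator::allocateVector` code:

```python
# Sort send vectors from largest to smallest; a vector that appears as a
# contiguous sub-sequence of an already-allocated one gets no allocation
# of its own, it simply aliases into the larger vector.
def plan_vectors(vectors):
    allocated = []   # vectors that get their own contiguous allocation
    aliased = []     # sub-vectors folded into an already allocated one
    for v in sorted(vectors, key=len, reverse=True):
        def is_sub(big):
            n = len(v)
            return any(big[i:i + n] == v for i in range(len(big) - n + 1))
        if any(is_sub(big) for big in allocated):
            aliased.append(v)
        else:
            allocated.append(v)
    return allocated, aliased

alloc, alias = plan_vectors([["r1", "r2"], ["r0", "r1", "r2", "r3"], ["r4", "r5"]])
print(alloc)  # the two vectors that need their own storage
print(alias)  # ["r1", "r2"] is contained in the larger vector
```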
+
+The code could be largely improved, in particular if we took liveness
+interferences into account as well. Basically, a register may be part of
+several vectors if the registers that are not in both vectors at the same
+location are not alive at the same time.
+
+This is still a work in progress. Code is right now handled by method
+`GenRegAllocator::allocateVector`.
+
+Step 2 performs the register allocation i.e. it associates each virtual register
+to one (or several) physical registers. The first thing is that the Gen register
+file is very flexible i.e. it can (almost) be freely partitioned. To handle this
+peculiarity, we simply implemented a free list based generic memory allocator as
+done with `RegisterFilePartitioner` in `src/backend/context.cpp`.
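
The free-list idea can be sketched in a few lines. This is a simplified toy
(first-fit, no coalescing of adjacent free blocks, no alignment); the real
`RegisterFilePartitioner` in `src/backend/context.cpp` is more involved:

```python
# A toy free-list allocator over a flat register file of `size` bytes.
# Allocation takes the first free block that fits (first-fit) and splits
# off the remainder; freeing returns the block to the front of the list
# so freed space is reused first. Adjacent free blocks are not coalesced.
class FreeListAllocator:
    def __init__(self, size):
        self.free = [(0, size)]   # list of (offset, size) free blocks
        self.used = {}            # offset -> allocated size

    def alloc(self, size):
        for i, (off, sz) in enumerate(self.free):
            if sz >= size:        # first-fit
                rest = (off + size, sz - size)
                self.free[i:i + 1] = [rest] if rest[1] else []
                self.used[off] = size
                return off
        return None               # out of register space (would spill)

    def dealloc(self, off):
        self.free.insert(0, (off, self.used.pop(off)))

rf = FreeListAllocator(4096)      # e.g. a 4KB register file
a = rf.alloc(64)
b = rf.alloc(32)
rf.dealloc(a)
c = rf.alloc(16)                  # reuses the block just freed
print(a, b, c)
```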
+
+We then simply implemented a linear scan allocator (see
+`gen_reg_allocation.cpp`). The spilling is not implemented and is still a work
+in progress. The thing is that spilling must be specifically handled with Gen.
+Indeed:
+
+1. Bad point. Spilling is expensive and requires assembling messages for it.
+
+2. Good point. Gen is able to spill up to 256 _contiguous_ bytes in one message.
+This must be used for high performance spilling and it may require properly
+reordering the registers to spill.
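
For reference, the classic linear scan idea the allocator follows can be
sketched like this. This is a textbook illustration (after Poletto and
Sarkar), not the actual `gen_reg_allocation.cpp` code, and it deliberately
leaves out spilling, mirroring the current state of the backend:

```python
# Textbook linear scan: walk live intervals sorted by start point, expire
# intervals that have ended (returning their physical register to the
# free pool), and grab any free register for the new interval.
def linear_scan(intervals, num_regs):
    """intervals: {vreg: (start, end)}; returns {vreg: phys_reg}."""
    free = list(range(num_regs))
    active = []                      # (end, vreg) of currently live intervals
    assignment = {}
    for vreg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # expire intervals that ended before this one starts
        for e, v in list(active):
            if e < start:
                active.remove((e, v))
                free.append(assignment[v])
        if not free:
            raise NotImplementedError("spilling not implemented")
        assignment[vreg] = free.pop(0)
        active.append((end, vreg))
    return assignment

print(linear_scan({"a": (0, 3), "b": (1, 2), "c": (4, 6)}, 2))
```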
+
+Instruction scheduling
+----------------------
+
+Intra-basic block instruction scheduling is relatively simple. It is not
+implemented yet.
+
+Instruction encoding
+--------------------
+
+This is mostly done in `src/backend/gen_context.cpp` and
+`src/backend/gen_encoder.*pp`. This is mostly glue code and it is pretty
+straightforward. We just forward the selection code using the physically
+allocated registers. There is nothing special here. Just boilerplate.
+
diff --git a/docs/Beignet/Backend/flat_address_space.mdwn b/docs/Beignet/Backend/flat_address_space.mdwn
new file mode 100644
index 00000000..3018a295
--- /dev/null
+++ b/docs/Beignet/Backend/flat_address_space.mdwn
@@ -0,0 +1,98 @@
+Flat Address Space
+==================
+
+Segmented address space...
+--------------------------
+
+The first challenge with OpenCL is its very liberal use of pointers. The memory
+is segmented into several address spaces:
+
+- private. This is the memory for each work item
+
+- global. These are buffers in memory shared by all work items and work groups
+
+- constant. These are constant buffers in memory shared by all work items and
+work groups as well
+
+- local. This is memory shared by all work items in the *same* work group
+
+... But with no restriction inside each address space
+-----------------------------------------------------
+
+The challenge is that there is no restriction in OpenCL inside each address
+space, i.e. the full C semantics apply, in particular regarding pointer
+arithmetic.
+
+Therefore the following code is valid:
+
+<code>
+\_\_kernel void example(\_\_global int *dst, \_\_global int *src0, \_\_global int *src1)<br/>
+{<br/>
+&nbsp;&nbsp;\_\_global int *from;<br/>
+&nbsp;&nbsp;if (get\_global\_id(0) % 2)<br/>
+&nbsp;&nbsp;&nbsp;&nbsp;from = src0;<br/>
+&nbsp;&nbsp;else<br/>
+&nbsp;&nbsp;&nbsp;&nbsp;from = src1;<br/>
+&nbsp;&nbsp;dst[get\_global\_id(0)] = from[get\_global\_id(0)];<br/>
+}
+</code>
+
+As one may see, the load done in the last line actually mixes pointers from
+both sources src0 and src1. This typically makes the use of binding table
+indices pretty hard. If we use binding table 0 for dst, 1 for src0 and 2 for
+src1 (for example), we are not able to express the load in the last line with
+one send only.
+
+No support for stateless in required messages
+---------------------------------------------
+
+Furthermore, on IVB, we are going to use four types of messages to implement
+the loads and the stores:
+
+- Byte scattered reads. They are used to read bytes/shorts/integers that are not
+aligned on 4 bytes. This is a gather message i.e. the user provides up to 16
+addresses
+
+- Byte scattered writes. They are used to write bytes/shorts/integers that are not
+aligned on 4 bytes. This is a scatter message i.e. the user provides up to 16
+addresses
+
+- Untyped reads. They allow reading from 1 to 4 double words (i.e. 4 bytes
+each) per lane. This is also a gather message, i.e. up to 16 addresses are
+provided per message.
+
+- Untyped writes. They are the counterpart of the untyped reads
+
+The problem is that IVB does not support stateless accesses for these messages,
+so surfaces are required. Secondly, stateless messages are not that interesting
+since all of them require a header, which is still slow to assemble.
+
+Implemented solution
+--------------------
+
+The solution is actually quite simple. Even with no stateless support, it is
+actually possible to simulate it with a surface. As one may see in the run-time
+code in `intel/intel_gpgpu.c`, we simply create a surface:
+
+- 2GB big
+
+- Which starts at offset 0
+
+Surprisingly, this surface can actually map the complete GTT address space which
+is 2GB big. One may look at `flat_address_space` unit test in the run-time code
+that creates and copies buffers in such a way that the complete GTT address
+space is traversed.
+
+This solution leads to a pretty simple implementation on the compiler side.
+Basically, there is nothing to do when translating from LLVM to Gen ISA. A
+pointer to `__global` or `__constant` memory is simply a 32-bit offset in that
+surface.
+
+Related problems
+----------------
+
+There is one drawback for this approach. Since we use a 2GB surface that maps
+the complete GTT space, there is no protection at all. Each write can therefore
+potentially modify any buffer including the command buffer, the frame buffer or
+the kernel code. There is *no* protection at all in the hardware to prevent
+that.
diff --git a/docs/Beignet/Backend/gen_ir.mdwn b/docs/Beignet/Backend/gen_ir.mdwn
new file mode 100644
index 00000000..ae24729c
--- /dev/null
+++ b/docs/Beignet/Backend/gen_ir.mdwn
@@ -0,0 +1,254 @@
+Scalar Intermediate Representation
+==================================
+
+The IR code is included in `src/ir/` of the compiler code base.
+The IR as designed in this compiler is the fruit of a long reflection I mostly
+had with Thomas Raoux. Note I usually call it "Gen IR".
+
+Scalar vs vector IR
+-------------------
+
+This is actually the major question: do we need a vector IR or a scalar IR? On
+the LLVM side, we have both. LLVM IR can manipulate vectors and scalars (and
+even generalized values, but we can ignore them for now).
+
+For that reason, the Clang front-end generates both scalar and vector code.
+Typically, a `uint4` variable will output a vector of 4 integers. Arithmetic
+computations will be directly done on vector variables.
+
+On the HW side, the situation is completely different:
+
+- We are going to use the parallel mode (align1) i.e. the struct-of-array mode
+ for the EU. This is a SIMD scalar mode.
+
+- The only source of vectors we are going to have is the send instructions
+  (and marginally some other instructions like the div_rem math instruction)
+
+One may therefore argue that we need vector instructions to handle the sends.
+Send will indeed require both vector destinations and sources. This may be a
+strong argument *for* vectors in the IR. However, the situation is not that
+good.
+
+Indeed, if we look carefully at the send instructions we see that they will
+require vectors that are *not* vectors in LLVM IR. This code for example:
+
+<code>
+__global uint4 *dst;<br/>
+uint4 x = ...;<br/>
+dst[get\_global\_id(0)] = x;<br/>
+</code>
+
+will be translated into an untyped write in the Gen ISA. Unfortunately, the
+address and the values to write are in the *same* vector. However, LLVM IR will
+output a store like:
+
+`store(%addr, %value)`
+
+which basically uses one scalar (the address) and one value (the vector to
+write). Therefore even if we handle vectors in the IR, that will not directly
+solve the problem we have at the end for the send instructions.
+
+We therefore decided to go the other direction:
+
+- We have a purely scalar IR
+
+- To replace vectors, we simply use multiple sources and destinations
+
+- Real vectors required by send instructions are handled at the very bottom of
+the stack in the register allocation passes.
+
+This leads to a very simple intermediate representation which is mostly a pure
+scalar RISC machine.
+
+Very limited IR
+---------------
+
+The other major question, in particular when you look similar stacks like NVidia
+PTX, is:
+
+do we need to encode in the IR register modifiers (abs, negate...) and immediate
+registers (like in add.f x y 1.0)?
+
+Contrary to other IRs (PTX and even LLVM, which both support immediates), we
+chose to have a very simple IR, much simpler than the final ISA, and to merge
+back what we need at the instruction selection pass. Since we need instruction
+selection, let us keep the IR simple.
+
+Also, there are a lot of major issues that cannot be covered in the IR and
+need to be specifically handled at the very end of the code generation:
+
+- send vectors (see previous section)
+
+- send headers (value and register allocation) which are also part of the vector
+problem
+
+- SIMD8 mode in SIMD16 code. Some send messages do not support SIMD16 encoding
+and require SIMD8. Typical examples are typed writes, i.e. scatters to textures.
+There is no clean way to encode this in a regular scalar IR.
+
+For these reasons, most of the problems directly related to Gen naturally find
+their solutions in either the instruction selection or the register allocator.
+
+This leads to the following strategy:
+
+- Keep the IR very simple and limited
+
+- Use all the analysis tools you need in the IR before the final code generation
+to build any information you need. This is pure "book-keeping".
+
+- Use any previous analysis and finish the job at the very end
+
+This classical approach limits the complexity in the IR while forcing us to
+write the proper tools in the final stages.
+
+Why not using LLVM IR directly?
+-------------------------------
+
+We hesitated a long time between writing a dedicated IR (as we did) and just
+using LLVM IR. Indeed, LLVM comes with a large set of tools that are part of
+"LLVM backends". LLVM provides a lot of tools to perform the instruction
+selection (`SelectionDAG`) and the register allocation. Two things however
+prevented us from choosing this path:
+
+- We only have a limited experience with LLVM and no experience at all with the
+LLVM backends
+
+- LLVM register allocators do not handle at all the peculiarities of Gen:
+
+ * flexible register file. Gen registers are more like memory than registers
+ and can be freely allocated and aliased. LLVM register allocators only
+ support partial aliasing like x86 machines do (rax -> eax -> ax)
+
+ * no proper tools to handle vectors in the register allocator as we need for
+ sends
+
+Since we will need to do some significant work anyway, this led us to choose a
+more hard-coded path with an in-house IR. Note that this will not prevent us
+from later implementing an LLVM backend "by the book", as Nvidia does today
+with PTX (using an LLVM backend to do the LLVM IR -> PTX conversion).
+
+
+SSA or no SSA
+-------------
+
+Since we have a purely scalar IR, implementing an SSA transformation on the IR
+may be convenient. However, most of the literature about compiler back-ends
+uses non-SSA representations of the code. Since the primary goal is to write a
+compiler _back-end_ (instruction selection, register allocation and instruction
+scheduling), we keep the code in non-SSA form, leaving the higher level
+optimizations to LLVM.
+
+Types, registers, instructions, functions and units
+---------------------------------------------------
+
+The IR is organized as follows:
+
+- Types (defined in `src/ir/type.*pp`). These are scalar types only. Since the
+ code is completely lowered down, there is no more reference to structures,
+ pointers or vectors. Everything is scalar values and when "vectors" or
+ "structures" would be needed, we use instead multiple scalar sources or
+ destinations.
+
+- Registers (defined in `src/ir/register.*pp`). They are untyped (since Gen
+  registers are untyped) and we have 65,535 of them per function
+
+- Instructions (defined in `src/ir/instruction.*pp`). They are typed (to
+ distinguish integer and FP adds for example) and possibly support multiple
+ destinations and sources. We also provide a convenient framework to introspect
+ the instruction in a simple (and memory efficient) way
+
+- Functions (defined in `src/ir/function.*pp`). They are basically the
+  counterpart of LLVM functions or OpenCL kernels. Note that function arguments
+  are a problem. We actually use the PTX ABI. Everything smaller than the
+  machine word size (i.e. 32 bits for Gen) is passed by value in a register.
+  Everything bigger than that is passed by pointer with a ByVal attribute.
+  Note that this requires some special treatment in the IR (see below) to make
+  the code faster by replacing function argument loads by "pushed constants".
+  We also define one "register file" per function, i.e. the registers are
+  defined relative to the function that uses them. Each function is made of
+  basic blocks, i.e. sequences of instructions that are executed linearly.
+
+- Units (defined in `src/ir/unit.*pp`). Units are just a collection of
+ functions and constants (not supported yet).
+
+Function arguments and pushed constants
+---------------------------------------
+
+Gen can push values into the register file i.e. some registers are preset when
+the kernel starts to run. As detailed previously, the PTX ABI is convenient
+since every argument is either one register or one pointer to load from or to
+store to.
+
+However, when a pointer is used for an argument, loads are issued which may be
+avoided by using constant pushes.
+
+Once again OCL makes the task a bit harder than expected. Indeed, the C
+semantics once again apply to function arguments as well.
+
+Look at these three examples:
+
+### Case 1. Direct loads -> constant push can be used
+
+<code>
+struct foo { int x; int y; }; <br/>
+\_\_kernel void case1(\_\_global int *dst, struct foo bar) <br/>
+{<br/>
+&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x + bar.y;<br/>
+}
+</code>
+
+We use a _direct_ _load_ for `bar` with `bar.x` and `bar.y`. Values can be
+pushed into registers and we can replace the loads by register reads.
+
+### Case 2. Indirect loads -> we need to load the values from memory
+
+<code>
+struct foo { int x[16]; }; <br/>
+\_\_kernel void case2(\_\_global int *dst, struct foo bar) <br/>
+{<br/>
+&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
+}
+</code>
+
+We use an indirect load with `bar.x[get\_local\_id(0)]`. Here we need to issue a
+load from memory (well, actually, we could do a gather from registers, but it is
+not supported yet).
+
+### Case 3. Writes to arguments -> we need to spill the values to memory first
+
+<code>
+struct foo { int x[16]; }; <br/>
+\_\_kernel void case3(\_\_global int *dst, struct foo bar) <br/>
+{<br/>
+&nbsp;&nbsp;bar.x[0] = get\_global\_id(1);<br/>
+&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
+}
+</code>
+
+Here the values are written before being read. This causes some trouble since
+we are running in SIMD mode. Indeed, we only have *one* instance of the
+function arguments in memory, while *many* SIMD lanes and actually *many*
+hardware threads are running at the same time. This means that the lanes cannot
+all write their data to that single location; we need to allocate a private
+area for each SIMD lane.
+
+In that case, we need to spill back the function arguments into memory. We spill
+once per SIMD lane. Then, we read from this private area rather than the
+function arguments directly.
+
+This analysis is partially done today in `src/ir/lowering.*pp`. We identify all
+the cases but only the case with constant pushing is fully implemented.
+Actually, the last two cases are easy to implement but this requires one or two
+days of work.
+
+Value and liveness analysis tools
+---------------------------------
+
+You may also notice that we provide a complete framework for value analysis
+(i.e. to figure out where a value or instruction destination is used and where
+the instruction sources come from). The code is in `src/ir/value.*pp`. Today,
+this code burns a crazy amount of memory (std::set is used all over the place)
+but it at least provides the analysis required by many other passes. Compacting
+the data structures and replacing the O(log(n)) containers with faster
+alternatives are definitely on the TODO list :-)
+
+Finally, we also provide a liveness analysis tool which simply figures out which
+registers are alive at the end of each block (classically "live out" sets).
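As a rough sketch of what such a liveness pass computes (a hypothetical, self-contained toy: register sets are bitmasks, the CFG is a 3-block chain, and none of the names below come from the Beignet code base), the classic backward data-flow iteration to a fixed point looks like this:

```c
#include <assert.h>

enum { NBLOCKS = 3 };

/* Per-block use/def sets and successors for a toy chain 0 -> 1 -> 2;
 * registers r0..r2 are bits 0..2 of a mask. */
static unsigned use_[NBLOCKS] = { 0x1 /* r0 */, 0x2 /* r1 */, 0x3 /* r0,r1 */ };
static unsigned def_[NBLOCKS] = { 0x2 /* r1 */, 0x4 /* r2 */, 0x0 };
static int succ[NBLOCKS][2]   = { {1, -1}, {2, -1}, {-1, -1} };

/* Classic backward data-flow equations:
 *   live_out[b] = union of live_in[s] over all successors s
 *   live_in[b]  = use[b] | (live_out[b] & ~def[b])
 * iterated until nothing changes. */
void compute_liveness(unsigned live_in[NBLOCKS], unsigned live_out[NBLOCKS])
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = NBLOCKS - 1; b >= 0; --b) {
            unsigned out = 0;
            for (int s = 0; s < 2; ++s)
                if (succ[b][s] >= 0)
                    out |= live_in[succ[b][s]];
            unsigned in = use_[b] | (out & ~def_[b]);
            if (in != live_in[b] || out != live_out[b]) {
                live_in[b]  = in;
                live_out[b] = out;
                changed = 1;
            }
        }
    }
}
```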
diff --git a/docs/Beignet/Backend/unstructured_branches.mdwn b/docs/Beignet/Backend/unstructured_branches.mdwn
new file mode 100644
index 00000000..37a294c5
--- /dev/null
+++ b/docs/Beignet/Backend/unstructured_branches.mdwn
@@ -0,0 +1,271 @@
+Unstructured Branches
+=====================
+
+A major challenge in writing an OpenCL compiler is certainly to handle any kind
+of branch. Indeed, LLVM does not make any distinction between structured and
+unstructured branches. See [here](http://llvm.org/docs/LangRef.html) for a
+complete description of the LLVM assembly specification.
+
+The C branching code is simply lowered down to the following instructions:
+
+- `ret` to return from the current function
+- `br` which, when conditional, jumps to one of two destinations (one for the
+ taken branch and one for the fall-through), and otherwise to a single one
+- `switch` that implements the C switch/case construct
+- `indirectbr` that implements a jump table
+- `invoke` and `resume`, mostly used to handle exceptions
+
+Exceptions and jump tables are not supported in OpenCL. Switch cases can be
+lowered to a sequence of if/else statements (using a divide and conquer
+approach, a switch/case can be dispatched in O(log(n)) comparisons where n is
+the number of targets).
+This leads us to properly implement `br` and `ret` instructions.
+
+Solution 1 - Using Gen structured branches
+------------------------------------------
+
+Gen structured branches are the following instructions:
+
+`if` `else` `endif` `break` `continue` `while` `brd` `brc`
+
+Transforming the LLVM IR code into structured code basically amounts to
+reverse-engineering the LLVM code back into the original C code.
+Unfortunately, there are several key problems:
+
+- OpenCL supports the `goto` keyword which may jump to an arbitrary location
+- LLVM can transform the control flow graph into almost any form
+- Worse, a reducible control flow graph can be turned into an irreducible
+one by the optimizer
+
+This can lead to complicated code transforms and basic block duplication. The
+specification allows the compiler to abort if an irreducible control flow is
+detected, but as an implementor, it is quite awkward to abort the compilation
+because the optimizer turned a reducible CFG into an irreducible one. Using
+structured branches opens the door to many corner cases.
+
+However, there exists a pretty elegant solution that Gen can support almost
+seamlessly. This is the solution we retained.
+
+Solution 2 - Linearizing the control flow graph
+-----------------------------------------------
+
+The general problem is to map a general control flow graph to a SIMD machine.
+This problem is fairly well understood today. A recent research paper dedicated
+to OpenCL-like languages, which use the "SPMD" (single program multiple data)
+programming model, presents interesting insights about how to map such
+languages to SIMD architectures (see [here]
+(http://www.cdl.uni-saarland.de/papers/karrenberg_opencl.pdf)).
+
+### Core idea
+
+- Linearizing the CFG initially consists in removing all forward branches and
+"replacing" them by predication. Indeed, the program will still be correct if
+you predicate instructions instead of taking forward jumps. This is basically a
+control flow to data flow conversion.
+
+- Of course, removing all forward branches is inefficient. To improve that, we
+simply introduce "if conditions" at the head of basic blocks to know if we run
+the basic block. If no lane is going to be activated in the basic block, we
+jump to another basic block where _potentially_ some lanes are going to be
+reactivated.
+
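As a plain C sketch of this control flow to data flow conversion (hypothetical illustration only; the lane loop, `WIDTH` and `predicated_select` are made-up names, not Beignet code), an if/else turns into two assignments, each masked by a per-lane predicate:

```c
#include <assert.h>

#define WIDTH 8

/* Emulates "if (x > 0) y = a; else y = b;" on a SIMD machine:
 * both sides are executed, and each assignment is masked by the
 * per-lane predicate instead of being reached through a branch. */
void predicated_select(int y[WIDTH], const int x[WIDTH],
                       const int a[WIDTH], const int b[WIDTH])
{
    for (int lane = 0; lane < WIDTH; ++lane) {
        int pred = x[lane] > 0;          /* per-lane flag */
        if (pred)  y[lane] = a[lane];    /* "then" block, predicated on pred */
        if (!pred) y[lane] = b[lane];    /* "else" block, predicated on !pred */
    }
}
```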
+Consider the following CFG:
+
+<pre>
+o-------o
+| |
+| 1 |---->-----o
+| | |
+o-------o |
+ | |
+ | |
+o-------o |
+| | |
+| 2 |---->-----------o
+| | | |
+o-------o | |
+ | | |
+ | | |
+ | o------o | |
+ | | | | |
+ | v | | |
+o-------o | | |
+| | | | |
+| 3 | | | |
+| | | | |
+o-------o | | |
+ | | | | |
+ | o------o | |
+ | | |
+o-------o | |
+| | | |
+| 4 |<---------o |
+| | |
+o-------o |
+ | |
+ | |
+o-------o |
+| | |
+| 5 |<----------------o
+| |
+o-------o
+</pre>
+
+Mapping it to a SIMD machine may seem challenging. Actually it is not too
+complicated. The problem is with the 2->5 jump. Indeed, we have to be sure that
+we are not missing any computation done in block 4.
+
+To do so:
+
+- Instead of jumping from block 2 to block 5, we jump from block 2 to block 4.
+- We implement a `JOIN` point on top of block 4. We check if any lane is going
+to be reactivated for block 4. If not, we jump to block 5.
+
+This leads to the following linearized CFG:
+<pre>
+o-------o
+| |
+| 1 |---->-----o
+| | |
+o-------o |
+ | |
+ | |
+o-------o |
+| | |
+| 2 |---->-----------o
+| | | |
+o-------o | |
+ | | |
+ | | |
+ | o--<---o | |
+ | | | | |
+ | v | | |
+o-------o | | |
+| | | | |
+| 3 | ^ | |
+| | | | |
+o-------o | | |
+ | | | | |
+ | o-->---o | |
+ | | |
+o-------o | |
+| |==========|=====|====O
+| 4 |<---------|-----o |
+| |<---------o |
+o-------o |
+ | |
+ | |
+o-------o |
+| | |
+| 5 |<====================O
+| |
+o-------o
+</pre>
+
+There is a new jump from block 4 to block 5.
+
+### Implementation on Gen
+
+When using structured branches, Gen supports auto-masking i.e. based on the
+branches which are taken, the control flow is properly handled and masks are
+automatically applied on all instructions.
+
+However, there is no similar support for unstructured branches. We therefore
+decided to mask instructions manually and use a single program flow. This is
+actually quite easy to do since Gen is able to predicate any branch.
+
+Now, how to evaluate the if conditions in an efficient way?
+
+The choice we made is to use *per-lane block IPs*: for each SIMD lane, we store
+a short (16 bits) in a regular 256-bit GPR (general purpose register). This
+"blockIP" register is used in the following way:
+
+At the beginning of each block, we compare the blockIP register with the ID of
+the block. The lane is going to be _activated_ if its blockIP is smaller than
+or equal to the ID of the block. Otherwise, the lane is deactivated.
+
+Therefore, we build a flag register at the entry of each basic block with a
+single 16-wide uint16_t compare. If no lane is activated, a jump is performed
+to the next block where some lanes are going to be activated.
+
+Since these are regular jumps, we just use the `jmpi` instruction. With the
+help of predication, we can express all the different possibilities:
+
+- backward branches are always taken if _any_ of the lanes in the predicate is
+true. We just use `<+f0.0.anyh>` predication.
+- forward branches are *not* taken if some of the lanes are going to be
+activated in the next block. We therefore compare the blockIP with the ID of
+the _next_ block. If all of them are strictly greater than the ID of the next
+block, we jump. We therefore use the `<+f0.0.allh>` predicate in that case.
+- `JOIN` points are even simpler. We simply jump if none of the lanes is
+activated. We therefore use the `<-f0.0.anyh>` predicate.
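The three predicates can be sketched with a scalar emulation of the blockIP register (hypothetical code, not the Gen encoding; it assumes the convention that a lane is active in block `id` when its blockIP is at most `id`, which is what the strictly-greater test for forward branches relies on):

```c
#include <assert.h>
#include <stdint.h>

#define WIDTH 16

/* A lane is active in block `id` when its blockIP <= id. */
static int any_active(const uint16_t ip[WIDTH], uint16_t id)
{
    for (int lane = 0; lane < WIDTH; ++lane)
        if (ip[lane] <= id)
            return 1;
    return 0;
}

/* Backward branch: taken if any lane is active -> <+f0.0.anyh>. */
static int backward_taken(const uint16_t ip[WIDTH], uint16_t target_id)
{
    return any_active(ip, target_id);
}

/* Forward branch: taken only when all lanes are strictly past the next
 * block, i.e. no lane would be active there -> <+f0.0.allh>. */
static int forward_taken(const uint16_t ip[WIDTH], uint16_t next_id)
{
    return !any_active(ip, next_id);
}

/* JOIN point: jump when none of the lanes is active -> <-f0.0.anyh>. */
static int join_jump(const uint16_t ip[WIDTH], uint16_t id)
{
    return !any_active(ip, id);
}
```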
+
+The complete encoding is done in `src/backend/gen_insn_selection.cpp`. Forward
+branches are handled by `SimpleSelection::emitForwardBranch`. Backward branches
+are handled by `SimpleSelection::emitBackwardBranch`. Finally, since `JOIN` points
+are at the top of each basic block, they are handled by
+`SimpleSelection::emitLabelInstruction`.
+
+### Computing `JOIN` points
+
+The last problem is to compute the `JOIN` points i.e. we need to know, at the
+beginning of each block, if we need to jump and, if we do, what the target of
+the branch is. The code is relatively straightforward and can be found in
+`src/backend/context.cpp`, in `Context::buildJIPs`.
+
+Actually, the current implementation is not that elegant. A colleague, Thomas
+Raoux, has a simpler and better idea to handle it.
+
+### Advantages and drawbacks of the method
+
+- The method has one decisive advantage: it is simple and extremely robust. It
+can handle any kind of CFG (reducible or not) and does not require any
+transformation. The use of shorts is also not random: a 16-wide compare on
+shorts issues in 2 cycles (so it is twice as fast as a 16-wide 32-bit compare).
+- The main drawback is performance. Even if it is not so bad, we still need
+more instructions than if we used structured branches. Mostly:
+ * one or two instructions for `JOIN` points
+ * three instructions for backward and forward jumps (two more than structured
+ branches, which just require the branch instruction itself)
+
+Note that all extra instructions are 16-bit instructions (i.e. they operate on
+shorts) so they will only cost 2 cycles anyway.
+
+The last point is that the Gen encoding requires conditional modifiers and
+predicates to use the same flag register within an instruction. This requires
+copying or recomputing the flag register for compares and selects, so one more
+instruction is required for each of them. Once again, this only costs 2 cycles.
+
+Remarks on `ret` instructions
+-----------------------------
+
+Since we can handle any kind of CFG, handling the return statements is
+relatively straightforward. We first create one return block at the end of the
+program. Then we replace all other returns by an unconditional jump to this
+block. The CFG linearization takes care of the rest. We then simply encode the
+single remaining return instruction as an End-Of-Thread message (EOT).
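This return-lowering pass can be sketched as follows (a hypothetical, simplified instruction representation, not Beignet's IR; `OP_RET`, `OP_BR` and `lower_returns` are made-up names): every `RET` is rewritten into an unconditional branch to the single exit block.

```c
#include <assert.h>

/* Toy instruction stream: the pass rewrites every RET into an
 * unconditional BR to the unique exit block (which ends with EOT). */
enum opcode { OP_ADD, OP_RET, OP_BR };
struct insn { enum opcode op; int target; };

static int lower_returns(struct insn *prog, int n, int exit_block)
{
    int rewritten = 0;
    for (int i = 0; i < n; ++i) {
        if (prog[i].op == OP_RET) {
            prog[i].op = OP_BR;          /* jump instead of returning */
            prog[i].target = exit_block;
            ++rewritten;
        }
    }
    return rewritten;   /* number of returns that were lowered */
}
```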
+
+Code examples
+-------------
+
+Some tests were written to check the correctness of the CFG linearization and
+the code generation. They can be found in the _run-time_ code base here:
+
+- `utest/compiler_if_else.cpp`
+- `utest/compiler_lower_return0.cpp`
+- `utest/compiler_lower_return1.cpp`
+- `utest/compiler_lower_return2.cpp`
+- `utest/compiler_short_scatter.cpp`
+- `utest/compiler_unstructured_branch0.cpp`
+- `utest/compiler_unstructured_branch1.cpp`
+- `utest/compiler_unstructured_branch2.cpp`
+- `utest/compiler_unstructured_branch3.cpp`
+