summaryrefslogtreecommitdiff
path: root/docs/Beignet/Backend.mdwn
blob: 583e5d2d2502f892066009a27ba1cb063ed820e4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
Beignet Compiler
================

This code base contains the compiler part of the Beignet OpenCL stack. The
compiler is responsible to take a OpenCL language string and to compile it into
a binary that can be executed on Intel integrated GPUs.

Status
------

After two years development, beignet is mature now. It now supports all the
OpenCL 1.2 mandatory features. Beignet get almost 100% pass rate with both
OpenCV 3.0 test suite and the piglit opencl test suite. There are some 
performance tuning related items remained, see [[here|Backend/TODO]] for a
(incomplete) lists of things to do.

Interface with the run-time
---------------------------

Even if the compiler makes a very liberal use of C++ (templates, variadic
templates, macros), we really tried hard to make a very simple interface with
the run-time. The interface is therefore a pure C99 interface and it is defined
in `src/backend/program.h`.

The goal is to hide the complexity of the inner data structures and to enable
simple run-time implementation using straightforward C99.

Note that the data structures are fully opaque: this allows us to use both the
C++ simulator or the real Gen program in a relatively non-intrusive way.

Various environment variables
-----------------------------

Environment variables are used all over the code. Most important ones are:

- `OCL_STRICT_CONFORMANCE` `(0 or 1)`. Gen does not provide native high
  precision math instructions compliant with OpenCL Spec. So we provide a
  software version to meet the high precision requirement. Obviously the
  software version's performance is not as good as native version supported by
  GEN hardware. What's more, most graphics application don't need this high
  precision, so we choose 0 as the default value. So OpenCL apps do not suffer
  the performance penalty for using high precision math functions.

- `OCL_SIMD_WIDTH` `(8 or 16)`. Select the number of lanes per hardware thread,
  Normally, you don't need to set it, we will select suitable simd width for
  a given kernel. Default value is 16.

- `OCL_OUTPUT_KENERL_SOURCE` `(0 or 1)`. Output the building or compiling kernel's
  source code.

- `OCL_OUTPUT_GEN_IR` `(0 or 1)`. Output Gen IR (scalar intermediate
  representation) code

- `OCL_OUTPUT_LLVM_BEFORE_LINK` `(0 or 1)`. Output LLVM code before llvm link

- `OCL_OUTPUT_LLVM_AFTER_LINK` `(0 or 1)`. Output LLVM code after llvm link

- `OCL_OUTPUT_LLVM_AFTER_GEN` `(0 or 1)`. Output LLVM code after the lowering
  passes, Gen IR is generated based on it.

- `OCL_OUTPUT_ASM` `(0 or 1)`. Output Gen ISA

- `OCL_OUTPUT_REG_ALLOC` `(0 or 1)`. Output Gen register allocations, including
  virtual register to physical register mapping, live ranges.

- `OCL_OUTPUT_BUILD_LOG` `(0 or 1)`. Output error messages if there is any
  during CL kernel compiling and linking.

- `OCL_OUTPUT_CFG` `(0 or 1)`. Output control flow graph in .dot file.

- `OCL_OUTPUT_CFG_ONLY` `(0 or 1)`. Output control flow graph in .dot file,
  but without instructions in each BasicBlock.

- `OCL_PRE_ALLOC_INSN_SCHEDULE` `(0 or 1)`. The instruction scheduler in
  beignet are currently splitted into two passes: before and after register
  allocation. The pre-alloc scheduler tend to decrease register pressure.
  This variable is used to disable/enable pre-alloc scheduler. This pass is
  disabled now for some bugs.

- `OCL_POST_ALLOC_INSN_SCHEDULE` `(0 or 1)`. Disable/enable post-alloc
  instruction scheduler. The post-alloc scheduler tend to reduce instruction
  latency. By default, this is enabled now.

- `OCL_SIMD16_SPILL_THRESHOLD` `(0 to 256)`. Tune how much registers can be
  spilled under SIMD16. Default value is 16. We find spill too much register
  under SIMD16 is not as good as fall back to SIMD8 mode. So we set the
  variable to control spilled register number under SIMD16.

- `OCL_USE_PCH` `(0 or 1)`. The default value is 1. If it is enabled, we use
  a pre compiled header file which include all basic ocl headers. This would
  reduce the compile time.

Implementation details
----------------------

Several key decisions may use the hardware in an usual way. See the following
documents for the technical details about the compiler implementation:

- [[Mixed buffer pointer)|mixed_buffer_pointer]] 
- [[Unstructured branches|unstructured_branches]]
- [[Scalar intermediate representation|gen_ir]]
- [[Clean backend implementation|compiler_backend]]

Ben Segovia.