path: root/docs/Beignet/Backend/compiler_backend.mdwn
Compiler Back End
=================

Strictly speaking, the complete code base is a compiler backend for LLVM. In
this document, however, we focus on the final code generation passes that you
may find in `src/backend`.

As explained in [[the scalar IR presentation|gen_ir]], we bet on a very
simple scalar IR to keep it easy to parse and modify. The idea is to handle
the Gen-specific problems where they can best be solved, i.e. when the final
code is generated.

The code generation in the compiler backend is classically divided into four
steps:

- Instruction selection (defined in `src/backend/gen_insn_selection.*pp`). We
  expose an interface for the instruction selection engine. We implemented a
  very simple selection (called `SimpleSelection`) that does a quick and dirty
  one-to-many instruction generation.

- Register allocation (defined in `src/backend/gen_reg_allocation.*pp`). The
  code implements a linear scan allocator on the code selected in the previous
  pass. See below for more details about register vector allocations.

- Instruction scheduling. This one is essentially not done yet: we output
  instructions in program order (see below for the current state). Note that we
  plan to implement adaptive scheduling between register allocation and
  instruction selection (to avoid spilling as much as possible).

- Instruction encoding. This is the final step that encodes the program into Gen
  ISA.

Instruction selection
---------------------

Usually, instruction selection consists of mapping `p` IR instructions to `q`
ISA instructions under a cost-driven model. Each basic block is therefore _tiled_
into some number of groups of ISA instructions such that the final cost is
minimized.

The literature is particularly dense on the subject. Today, compilers usually
use either tree pattern matching methods or selection DAG techniques (as LLVM
backends do).

The instruction selection is still a work in progress in our compiler and we
only implement the most naive (and inefficient) technique: we simply generate
as many instructions as we need for each _individual_ IR instruction. Since we
do not select immediate sources directly, this in particular leads to ugly
looking code such as `mov (16) r2:f 1.f`.
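
As a hypothetical illustration (the register numbers here are invented, not
taken from actual compiler output), adding the constant `1.f` to a SIMD16
register first materializes the immediate in a register of its own:

    mov (16) r2:f 1.f        // materialize the immediate 1.f in r2
    add (16) r4:f r3:f r2:f  // use r2 where an immediate source would do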

Other than that, the instruction selection is really a bookkeeping structure.
We basically output `SelectionInstruction` objects which map 1-to-1 to the Gen
ISA encoding functions defined in `src/backend/gen_encoder.*pp`.

However, the `SelectionInstruction` objects still use unallocated virtual
registers and do *not* use vectors but simply tuples of virtual registers.
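
A minimal sketch of the idea (the field names and the `makeSend` helper are
illustrative assumptions, not the actual definitions in
`src/backend/gen_insn_selection.hpp`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch only: the real SelectionInstruction carries much
// more state (predication, execution width, instruction state, ...).
struct VirtualReg { uint32_t id; };   // unallocated virtual register

struct SelectionInstruction {
  int opcode;                         // mirrors one Gen encoder function
  std::vector<VirtualReg> dst;        // destination register tuple
  std::vector<VirtualReg> src;        // source register tuple
};

// A send-like instruction carries a *tuple* of virtual registers rather
// than a physically contiguous vector; contiguity is only enforced later
// by the register allocator.
SelectionInstruction makeSend(std::vector<VirtualReg> payload) {
  return SelectionInstruction{/*opcode=*/0x31, /*dst=*/{}, std::move(payload)};
}
```

The point is that nothing at this stage commits to physical registers: the
tuples are just ordered lists of virtual registers that the allocator must
later place contiguously.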

Register allocation
-------------------

The register allocation actually consists of two steps:

1. Handling the register vectors for all the instructions that require them

2. Performing the register allocation itself

Step 1 consists of scanning all the vectors required by send instructions.
Obviously, the same register may be used in different vectors and that may lead
to interferences. We simply sort the vectors from the largest to the smallest
and allocate them in that order. As an optimization we also identify
sub-vectors, i.e. vectors included in larger ones, and do not allocate them
separately.
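
The sort-then-skip idea can be sketched as follows (a simplified model, not
the actual `GenRegAllocator` code; a "vector" is reduced to an ordered list of
virtual register ids):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using RegVector = std::vector<uint32_t>;  // virtual register ids, in order

// True when `sub` appears as a contiguous run inside `big`, i.e. `sub`
// is a sub-vector and needs no allocation of its own.
static bool isSubVector(const RegVector &sub, const RegVector &big) {
  if (sub.size() > big.size()) return false;
  return std::search(big.begin(), big.end(),
                     sub.begin(), sub.end()) != big.end();
}

// Sort vectors from largest to smallest and keep only the ones that are
// not already contained in a previously kept (larger) vector.
std::vector<RegVector> vectorsToAllocate(std::vector<RegVector> vectors) {
  std::sort(vectors.begin(), vectors.end(),
            [](const RegVector &a, const RegVector &b) {
              return a.size() > b.size();
            });
  std::vector<RegVector> kept;
  for (const RegVector &v : vectors) {
    bool contained = false;
    for (const RegVector &k : kept)
      if (isSubVector(v, k)) { contained = true; break; }
    if (!contained) kept.push_back(v);
  }
  return kept;
}
```

Allocating the largest vectors first means any sub-vector found later can
simply reuse the offset it already has inside its container.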

The code could be improved significantly, in particular by also taking liveness
interference into account. Basically, a register may be shared by several
vectors if the registers that differ between the vectors at the same position
are not live at the same time.

This is still a work in progress. The code is currently handled by the method
`GenRegAllocator::allocateVector`.

Step 2 performs the register allocation itself, i.e. it associates each virtual
register with one (or several) physical registers. The first thing to note is
that the Gen register file is very flexible, i.e. it can (almost) be freely
partitioned. To handle this peculiarity, we simply implemented a free-list
based generic memory allocator, `RegisterFilePartitioner` in
`src/backend/context.cpp`.

We provide two directions of allocation: the tail-to-head direction is used for
normal registers, and the head-to-tail direction is used for the curbe payload
registers.
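
The two-direction scheme can be sketched with a toy partitioner (the real
`RegisterFilePartitioner` is a free-list allocator that also supports
deallocation; this bump-from-both-ends version only shows the directions):

```cpp
#include <cassert>
#include <cstdint>

// Toy sketch: the register file is a flat byte range; curbe payload
// registers grow from the head, normal registers grow from the tail.
class Partitioner {
  int32_t head;  // next free offset growing upward (payload side)
  int32_t tail;  // next free offset growing downward (normal side)
public:
  explicit Partitioner(int32_t size) : head(0), tail(size) {}

  // Returns the byte offset of the allocation, or -1 when the two
  // directions would collide (register file exhausted).
  int32_t allocate(int32_t size, bool fromHead) {
    if (head + size > tail) return -1;
    if (fromHead) { int32_t off = head; head += size; return off; }
    tail -= size;
    return tail;
  }
};
```

Keeping the payload at a fixed, low offset while normal registers fill in
from the other end means the two regions never need to be compacted around
each other.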

We then simply implemented a linear scan allocator (see
`gen_reg_allocation.cpp`). Spilling is implemented in the same file. The
heuristic we use is based on the register's liveness end point: whenever
possible, we spill the register with the largest end point. Although Gen can
spill four SIMD8 registers at once, we currently only spill one at a time.
This needs to be optimized later, at least for spilling vectors. A new backend
pass that gathers several spilled registers into one contiguous area may also
be worth doing. We could also consider the spilled registers' live intervals
to perform smarter scratch memory allocation and reduce the scratch memory
requirement.

Instruction scheduling
----------------------

Intra-basic block instruction scheduling is relatively simple. It is
implemented but has known bugs; further work is needed to fix them.

Instruction encoding
--------------------

This is mostly done in `src/backend/gen_context.cpp` and
`src/backend/gen_encoder.*pp`. This is mostly glue code and it is pretty
straightforward. We just forward the selection code using the physically
allocated registers. There is nothing special here, just boilerplate.

There are currently plenty of huge macro instructions in `gen_context.cpp`.
Most of them implement long/double support on Gen platforms that do not
support long/double at the hardware level. We may need to clean them up and
move those non-hardware related expansions into an upper layer: too many huge
macro instructions make register spilling and dead code elimination harder and
less efficient.