1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
|
This page is a collection of thoughts on how to best do SSA form in the
scalar backend IR on i965. There are a variety of different issues
discussed here. I've tried my best to put them into a reasonable order.
# General Considerations
There are a number of general things that will probably need to change in
order to nicely handle SSA form.
1. Register numbers. Right now, we "allocate" an integer name for each
register and treat each value as if it comes out of a magical
variable-length register file. For SSA values, we will probably want
to do something similar to NIR where we just point back to the
instruction that generates the value.
# Validation
One of the nicest things in all of NIR is the validator. A validation
layer cannot check that the shader is correct, but it can check that it is
in the right form. Having a good validator has been a massive help in NIR
and we really should repeat it in the backend.
# Registers
Most of the interesting bits of SSA have to do with how you handle
registers/values.
## Intel GEN registers
On Intel graphics, we 128 registers each of which holds 32 bytes. These
32 bytes can be carved into 8 floats, 16 half-floats, 32 1-byte values, or
4 doubles. Since we are working on a SIMD architecture, each register
corresponds to a single float value in the source language and the 8 float
values in the register correspond to the 8 SIMD channels. Our architecture
can also execute in SIMD16 and SIMD32 mode where you need 2 or 4 registers
respectively for a single float value.
There are some cases (such as messages) where an instruction reads up to 14
consecutive registers. Therefore, we can't just allocate registers
one-at-a-time. We can also indirectly index registers. This allows us to
implement arrays by allocating a block of contiguous registers.
Our hardware also has a register regioning system that allows you to access
up to two physical registers with a single instruction (so you can get 8
doubles or 16 floats).
More information about register regioning can be found in any Intel
graphics PRM. The one for Ivy Bridge is pretty easy to navigate.
## Register sizes
Every register has a concept of size. Hardware registers (physical or
otherwise) and messages will be specified in terms of a number of physical
registers. Other registers have a three-dimensional size; a number of
bits, a width, and a length. The number of bits is fairly obvious: 32 for
float/int, 16 for hf, 64 for double, etc. The width is the number of SIMD
channels stored in the register.
## Execution masks and interference
Most values, in particular anything in a GLSL variable, exist in a
per-SIMD-channel way. These registers can be treated with the normal
SSA-based interference rules. Consider the following example:
float a = 1.0;
if (foo) {
b = 0.5;
use(b);
} else {
use(a);
}
If `foo` has different values for different SIMD channels, then the only
channels of `b` that are set to 0.5 and used will be the channels where
`foo` is true. Similarly, the only channels of `a` that will be used are
the channels where `foo` is false. Therefore, we can freely assign `a` and
`b` to the same register without any harm.
The situation is quite different if the assignment to `b` happens to all
channels regardless of the execution mask. In this case, assigning `a` and
`b` to the same register will cause the values we want in `a` to be
squashed by the assignment to be so they have to be in separate registers.
For the sake of clearer discussion, will use the term **SIMD-local** to
refer to a register that is only used in a per-SIMD-channel way and
**SIMD-global** to refer to anything else.
### Bit sizes and exec offsets.
These issues become even more interesting when you throw in quarter control
and non-32-bit types. With these added complexities, the execution masks
do not always apply to the same 32-bit slice of any given register.
Instead, it may apply to 16-bit or 64-bit slices. For alternate quarter
controls, you can have the second half of the mask apply to the given set
of 32-bit slices instead of the first.
## Register files
In the current backend IR's, we have a variety of different register files
that indicate different things. This has worked pretty well, and we
probably want to do the same type of thing in a new IR. I think we want
the following register files:
* **ATTR**: Push attribute registers. These registers will be pushed by
the vertex fetching hardware and are considered to be "defined" at the
start of the program. These are per-SIMD-channel and as such, have a
width equal to the dispatch width.
* **UNIFORM**: Push uniform registers. These registers store uniform push
constants that are loaded by the hardware for use and are considered to
be "defined" at the start of the program. These are uniform values and
as such always have a width of 1.
* **HW_REG**: An actual physical hardware register. These are needed for
a variety of setup tasks. Usually, we only use g0 and g1, but there are
other times when we need a particular physical register. Hardware
registers don't really have a width per se but we do need some way of
specifying what hardware registers get touched. Hardware registers will
allow full register regioning. Reads/writes to physcal hardware
registers cannot be reordered.
* **Allocated HW_REG**: An allocated hardware register is similar to a
hardware register in that it allows multiple writes, partial writes,
full register regioning, and are SIMD-global. The primary difference
between allocated and physical hardware registers is that allocated
hardware registers get their actual register number assigned by the
allocator. In order to allow for SSA-based allocation, any given
allocated hardware register may only be used in a single basic block.
(It will be considered to be defined at the first use/def and all other
use/defs will be considered uses.) We may implement this by simply
using a register number of -1 to indicate that it should be allocated.
* **MRF**: Message registers. These are only ever defined by a
LOAD_PAYLOAD type operation and used by send messages. On SNB-, these
will be allocated out of a special pool of 16 registers (24 on SNB?).
On IVB+, these will be allocated out of the same pool as the rest of the
registers. Messages are special in that they are partially SIMD-local
and partially SIMD-global. How will we handle this in the allocator? I
have no idea.
* **IMM**: Immediate values. We could do what NIR does and handle
immediates by using a constant load instruction. This may be a good
idea, but it doesn't map very well to the hardware. It's probably
better to use a special register file/type for these.
* **FLAG**: Boolean flag values. These store boolean-only SSA values and
get allocated into one of the available flag registers. Since we have
very few flag registers, we may have to "spill" a flag value to a
regular SSA value by inserting a move-to-flag shortly before the uses of
the flag value. However, we do want the ability to track flag values
explicitly rather than the current "writes flag" boolean. There is a
tricky bit here because the CMP instruction *always* writes the flag
whether you want it to or not so we can't quite use SSA interference for
this.
* **SSA**: These are the regular values. They have a number of bits,
a width (1, 8, 16, or 32), a quarter (first half, second quarter, etc.)
and a length. In general, the width will be 1 for values that are
uniform and 8, 16, or 32 for non-uniform values. Non-uniform value
with a width less than the dispatch width will also have a specified
quarter to denote which SIMD channels it corresponds to. Unlike
hardware registers, partial writes and interesting regions are *not*
allowed on SSA values. Also, SSA values have exactly one definition and
this definition must dominate its entire live-range.
One more we *may* want, but I'm not sure what to do with it:
* **ACCUM**: We also have an accumulator register. Some instructions just
write to this register by default. If we use the accumulator value, we
need to be very careful with re-ordering of instructions. I don't know
if we want to track this as an SSA value or not, but we do need to track
it.
# Register Allocation
One of the big advantages to SSA in the backend IR is the ability to use
SSA for doing register allocation. More than the allocation itself
SSA-based allocation schemes gives you the ability to know the exact
register pressure at any point in the program and make decisions based on
it. The benefits of this apply to many areas of the backend compiler such
as value numbering, scheduling, spilling, and many more. If we can, we
really want an SSA-based scheme.
## SSA-based allocation
The problem here is that most SSA-based register allocation schemes are
fairly thoroughly CPU-focused. In our setting, there are some additional
constraints:
1. Register sizes. In the usual CPU setting, each register is a single
atomic entity. In the GPU setting, however, we have to deal with
allocating blocks of contiguous registers. This means that we have to
solve a packing problem.
2. SIMD masking. In the usual CPU setting, if two registers don't
interfere in the SSA sense of interference, you can assign them to the
same register. In the SIMD setting, SIMD-global registers interfere
with more than just what the dominance tree tells you. Also, registers
with different bit-sizes and different quarters can interfere even if
they don't interfere in the usual SSA sense.
Connor and I (Jason) have discussed this quite a bit and have come up with
two different schemes that seem to at least start to address these
problems. However, the problem is very far from being solved.
## In the mean time...
Before we actually solve the SSA-based register allocation problem, we can
probably do better than we are today by using a hybrid approach. This
would be to combine both SSA-based interference with the interval-based
interference we use today. One way to do this would be as follows:
bool
regs_interfere(reg *a, reg *b)
{
if (regs_ssa_interfere(a, b))
return true;
if (!regs_interval_interfere(a, b))
return false;
return a->width != b->width ||
a->bits != b->bits ||
a->quarter != b->quarter ||
a->simd_global || b->simd_global;
}
This would get us an interference graph with fewer edges than the purely
interval-based approach but would also allow us to avoid the sharp edges
given by working on a SIMD architecture. We would then use the same
register allocator that we have today.
The problem with this approach is, of course, that we wouldn't get the nice
register pressure properties of *real* SSA-based register allocation.
However, it may save us a few registers and gain us some SIMD16 programs.
|