I965ScalarSSA.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235

This page is a collection of thoughts on how to best do SSA form in the
scalar backend IR on i965.  There are a variety of different issues
discussed here.  I've tried my best to put them into a reasonable order.

# General Considerations

There are a number of general things that will probably need to change in
order to nicely handle SSA form.

 1. Register numbers.  Right now, we "allocate" an integer name for each
    register and treat each value as if it comes out of a magical
    variable-length register file.  For SSA values, we will probably want
    to do something similar to NIR where we just point back to the
    instruction that generates the value.

# Validation

One of the nicest things in all of NIR is the validator.  A validation
layer cannot check that the shader is correct, but it can check that it is
in the right form.  Having a good validator has been a massive help in NIR
and we really should repeat it in the backend.

# Registers

Most of the interesting bits of SSA have to do with how you handle
registers/values.

## Intel GEN registers

On Intel graphics, we 128 registers each of which holds 32 bytes.  These
32 bytes can be carved into 8 floats, 16 half-floats, 32 1-byte values, or
4 doubles.  Since we are working on a SIMD architecture, each register
corresponds to a single float value in the source language and the 8 float
values in the register correspond to the 8 SIMD channels.  Our architecture
can also execute in SIMD16 and SIMD32 mode where you need 2 or 4 registers
respectively for a single float value.

There are some cases (such as messages) where an instruction reads up to 14
consecutive registers.  Therefore, we can't just allocate registers
one-at-a-time.  We can also indirectly index registers.  This allows us to
implement arrays by allocating a block of contiguous registers.

Our hardware also has a register regioning system that allows you to access
up to two physical registers with a single instruction (so you can get 8
doubles or 16 floats).

More information about register regioning can be found in any Intel
graphics PRM.  The one for Ivy Bridge is pretty easy to navigate.

## Register sizes

Every register has a concept of size.  Hardware registers (physical or
otherwise) and messages will be specified in terms of a number of physical
registers.  Other registers have a three-dimensional size; a number of
bits, a width, and a length.  The number of bits is fairly obvious: 32 for
float/int, 16 for hf, 64 for double, etc.  The width is the number of SIMD
channels stored in the register.

## Execution masks and interference

Most values, in particular anything in a GLSL variable, exist in a
per-SIMD-channel way.  These registers can be treated with the normal
SSA-based interference rules.  Consider the following example:

    float a = 1.0;
    if (foo) {
        b = 0.5;
        use(b);
    } else {
        use(a);
    }

If `foo` has different values for different SIMD channels, then the only
channels of `b` that are set to 0.5 and used will be the channels where
`foo` is true.  Similarly, the only channels of `a` that will be used are
the channels where `foo` is false.  Therefore, we can freely assign `a` and
`b` to the same register without any harm.

The situation is quite different if the assignment to `b` happens to all
channels regardless of the execution mask.  In this case, assigning `a` and
`b` to the same register will cause the values we want in `a` to be
squashed by the assignment to be so they have to be in separate registers.

For the sake of clearer discussion, will use the term **SIMD-local** to
refer to a register that is only used in a per-SIMD-channel way and
**SIMD-global** to refer to anything else.

### Bit sizes and exec offsets.

These issues become even more interesting when you throw in quarter control
and non-32-bit types.  With these added complexities, the execution masks
do not always apply to the same 32-bit slice of any given register.
Instead, it may apply to 16-bit or 64-bit slices.  For alternate quarter
controls, you can have the second half of the mask apply to the given set
of 32-bit slices instead of the first.

## Register files

In the current backend IR's, we have a variety of different register files
that indicate different things.  This has worked pretty well, and we
probably want to do the same type of thing in a new IR.  I think we want
the following register files:

 * **ATTR**: Push attribute registers.  These registers will be pushed by
   the vertex fetching hardware and are considered to be "defined" at the
   start of the program.  These are per-SIMD-channel and as such, have a
   width equal to the dispatch width.

 * **UNIFORM**: Push uniform registers.  These registers store uniform push
   constants that are loaded by the hardware for use and are considered to
   be "defined" at the start of the program.  These are uniform values and
   as such always have a width of 1.

 * **HW_REG**: An actual physical hardware register.  These are needed for
   a variety of setup tasks.  Usually, we only use g0 and g1, but there are
   other times when we need a particular physical register.  Hardware
   registers don't really have a width per se but we do need some way of
   specifying what hardware registers get touched.  Hardware registers will
   allow full register regioning.  Reads/writes to physcal hardware
   registers cannot be reordered.

 * **Allocated HW_REG**: An allocated hardware register is similar to a
   hardware register in that it allows multiple writes, partial writes,
   full register regioning, and are SIMD-global.  The primary difference
   between allocated and physical hardware registers is that allocated
   hardware registers get their actual register number assigned by the
   allocator.  In order to allow for SSA-based allocation, any given
   allocated hardware register may only be used in a single basic block.
   (It will be considered to be defined at the first use/def and all other
   use/defs will be considered uses.)  We may implement this by simply
   using a register number of -1 to indicate that it should be allocated.

 * **MRF**: Message registers.  These are only ever defined by a
   LOAD_PAYLOAD type operation and used by send messages.  On SNB-, these
   will be allocated out of a special pool of 16 registers (24 on SNB?).
   On IVB+, these will be allocated out of the same pool as the rest of the
   registers.  Messages are special in that they are partially SIMD-local
   and partially SIMD-global.  How will we handle this in the allocator?  I
   have no idea.

 * **IMM**: Immediate values.  We could do what NIR does and handle
   immediates by using a constant load instruction.  This may be a good
   idea, but it doesn't map very well to the hardware.  It's probably
   better to use a special register file/type for these.

 * **FLAG**: Boolean flag values.  These store boolean-only SSA values and
   get allocated into one of the available flag registers.  Since we have
   very few flag registers, we may have to "spill" a flag value to a
   regular SSA value by inserting a move-to-flag shortly before the uses of
   the flag value.  However, we do want the ability to track flag values
   explicitly rather than the current "writes flag" boolean.  There is a
   tricky bit here because the CMP instruction *always* writes the flag
   whether you want it to or not so we can't quite use SSA interference for
   this.

 * **SSA**: These are the regular values.  They have a number of bits,
   a width (1, 8, 16, or 32), a quarter (first half, second quarter, etc.)
   and a length.  In general, the width will be 1 for values that are
   uniform and 8, 16, or 32 for non-uniform values.  Non-uniform value
   with a width less than the dispatch width will also have a specified
   quarter to denote which SIMD channels it corresponds to.  Unlike
   hardware registers, partial writes and interesting regions are *not*
   allowed on SSA values.  Also, SSA values have exactly one definition and
   this definition must dominate its entire live-range.

One more we *may* want, but I'm not sure what to do with it:

 * **ACCUM**: We also have an accumulator register.  Some instructions just
   write to this register by default.  If we use the accumulator value, we
   need to be very careful with re-ordering of instructions.  I don't know
   if we want to track this as an SSA value or not, but we do need to track
   it.

# Register Allocation

One of the big advantages to SSA in the backend IR is the ability to use
SSA for doing register allocation.  More than the allocation itself
SSA-based allocation schemes gives you the ability to know the exact
register pressure at any point in the program and make decisions based on
it.  The benefits of this apply to many areas of the backend compiler such
as value numbering, scheduling, spilling, and many more.  If we can, we
really want an SSA-based scheme.

## SSA-based allocation

The problem here is that  most SSA-based register allocation schemes are
fairly thoroughly CPU-focused.  In our setting, there are some additional
constraints:

 1. Register sizes.  In the usual CPU setting, each register is a single
    atomic entity.  In the GPU setting, however, we have to deal with
    allocating blocks of contiguous registers.  This means that we have to
    solve a packing problem.

 2. SIMD masking.  In the usual CPU setting, if two registers don't
    interfere in the SSA sense of interference, you can assign them to the
    same register.  In the SIMD setting,  SIMD-global registers interfere
    with more than just what the dominance tree tells you.  Also, registers
    with different bit-sizes and different quarters can interfere even if
    they don't interfere in the usual SSA sense.

Connor and I (Jason) have discussed this quite a bit and have come up with
two different schemes that seem to at least start to address these
problems.  However, the problem is very far from being solved.

## In the mean time...

Before we actually solve the SSA-based register allocation problem, we can
probably do better than we are today by using a hybrid approach.  This
would be to combine both SSA-based interference with the interval-based
interference we use today.  One way to do this would be as follows:

    bool
    regs_interfere(reg *a, reg *b)
    {
        if (regs_ssa_interfere(a, b))
            return true;

        if (!regs_interval_interfere(a, b))
            return false;

        return a->width != b->width ||
               a->bits != b->bits ||
               a->quarter != b->quarter ||
               a->simd_global || b->simd_global;
    }

This would get us an interference graph with fewer edges than the purely
interval-based approach but would also allow us to avoid the sharp edges
given by working on a SIMD architecture.  We would then use the same
register allocator that we have today.

The problem with this approach is, of course, that we wouldn't get the nice
register pressure properties of *real* SSA-based register allocation.
However, it may save us a few registers and gain us some SIMD16 programs.