Control Flow
============
Background
----------
Traditional Compilers
~~~~~~~~~~~~~~~~~~~~~
In most IRs, a function consists of *basic blocks* (often shortened to just
*blocks*), each of which is a series of instructions that always executes
linearly from beginning to end, and the *edges* that connect them. Each basic
block ends with a *jump* or *branch* that transfers execution either
unconditionally to one basic block or conditionally to two (or sometimes more)
blocks. Together, the basic blocks and edges comprise the *Control Flow Graph*,
or CFG. For example, this snippet of code:
::
while (foo) {
...
if (baz) {
...
} else {
...
}
if (bar) {
...
}
}
...
could be translated to a CFG that looks like:
.. graphviz::
digraph {
post [label="..."];
/* force 'post' to be at the bottom*/
{rank="sink" post}
header [label="foo?"];
block1 [label="...\nbaz?"];
then [label="...\nbar?"];
else [label="...\nbar?"];
block2 [label="...\nfoo?"];
header -> block1;
header -> post;
block1 -> then;
block1 -> else;
then -> header;
else -> header;
then -> block2;
else -> block2;
block2 -> block1;
block2 -> post;
}
Note that the CFG is a little different from the original code, since it's
been optimized somewhat; for example, we've folded the second ``if`` into the
then and else branches of the first. This is a good thing for traditional,
scalar hardware, but as we'll see, these types of optimizations are usually
unnecessary and sometimes harmful for GPUs. However, this is the standard
model that most literature on compiler theory assumes.
GPUs
~~~~
A unique feature of most GPUs is that they are designed to run many different
instances of the same program in lock step in order to reduce the size of the
scheduling hardware by sharing it between many different "cores." When control
flow *diverges*, meaning that two different instances (fragments, vertices,
etc.) branch in different directions, the GPU takes both sides of the branch.
For example, if both thread 1 and thread 2 are currently in block A, and
thread 1 wants to branch to block B while thread 2 wants to branch to block C,
then the GPU will first branch to block B with thread 1 enabled and thread 2
disabled. When execution reaches a predefined "merge block," the GPU will jump
back to block C with the enabled threads flipped and run until the merge block
is reached again, at which point control flow has *converged* and both thread
1 and thread 2 can be enabled.
Although most GPUs do something like this internally, the details of how it
works can vary greatly from vendor to vendor and generation to generation.
Some GPUs, such as the Mali T6xx series, give each instance a separate
program counter and don't keep track of divergence at all! All Intel chips
have instructions such as ``IF``, ``ELSE``, ``ENDIF``, and ``WHILE``, and on
all current generations, they're faster than the branch-with-merge
instructions described above (although they do a similar thing under the
hood). Also, the rules for when control flow re-converges can vary based on
the implementation as well as choices made in the backend.
There's one place besides modifying control flow where these details matter:
*cross-channel operations*. There are a few operations that GPUs can do where
separate instances (sometimes also called *channels* or *lanes*) can share
information. One important example is taking a derivative in screen space in
fragment shaders, where the value of the input is exchanged between fragments
in a 2x2 block. Derivatives can be taken either explicitly or implicitly, the
latter in order to calculate the appropriate level of detail when sampling
from a texture. Source languages such as GLSL require that these derivatives
be taken in *uniform control flow*, meaning that every channel must be
enabled or disabled together, so that the inputs to the derivative are
well-defined, and most backends take advantage of this guarantee. Therefore,
optimizations in any intermediate IR must respect this invariant, which means
that the IR must have a good idea of when control flow diverges. For example,
in the following snippet:
.. code-block:: c
vec4 color = texture(tex, coordinate * 2.0);
if (foo) {
/* the only time we ever use 'color' */
output = color;
}
we can only push the texture operation into the ``if`` if ``foo`` is
*uniform*, meaning it takes the same value for each fragment, since the
texture takes an implicit derivative.
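If ``foo`` is known to be uniform, the sink is legal, and the result would
look roughly like this (a sketch of the transformed code, not the output of
any particular pass):

.. code-block:: c

   if (foo) {
      /* safe only because 'foo' is uniform: every fragment in the 2x2
       * block reaches the implicit derivative together */
      vec4 color = texture(tex, coordinate * 2.0);
      output = color;
   }
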
The NIR Control Flow Model
--------------------------
In order to support many different backends, as well as maintain the
structured control information currently given by source languages such as
GLSL and required by some backends such as Intel's, NIR uses a control flow
model that explicitly contains structured control flow elements such as loops
and if statements. This approach also gives a clear model for when control
flow converges and diverges that's guaranteed to be supported by every GPU.
Nevertheless, there's still a need to support the vast existing literature
that takes basic blocks as fundamental, so NIR includes basic blocks as a
primitive as well. Control flow in NIR consists of a *control flow tree* whose
elements are if statements and infinite loops, and whose leaves are basic
blocks. In addition, the *successors* of each block, which are the blocks that
a given block branches to, and the *predecessors* of each block, which are the
blocks that branch to a given block, are always kept up-to-date. Finally,
there's the *start block*, which is the first block in the function, and the
*end block*, which is a fake basic block that return statements point to.
Together, the start block, the end block, and the graph created by the
successors and predecessors form the control flow graph that complements the
control flow tree. For example, this:
.. code-block:: c
void main(void) {
if (foo)
return;
while (bar) {
if (baz)
continue;
...
}
}
would become, in NIR:
.. graphviz::
digraph {
clusterrank="local";
subgraph cluster_main {
style="solid";
label="main";
start [label="(start)"];
start -> then1_block;
start -> else1_block;
subgraph cluster_if1 {
style="filled";
fillcolor="lightgrey";
label="if (foo)";
subgraph cluster_then1 {
label="then";
fillcolor="white";
then1_block [label="return"];
}
subgraph cluster_else1 {
label="else";
fillcolor="white";
else1_block [label="(empty)"];
}
}
pre_loop [label="(empty)"];
else1_block -> pre_loop;
pre_loop -> loop_header;
subgraph cluster_loop {
style="filled";
fillcolor="lightgrey";
label="loop";
loop_header [label="(empty)"];
then3_block -> loop_header [constraint=false];
loop_header -> then2_block;
loop_header -> else2_block;
subgraph cluster_if2 {
fillcolor="white";
label="if (!bar)";
subgraph cluster_then2 {
fillcolor="lightgrey";
label="then";
then2_block [label="break"];
}
subgraph cluster_else2 {
fillcolor="lightgrey";
label="else";
else2_block [label="(empty)"];
}
}
loop_middle_block [label="(empty)"];
else2_block -> loop_middle_block;
loop_middle_block -> then3_block;
loop_middle_block -> else3_block;
subgraph cluster_if3 {
fillcolor="white";
label="if (baz)";
subgraph cluster_then3 {
fillcolor="lightgrey";
label="then";
then3_block [label="continue"];
}
subgraph cluster_else3 {
fillcolor="lightgrey";
label="else";
else3_block [label="(empty)"];
}
}
loop_end_block [label="...", rank="max"];
else3_block -> loop_end_block;
loop_end_block -> loop_header [constraint=false];
}
post_loop [label="(empty)"];
then2_block -> post_loop;
loop_end_block -> post_loop [style="invis"];
post_loop -> end_block;
then1_block -> end_block;
end_block [label="(end)"];
}
}
where the if statements and loops are represented by boxes and the basic
blocks are represented by ovals. One thing that may be initially surprising is
that if statements always have an "else" portion containing at least one
(possibly empty) basic block -- that is, if-then statements are turned into
if-then-else statements. This helps optimizations that push operations into if
statements, since there could be a statement that only ever executes when the
condition is false, and adding the empty block creates a place to which those
statements can be moved. At the basic block level, creating the empty block
removes a *critical edge*, which is an edge from a block with more than one
successor to another block with more than one predecessor. Consider this
if-then statement:
.. code-block:: c
if (foo) {
bar = 1;
}
...
and its basic block representation:
.. graphviz::
digraph {
pre [label="foo?"];
then [label="bar = 1;"];
post [label="..."];
pre -> then;
pre -> post [color="red"];
then -> post;
}
The red edge is a critical edge, since it leaves a block with two successors
and enters a block with two predecessors. Before running optimizations like
Partial Redundancy Elimination (PRE) and Global Code Motion (GCM), whose aim
is to move code into less frequently executed paths, most compilers will
*split* the critical edge by inserting an empty basic block:
.. graphviz::
digraph {
pre [label="foo?"];
then [label="bar = 1;"];
else [label="(empty)"];
post [label="..."];
pre -> then;
pre -> else;
then -> post;
else -> post;
}
However, in basic-block-focused compilers, keeping critical edges split all
the time would interfere with other optimizations that aim to reduce the
number of jumps that have to be executed. But because NIR keeps control flow
structured, those sorts of optimizations are either done very differently or
not done at all, and therefore it makes sense to always keep critical edges
split. It's for the same reason that NIR doesn't have the "predicated break"
and "predicated continue" instructions that most GPUs support: they add
critical edges to the CFG and prevent the compiler from moving code so that it
executes only when the break or continue is taken. In both cases, it's easy
enough for the backend to perform the optimizations to remove the extra blocks
if necessary.
We've now explained why most of the extra empty basic blocks were inserted in
the example NIR control flow, but there's still one left. There's an empty
block between the first if statement and the loop, so that the then and else
branches jump to the empty block and then to the first block of the loop,
instead of jumping directly to the loop. Clearly, it isn't there to remove a
critical edge. So why insert it? Well, imagine that there was a statement in
the loop that we determined to be *loop-invariant*, so that we could move it
outside the loop, but that was used inside the loop, so we couldn't move it
after the loop. The empty block before the loop then comes in handy as a place
to move it. Just as splitting critical edges helps optimizations such as PRE,
inserting so-called *padding blocks* before and after the loop can help
optimizations that do Loop-Invariant Code Motion (LICM), including GCM.
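As a concrete sketch (the names below are made up purely for illustration),
this is the kind of transformation the padding block enables:

.. code-block:: c

   extern int a, b, bar;
   void use(int x);

   /* Before LICM: 'a + b' is recomputed on every iteration even though it
    * doesn't depend on anything in the loop. */
   void before_licm(void)
   {
      while (bar) {
         int x = a + b;
         use(x);
      }
   }

   /* After LICM: the computation is hoisted into the padding block that
    * sits right before the loop; it can't be moved after the loop because
    * 'x' is used inside the loop. */
   void after_licm(void)
   {
      int x = a + b;
      while (bar) {
         use(x);
      }
   }
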
Putting it Together
~~~~~~~~~~~~~~~~~~~
We can put all the rules we've created into a guide for constructing the
control flow tree. To do this, we'll need a few different data types:
* A *control flow node* (often shortened to "cf node" and defined as
``nir_cf_node`` in nir.h) is the base class for everything in the control
flow tree. It can be a loop, an if statement, or a basic block.
* A *control flow list* (often shortened to "cf list") is a list of control
flow nodes that corresponds to a series of statements in GLSL. It's used to
represent the body of a function and a loop as well as the then and else
branches of an if statement. In NIR, it's implemented as an intrusive
linked list.
* An *if statement* (defined as ``nir_if``) contains a control flow list for
the then and else branches as well as a condition.
* A *loop* (defined as ``nir_loop``) is an infinite loop (the only way to
exit is through ``break`` statements). It only contains a control flow list
representing the body.
* A *basic block*, in addition to its previous definition, is now a leaf of
the control-flow tree. In NIR, basic blocks are defined in a structure
called ``nir_block``.
as well as two rules, which together cover both the if-then-else and loop
padding situations: a control flow list must begin and end with a basic block,
and must contain exactly one block between each pair of non-block control
flow nodes (i.e. loops or if statements). That is, control flow lists must
look like:
::
block
loop/if
block
loop/if
...
loop/if
block
and they have to contain at least one (possibly empty) basic block.
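For reference, here is a simplified sketch of roughly what these data types
look like in C; the fields shown are abridged and may not match the current
nir.h exactly, so consult the header for the authoritative definitions:

.. code-block:: c

   typedef enum {
      nir_cf_node_block,
      nir_cf_node_if,
      nir_cf_node_loop,
      nir_cf_node_function
   } nir_cf_node_type;

   /* Base class for everything in the control flow tree. */
   typedef struct nir_cf_node {
      struct exec_node node;      /* link in the parent's control flow list */
      nir_cf_node_type type;
      struct nir_cf_node *parent; /* the enclosing cf node, if any */
   } nir_cf_node;

   /* An if statement: a condition plus two control flow lists. */
   typedef struct {
      nir_cf_node cf_node;
      nir_src condition;
      struct exec_list then_list; /* cf list for the "then" branch */
      struct exec_list else_list; /* cf list for the "else" branch */
   } nir_if;

   /* An infinite loop: just a control flow list for the body. */
   typedef struct {
      nir_cf_node cf_node;
      struct exec_list body;
   } nir_loop;
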
Finally, there is a class of instructions called "jump instructions," defined
as ``nir_jump_instr`` in nir.h, which is how breaks, continues, and returns
are represented in NIR. Note that "multilevel breaks" and "multilevel
continues", i.e. jumping out to a loop other than the innermost one, are
currently not supported, although they may be in the future. There can be at
most one jump instruction per basic block, and it must be at the end of the
block.
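Again as a simplified sketch (abridged from nir.h; the exact definition may
differ between versions), a jump instruction is just an ordinary instruction
tagged with which kind of jump it is:

.. code-block:: c

   typedef enum {
      nir_jump_return,   /* jump to the end block of the function */
      nir_jump_break,    /* jump past the end of the innermost loop */
      nir_jump_continue  /* jump back to the top of the innermost loop */
   } nir_jump_type;

   typedef struct {
      nir_instr instr;
      nir_jump_type type;
   } nir_jump_instr;
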
If you aren't sure, you should go and convince yourself that the example NIR
control flow given earlier satisfies all these rules, in addition to being
free of critical edges.
Modifying Control Flow
----------------------
We've seen that there are two complementary ways of describing control flow
in NIR, the control flow tree and the control flow graph, which contain
redundant information. To ease the burden of keeping both forms up-to-date,
core NIR provides a number of helpers for modifying control flow. They allow
you to manipulate the program as if it consisted of a series of statements,
like in GLSL, while "under the hood" they guarantee that the control flow
tree is in the correct form and that the successors and predecessors of the
basic blocks involved are updated. Currently, these functions include:
* ``nir_cf_node_insert_before``
* ``nir_cf_node_insert_after``
* ``nir_cf_node_insert_begin``
* ``nir_cf_node_insert_end``
* ``nir_cf_node_remove``
For details, see nir.h.
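As a rough sketch of how these helpers fit together (the exact prototypes
have changed over time, so treat the calls below as illustrative and check
nir.h for the real signatures), a hypothetical pass that deletes a dead if
statement and appends a new loop to a function might look like:

.. code-block:: c

   /* Hypothetical example: 'dead_if' and 'new_loop' are assumed to already
    * exist; the point is only to show that the helpers do the bookkeeping. */
   static void
   rewrite_example(nir_function_impl *impl, nir_if *dead_if, nir_loop *new_loop)
   {
      /* Detach the if statement from the control flow tree; the helper keeps
       * the tree well-formed and the successors/predecessors of the
       * surrounding blocks up to date, as described above. */
      nir_cf_node_remove(&dead_if->cf_node);

      /* Append the loop to the end of the function's body (a control flow
       * list); any required padding blocks are maintained for us. */
      nir_cf_node_insert_end(&impl->body, &new_loop->cf_node);
   }
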