1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
|
/* Register allocation yet again
*
* alloc/spill/reload/free
*
* alloc: allocates
* spill: register won't be needed until I call reload
* reload: need a spilled register again
* free: I will never need this register again
*
* A register may be moved to another register across a spill/reload
* cycle.
*
* spill/reload store information in a "spill_info_t spill"
*
* Maybe spill_info should be variable?
*
* op = var_get_reg (&var, GP/XMM/etc);
* var_spill (&var);
* op = var_reload (&var)
* var_fini (&var);
*/
/*
- We need an "outer" driver from which constants can be requested.
API:
outer->register_constant_4x32 (outer, jit, 0x01010101)
outer->get_constant_4x32 (outer, jit, 0x01010101)
The outer driver will:
- at the end of the kernel, it will generate a sequence of constants that
can be addressed in a RIP relative way (What about x32?)
- at register time, it will allocate a register and load the constant
- get_constant() will then return either a memory location or a register
- Register allocator should support "get location" of a register. If
the register is spilled, the returned op will be a memory location.
If not, it will be a register.
- Biting the bullet on a real register allocator:
- Introduce new OP_VAR that represent a variable to be
register allocated.
- Introduce new OP_VAR_MEM{8,16,32,64} that correspond to
memory references where the involved registers are
all unallocated variables.
- register allocator hands out variables when given a register
pool
- Ghetto liveness analysis:
1. Variable is live between first assignment
and last read
2. For each jump target, for all variables live there
make them live at the jump source
3. Repeat 2 until it converges
- Linear scan allocation for each register class
- Run through code and generate allocated machine code
- Spilled variables will sometimes need to be loaded into
registers; when that happens, we need to be able to
- Information required from the assembler:
- which arguments are read and written
- in which cases is a register *required*?
- this second one could just be done by
adding an api to "check" an instruction:
check (I_mov, PTR (eax), PTR (ebx))
for example would not be allowed since there
are two memory references.
The technique for dealing with spilled variables
is to:
1. first check the instruction with memory
references
2. if that fails, then load one of the operands
into a register
3. if that fails, then load the other into
a register
4. if that fails, then load both into registers
(this should be rare/impossible)
- When a variable absolutely *must* be in a register, do it like this:
mov t1, [t2 + 8 * t3]
t1, t2 and t3 here *must* be in registers. Replace instruction with
MOV tt1, t1
MOV tt2, t2
MOV tt3, t3
MOV tt1, [tt2 + 8 * tt3]
MOV t1, tt1
and mark tt1, tt2, and tt3 as "unspillable" and (since they
are not live at the same time as their counterparts, "hint"
that they should be stored in the same register.
Implications:
- live ranges must be more exact (ie., must be able to
contain holes)
- variables can now be unspillable; as long as such
variables can only occur as the result of the
above construction, we should be fine.
- when the linear scan allocator encounters a
situation where there are unspillable variables, it
processes the unspillable variables first, assigning
them to their hints if they have one and their hint
was not itself spilled, and if that register is
available.
- mov elision: a mov between the same register should
be discarded.
Concepts:
- 'code': array of uint64, can be register allocated or not. If not,
it may contain OP_VAR and OP_VAR_MEM{8,16,32,64} operands.
- assembler: knows how to convert register allocated code to
machine code
- fragment: Can take 'code' and turn it into annotated machine
code. Should probably go away and become just 'code' since
the details of annotation are irrelevant to the outside
- register allocator: will hand out OP_VARs, knows how to
convert non-register-allocated into register-allocated. Is
specific to a register class.
- stack manager: will hand out stack offsets based on size etc.
Will also tell how much stack space is needed in total
- code manager: will hand out blocks of writable/executable
memory that can be written into. Will (eventually) deal with
handling ELF files. This will need to be OS specific.
- BEGIN_ASM/END_ASM should become BEGIN_CODE/END_CODE and just
return a pointer; maybe even
BEGIN_CODE (&code)
END_CODE ();
which will copy the code into malloced memory and store a
pointer to it in code. (It will be appended, so the memory
management needs to be a little more complex here).
- jit: Contains all the above, and will turn non-register
allocated code into a pointer to executable code.
The 'code' data structure:
- needs ability to be appended to from an array
- needs ability to be 'rewritten', ie., a linear scan through it
where instructions are inserted and removed
- needs to be concatenatable
It's tempting to say that the various operations will free and allocate:
- code = register_allocate (code)
will free the original code and return a new one, and
- executable = assembler_assemble (code, .... );
will free all the code arrays. There should also be a regular
code_free. So,
typedef struct
{
int n_elements;
uint64_t code[1];
} code_t;
and
register_allocator_t allocator;
register_alloc_init (&allocator, &stack_man);
op1 = register_alloc_new (&allocator, gp_pool);
op2 = register_alloc_new (&allocator, xmm_pool);
op3 = register_alloc_new (&allocator, xmm_pool);
code_t *code = code_new();
BEGIN_CODE (&code)
...,
END_CODE();
code = register_alloc_allocate (&gp_alloc, code);
Note: An interesting fix to x86 crazyness is to just turn
mov d, [b + x * k + i]
into
mov t1, x
shl t1, k
add t1, i
add t1, b
mov d, [t1]
and then have a peephole pass that recognizes the above and
turns it back into x86 addressing when possible.
Ie., recognize
live t1
mov r0, r1,
[shl r0, {1,2,3}],
[add r0, imm],
add r0, r2,
mov d, [t1]
dead t1
Flow:
- outer loop:
- generates outer loop
- src/mask/dest have pre-scanline setups called
- body is generated by destination iter
- src/mask/dest have post-scanline finalizers called
- destination iter:
- for each alignment loop, it generates an inner loop
- body of inner loop is generated by combiner
- combiner returns pixels or an indication that
no change is required
- destination iter writes back if necessary
- advances pointer
- combiner drives generation of pixels
- if there is a mask, then
- tell mask iter to load n pixels
- generate code to check if all pixels are 0
- if it is, then
- depending on operator
- return 0
- return "no change needed"
else:
- tell source iter to load n pixels
- add code to multiply them together
- depending on operator
- add code to check if a result is 0xff
- if it is, then
- return source
else:
- tell destination to read a pixel
- generate code to combine with source
- return result
tell mask iter to advance
else:
- tell source iter to load n pixels
- depending on operator
- add code to check if source is 0xff
- if it is, then
- return source
else:
- tell destination to read pixel
- generate code to combine with source
- return result
tell source iter to advance
-=-
Issues:
- The stack based register allocator is not a good fit for this
scheme. How about the concept of register "groups"?
At all times, one group is "active". Allocations are done in this
group; registers from other groups can't be used. If such registers
are needed, they must be reallocated in the new group.
When there is no longer enough free registers in a group, a
register from another group is kicked out to the stack. When that
other group becomes active, all kicked-out registers are moved
back in (lazily?). Note, you can never allocate more than the
number of real registers within one group.
API:
reg_alloc_init (&ra, stack_manager_t *sman, reg_set_t *regs);
reg_alloc_switch (&ra, "name");
op_t = reg_alloc_alloc (&ra);
// The register becomes allocated in the current group
// with its current value
reg_alloc_preserve (&ra, op_t);
// After a switch to another group, this make sure the
// given registers are reloaded from the stack.
reg_alloc_reload (&ra, ..., -1);
See reggroups.[ch] for a start
-=-
Iterator based JIT compiler
There is a new structure that can be either source or
destination. It contains function pointers that know how to generate
pixels. The function pointers:
Source iterators:
- begin(): generates code that runs before the kernel
- begin_line(): generates code that runs before each scanline
- get_one_pixel(): generates code that reads one pixel
- get_two_pixels(): --- two pixels
- get_four_pixels(): --- four pixels
- get_eight_pixels(): --- eight pixels
- get_sixteen_pixels(): ---
- end_line();
- end();
An iterator is associated with a format:
format_4_a8r8g8b8
format_16_a8
format_8_a8r8g8b8
and possibly others
The iterator is supposed to return pixels in that format. It will
never be asked to fetch more pixels than the format supports. Ie.,
get_sixteen_pixels() will never be called if the format is
format_4_a8r8g8b8().
All the get_n_pixels() return a register where they left the pixel.
There are also combiners. These are also associated with a format
and know how to combine pixels in that format with each other.
Destination iterators are responsible for driving a lot of the access?
They have a method: process_scanline (src_iter, mask_iter, combiner)
which is responsible for:
- generating the inner loop,
- calling src iter,
- callling mask iter,
- calling combiner
- writing back
They are specific to a format, and an intermediate format.
There may also be setup()/setup_line() methods required.
*/
|