/* Register allocation yet again * * alloc/spill/reload/free * * alloc: allocates * spill: register won't be needed until I call reload * reload: need a spilled register again * free: I will never need this register again * * A register may be moved to another register across a spill/reload * cycle. * * spill/reload store information in a "spill_info_t spill" * * Maybe spill_info should be variable? * * op = var_get_reg (&var, GP/XMM/etc); * var_spill (&var); * op = var_reload (&var) * var_fini (&var); */ /* - We need an "outer" driver from which constants can be requested. API: outer->register_constant_4x32 (outer, jit, 0x01010101) outer->get_constant_4x32 (outer, jit, 0x01010101) The outer driver will: - at the end of the kernel, it will generate a sequence of constants that can be addressed in a RIP relative way (What about x32?) - at register time, it will allocate a register and load the constant - get_constant() will then return either a memory location or a register - Register allocator should support "get location" of a register. If the register is spilled, the returned op will be a memory location. If not, it will be a register. - Biting the bullet on a real register allocator: - Introduce new OP_VAR that represent a variable to be register allocated. - Introduce new OP_VAR_MEM{8,16,32,64} that correspond to memory references where the involved registers are all unallocated variables. - register allocator hands out variables when given a register pool - Ghetto liveness analysis: 1. Variable is live between first assignment and last read 2. For each jump target, for all variables live there make them live at the jump source 3. Repeat 2 until it converges - Linear scan allocation for each register class - Run through code and generate allocated machine code - Spilled variables will sometimes need to be loaded into registers; when that happens, we need to be able to - Information required from the assembler: - which arguments are read and written - in which cases is a register *required*? - this second one could just be done by adding an api to "check" an instruction: check (I_mov, PTR (eax), PTR (ebx)) for example would not be allowed since there are two memory references. The technique for dealing with spilled variables is to: 1. first check the instruction with memory references 2. if that fails, then load one of the operands into a register 3. if that fails, then load the other into a register 4. if that fails, then load both into registers (this should be rare/impossible) - When a variable absolutely *must* be in a register, do it like this: mov t1, [t2 + 8 * t3] t1, t2 and t3 here *must* be in registers. Replace instruction with MOV tt1, t1 MOV tt2, t2 MOV tt3, t3 MOV tt1, [tt2 + 8 * tt3] MOV t1, tt1 and mark tt1, tt2, and tt3 as "unspillable" and (since they are not live at the same time as their counterparts, "hint" that they should be stored in the same register. Implications: - live ranges must be more exact (ie., must be able to contain holes) - variables can now be unspillable; as long as such variables can only occur as the result of the above construction, we should be fine. - when the linear scan allocator encounters a situation where there are unspillable variables, it processes the unspillable variables first, assigning them to their hints if they have one and their hint was not itself spilled, and if that register is available. - mov elision: a mov between the same register should be discarded. Concepts: - 'code': array of uint64, can be register allocated or not. If not, it may contain OP_VAR and OP_VAR_MEM{8,16,32,64} operands. - assembler: knows how to convert register allocated code to machine code - fragment: Can take 'code' and turn it into annotated machine code. Should probably go away and become just 'code' since the details of annotation are irrelevant to the outside - register allocator: will hand out OP_VARs, knows how to convert non-register-allocated into register-allocated. Is specific to a register class. - stack manager: will hand out stack offsets based on size etc. Will also tell how much stack space is needed in total - code manager: will hand out blocks of writable/executable memory that can be written into. Will (eventually) deal with handling ELF files. This will need to be OS specific. - BEGIN_ASM/END_ASM should become BEGIN_CODE/END_CODE and just return a pointer; maybe even BEGIN_CODE (&code) END_CODE (); which will copy the code into malloced memory and store a pointer to it in code. (It will be appended, so the memory management needs to be a little more complex here). - jit: Contains all the above, and will turn non-register allocated code into a pointer to executable code. The 'code' data structure: - needs ability to be appended to from an array - needs ability to be 'rewritten', ie., a linear scan through it where instructions are inserted and removed - needs to be concatenatable It's tempting to say that the various operations will free and allocate: - code = register_allocate (code) will free the original code and return a new one, and - executable = assembler_assemble (code, .... ); will free all the code arrays. There should also be a regular code_free. So, typedef struct { int n_elements; uint64_t code[1]; } code_t; and register_allocator_t allocator; register_alloc_init (&allocator, &stack_man); op1 = register_alloc_new (&allocator, gp_pool); op2 = register_alloc_new (&allocator, xmm_pool); op3 = register_alloc_new (&allocator, xmm_pool); code_t *code = code_new(); BEGIN_CODE (&code) ..., END_CODE(); code = register_alloc_allocate (&gp_alloc, code); Note: An interesting fix to x86 crazyness is to just turn mov d, [b + x * k + i] into mov t1, x shl t1, k add t1, i add t1, b mov d, [t1] and then have a peephole pass that recognizes the above and turns it back into x86 addressing when possible. Ie., recognize live t1 mov r0, r1, [shl r0, {1,2,3}], [add r0, imm], add r0, r2, mov d, [t1] dead t1 Flow: - outer loop: - generates outer loop - src/mask/dest have pre-scanline setups called - body is generated by destination iter - src/mask/dest have post-scanline finalizers called - destination iter: - for each alignment loop, it generates an inner loop - body of inner loop is generated by combiner - combiner returns pixels or an indication that no change is required - destination iter writes back if necessary - advances pointer - combiner drives generation of pixels - if there is a mask, then - tell mask iter to load n pixels - generate code to check if all pixels are 0 - if it is, then - depending on operator - return 0 - return "no change needed" else: - tell source iter to load n pixels - add code to multiply them together - depending on operator - add code to check if a result is 0xff - if it is, then - return source else: - tell destination to read a pixel - generate code to combine with source - return result tell mask iter to advance else: - tell source iter to load n pixels - depending on operator - add code to check if source is 0xff - if it is, then - return source else: - tell destination to read pixel - generate code to combine with source - return result tell source iter to advance -=- Issues: - The stack based register allocator is not a good fit for this scheme. How about the concept of register "groups"? At all times, one group is "active". Allocations are done in this group; registers from other groups can't be used. If such registers are needed, they must be reallocated in the new group. When there is no longer enough free registers in a group, a register from another group is kicked out to the stack. When that other group becomes active, all kicked-out registers are moved back in (lazily?). Note, you can never allocate more than the number of real registers within one group. API: reg_alloc_init (&ra, stack_manager_t *sman, reg_set_t *regs); reg_alloc_switch (&ra, "name"); op_t = reg_alloc_alloc (&ra); // The register becomes allocated in the current group // with its current value reg_alloc_preserve (&ra, op_t); // After a switch to another group, this make sure the // given registers are reloaded from the stack. reg_alloc_reload (&ra, ..., -1); See reggroups.[ch] for a start -=- Iterator based JIT compiler There is a new structure that can be either source or destination. It contains function pointers that know how to generate pixels. The function pointers: Source iterators: - begin(): generates code that runs before the kernel - begin_line(): generates code that runs before each scanline - get_one_pixel(): generates code that reads one pixel - get_two_pixels(): --- two pixels - get_four_pixels(): --- four pixels - get_eight_pixels(): --- eight pixels - get_sixteen_pixels(): --- - end_line(); - end(); An iterator is associated with a format: format_4_a8r8g8b8 format_16_a8 format_8_a8r8g8b8 and possibly others The iterator is supposed to return pixels in that format. It will never be asked to fetch more pixels than the format supports. Ie., get_sixteen_pixels() will never be called if the format is format_4_a8r8g8b8(). All the get_n_pixels() return a register where they left the pixel. There are also combiners. These are also associated with a format and know how to combine pixels in that format with each other. Destination iterators are responsible for driving a lot of the access? They have a method: process_scanline (src_iter, mask_iter, combiner) which is responsible for: - generating the inner loop, - calling src iter, - callling mask iter, - calling combiner - writing back They are specific to a format, and an intermediate format. There may also be setup()/setup_line() methods required. */