- AVX 512

  - More registers, optional operands, opmasks. Potential solutions:

    * Encode all the additional stuff into one new operand:

          OPTION (k0, z, broadcast)

    * Add pseudo-registers k0m, k1m, k2m, ..., k7m and k0z, k1z, k2z, ...,
      k7z that correspond to opmasks with zeroing/masking. This assumes
      that opmasks are only used when zeroing/masking is also allowed,
      which may not be true.

    * Just add new mandatory ops:

          vaddps, zmm, k*, m/z, zmm, zmm, option

      and require the user to specify, even when they want the default.

    * Support optional operands. Is this doable? It will require some
      intelligence in variant selection. But the optional args will have
      recognizable types, so it may not be impossible.

    * It looks like the optional operands are generally considered
      decorations on another existing operand.

  - New disp8 addressing scheme. Requires emit_reg_regm to deal with it and
    produce additional information to be stored in EVEX.

  - There may (or may not) be additional XMM and YMM registers, so we now
    need the ability to choose EVEX encoding on the fly. The way to do this
    is probably to introduce new optypes XMM_LO, YMM_LO, ZMM_LO(?), and
    then change XMM to be XMM_LO | XMM_HI. Ie., similar to the existing
    cases where some instructions can be encoded more efficiently when the
    register is ax, in this case some instructions can be encoded with VEX
    when the registers are *mm0-*mm15.

  - A problem is that some of the new instructions have the same name as
    existing ones, but a different number of arguments. For example

        VPGATHERDD zmm1 {k1}, vm32z          in AVX-512
    vs.
        VPGATHERDD xmm1, vm32x, xmm2         in AVX2.

    There are also things like

        VADDPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er}
    vs.
        VADDPS ymm1, ymm2, ymm3/m256

    A possibility is to just rename the AVX-512 instructions to use a vz
    prefix instead of v.

  - Size directives

    There are a number of cases where the op size can't be inferred from
    the instruction. For example,

        movzx eax, PTR (ebx)

    How many bytes should be zero-extended? In this case we have coped by
    simply adding movzx.8/16/32 variants. But there are other cases, like

        add PTR (ebx), imm8

    which can have four sizes too. That goes for all the ALU ops in fact.

    Proposed solution: add new OP_MEM8/16/32/64 and RIP_REL8/16/32/64 types
    that are designed such that they can be generated with

        BYTE_PTR  WORD_PTR  DWORD_PTR  QWORD_PTR

    macros that are appended to the INDEX/BASE/PTR/RIPREL macros.

    - The A_MEM will include the new OP_MEM so that you can give the size
      directive if you want to. And compute_op_size() will check that if
      you do, it is correct.

    - Some instructions will only allow the size-directed variants.

  - The pextrw instruction got a new variant in SSE 4.1 where the
    destination can now be either a memory location or a register. In
    earlier versions it could only be a register. That means this code

        pextrw eax, xmm1, IMM (7)

    should be encoded with the earlier version in preference to the newer
    one.

  - Consider changing labels to be integers instead of strings. This will
    make it possible in many cases for code to be stored in static const
    arrays.

  - Storing the encoding in the info field. Maybe rename info to encoding
    or details.

Feature checking

  A block of assembly should have a declared set of features that will be
  used. Then,

  - the assembler will check that unsupported instructions are not
    inadvertently used

  - at runtime, if the feature set is not supported by the CPU, it will
    report the error gracefully instead of SIGILL.

  - For code that is doing its own checking, escape hatches will be
    necessary.
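  A sketch of what the block-level declaration could look like. All of the
  names here (assembler_t, asm_begin_features, asm_end_features,
  asm_cpu_supports, the ASM_FEATURE_* flags) are hypothetical; only
  asm_emit() is mentioned elsewhere in these notes, and its return type is
  assumed:

      /* Hypothetical only: declare up front which ISA extensions the block
       * may use; the assembler flags anything outside the set, and the
       * caller checks the same mask at run time before executing. */
      #define BLOCK_FEATURES (ASM_FEATURE_SSE2 | ASM_FEATURE_SSSE3)

      static void *
      generate_fast_path (assembler_t *a)
      {
          if (!asm_cpu_supports (BLOCK_FEATURES))
              return NULL;        /* report gracefully instead of SIGILL */

          asm_begin_features (a, BLOCK_FEATURES);

          /* ... emit instructions; using, say, an AVX instruction here
           * would be reported as an error at assembly time ... */

          asm_end_features (a);

          return asm_emit (a);
      }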
A potential issue with this is that we may want to specialize at the
individual instruction level rather than at the full assembly block level.

Test suite:

- Register allocation braindump:

  The register allocator has these methods:

  - begin_spill_context (reg1, reg2, ...)
  - end_spill_context (reg1, reg2, ...)

  When you enter a spill context, all registers not mentioned in the
  arguments become eligible for spilling. When new registers are allocated,
  an eligible one may be spilled. When leaving a spill context, all
  registers allocated in the meantime (except those mentioned in the
  argument) are deallocated, and the registers that were spilled are moved
  back in from memory. If there are conflicts between registers that are
  preserved and those that are spilled, the preserved ones will be given
  new locations. Everything can fail, and if it does, the whole thing is
  aborted.

- Generate an intermediate array of code with temp variables, then run the
  register allocator on that?

  - Ie., an array of { char *mnemonic, uint64_t ops[4] }

  - Extend the instruction table to contain information about what is
    written and read. And what is clobbered.

  - Would also allow primitive optimizations such as eliminating
    mov-to-the-same-register.

  - Also would allow the possibility of automatically generating 2-operand
    instructions from 3-operand input to deal with both AVX and non-AVX.

  - Also instruction scheduling.

  - Some of this could be portable between architectures.

  Alternatively, we could make the ops 32 bit, and then pass immediates and
  labels as multiple ops. An issue with this is that RIP_REL can be used as
  a memory reference, which would mean memory references can be both one
  and two ops. If labels were just numbers, this problem would be easy
  enough. A sick hack would be to just compute a hash, and then have a
  debug mode where an error is printed if two labels actually collide. The
  problem though is that memory indices also need 64 bits.

  Even with 64 bit ops, there might be architectures where we can't rely on
  the pointers having room for the OP_ tag. Probably strings and immediates
  should just be interned to 24 bit numbers. They could just be stored in a
  table in the assembler.

done:
-=-=-=-=-=-=-=-

- CRC32 checksum of the generated machine code

Moving to array of uint64 instead of varargs

  - Avoids the issue of enums and 64 bit
  - Can still be formatted as if it's assembly language
  - If we also change labels to be integers, a lot of the time code can be
    statically generated and stored in static const arrays.
  - Opens the door to more complex code analysis, perhaps more advanced
    register allocation.

Alignment and values

  - Sketch of algorithm is described in a FIXME
  - Annotations should have a beginning and end
  - An annotation can just be 'code'
  - At emit time, the last annotation is closed
  - Every instruction in principle results in an annotation, but if both
    this and the previous annotation were just 'code', they are compressed.
  - This means we can now patch up the code by simply walking the list of
    annotations (a C sketch of this loop follows the list):

    - maintain a displacement
    - switch (annotation type) {
          code:  just copy
          jump:  expand if necessary, add to displacement
          align: has an associated 'skip'. Adjust this skip such that the
                 code will align, and change the displacement accordingly
          label: do nothing
      }
      add the displacement to both begin and end.
    - do this until nothing changes.
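    A C sketch of that walk, assuming a hypothetical annotation_t with a
    type, begin/end offsets and (for align annotations) a skip field. None
    of this is existing code - it only restates the list above, and
    jump_must_expand()/expand_jump() are placeholders for the real checks:

        typedef enum { ANN_CODE, ANN_JUMP, ANN_ALIGN, ANN_LABEL } ann_type_t;

        typedef struct
        {
            ann_type_t type;
            int        begin, end;  /* offsets into the code buffer */
            int        skip;        /* ANN_ALIGN only: bytes of padding */
            int        alignment;   /* ANN_ALIGN only */
        } annotation_t;

        static void
        patch_up (annotation_t *ann, int n_ann)
        {
            int changed;

            do
            {
                int displacement = 0;
                int i;

                changed = 0;

                for (i = 0; i < n_ann; ++i)
                {
                    /* this annotation's start shifts by whatever
                     * accumulated before it */
                    ann[i].begin += displacement;

                    switch (ann[i].type)
                    {
                    case ANN_CODE:
                    case ANN_LABEL:
                        break;                       /* just copy / nothing */

                    case ANN_JUMP:
                        if (jump_must_expand (&ann[i]))       /* placeholder */
                        {
                            displacement += expand_jump (&ann[i]);
                            changed = 1;
                        }
                        break;

                    case ANN_ALIGN:
                    {
                        /* adjust the skip so the code after the padding
                         * stays aligned at its new position */
                        int skip = (ann[i].alignment -
                                    ann[i].begin % ann[i].alignment) %
                                   ann[i].alignment;

                        displacement += skip - ann[i].skip;
                        changed |= (skip != ann[i].skip);
                        ann[i].skip = skip;
                        break;
                    }
                    }

                    /* its end additionally shifts by whatever the
                     * annotation itself grew */
                    ann[i].end += displacement;
                }
            } while (changed);     /* until nothing changes */
        }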
  - The stuff about values below is still good. If the user adds a value we
    have seen before, we should generally reuse it. However, if we haven't,
    it's probably better to keep it close to the actual code than storing
    it somewhere else. Maybe. Or maybe not.

- The assembler right now supports alignment, but it's a pain to keep this
  working when jumps can be converted to 8 bits.

- Also there is no support for reusing the same values in more than one
  generated fast path.

- So the assembler should just support ".var" pseudoinstructions that will
  be used to define values:

      .var32 "asdf"  0x00000000ffffffff0f0f0f0f0f000000000000000ffffffff0f0f0f0f0f000000
      .var16 "name1" 0xff00ff00ff00ff00ff00ff00ff00ff00
      .var8  "name2" 0xffff0000ffff0000
      .var4  "name3" 0xdeadbeef

  and so on. These are made available as labels, but there is no guarantee
  that they will be stored anywhere near where they are defined. (Except
  that they won't be more than 2GB away, so that they can be used with RIP
  relative addresses).

  Note that it has to be done this way. Just putting the values in some
  table in pixman itself won't necessarily ensure that the values are close
  enough for RIP relative addressing.

  The assembler will make sure only one copy of each value is stored per
  file in the code manager. The code and the data should probably grow from
  opposite ends of the file.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

New deal:

* Code is generated in a single-lane intermediate language with typed
  variables. Available types:

      un8  un8x4  pointer

* Each backend must decide how to represent those types
* Each backend will vectorize the types as best it can
* There is a "loop" instruction that takes a variable and a bound

For 8888x8x8888, the code looks like this:

    loop ("outer", h, height);
        pointer src_row = src + h * stride;
        pointer mask_row = mask + h * stride;
        pointer dest_row = dest + h * stride;

        loop ("inner", w, width);
            read_u8 (m, mask, 1, w);
            read_4_x_u8 (s, src, 4, w);
            broadcast4 (mmmm, m);
            read_4_x_u8 (d, dest, 4, w);
            mul (s, s, m);
            shuffle (sa, s, SHUFFLE_AAAA);
            xor (sa, sa, 0xff);
            mul_add (d, sa, s);
            write (d, dest);

AVX (and SSSE3) has palignr - maybe add something like that to the
intermediate format and emulate it on older instruction sets. Using it
requires an extra register per unaligned input though.

Useful optimizations:

- Peephole optimizations to eliminate redundant shuffles etc.

- Dead code elimination - in some cases we will likely end up computing
  stuff that is not used.

- Move-from-dead-register. Basically, x = y where y is dead should be
  eliminated. We are going to generate a number of these.

- Constant propagation could make generation of intermediate code simpler.

- Invariant code motion. Solids could then be generated in the loop itself,
  rather than being special cased.

Component alpha:

  Normal and component alpha can be treated largely the same way by having
  the combiner function take (src, alpha, dest), and generating alpha
  differently in the two cases:

      if (component_alpha)
      {
          alpha = src_alpha * mask;
          src = src * mask;
      }
      else
      {
          alpha = src_alpha x 4;
          src = src;
      }

The vector size for an operation is determined by

- the intermediate format

      a8  a8r8g8b8  a16r16g16b16  a16?

  and later on

      a32f r32f g32f b32f

- whether multiplications are involved, in which case we may need twice the
  room in the vector.

- how many pixels at a time.

We will never be able to support everything on everything - some things
will just have to fall back to interpretation. Floating point on an
MMX-only CPU for example. (I am not generating x87 code). It is also not
clear that we should care deeply about ARM < v7. If the generated code is
not super optimal, big deal.
So basically, a backend must be allowed to bail out. If it does, the
intermediate code will be interpreted instead.

Another case where bailing out may be required is MMX + 16 bit
intermediate. But an option here is for the back end to allocate two mmx
registers as one 16 byte register.

It is tempting to load the mask as a8a8a8a8, but it is probably better to
keep it simple and load it as a8x8x8x8, shuffle as necessary, then peephole
optimize later. For example

    m = m << 24
    m = shuffle (m, 3, 3, 3, 3);

can be peepholed into

    m = shuffle (m, 0, 0, 0, 0);

The required vector size is

    MAX (n_pixels * intermediate_size, 2 * intermediate_size)

The 2 * intermediate_size comes from the fact that the frontend won't deal
with shuffling across different registers. Ie., you can't put a 4 channel
pixel into two separate registers. The backend will have to fake a big
enough register if it doesn't want to bail. So while backends should report
their preferred vector size to allow the front end to determine a good
number of pixels per operation, they *will* be faced with 16 byte types,
including 4 * float and 4 * u32.

The max number of pixels is determined by

    if (pref_vs < 2 * intermediate_size)
        n_pixels = 1;
    else
        n_pixels = pref_vs / intermediate_size;

pref_vs has to be a multiple of 4. And the vector size that we load stuff
into is given by

    vector_size = MAX (n_pixels * intermediate_size, 2 * intermediate_size)

(And the "2" is only there if the op/intermediate combination requires
extra space for multiplication). Ie., the vector type to be used is
(vector_size / intermediate_size, intermediate_type).

The code generated is then

    while (w)
    {
        while (w >= 1 && !aligned_to_two_times_intermediate (dest))
        {
            w--;
        }
        while (w >= 2 && !aligned_to_four_times_intermediate (dest))
        {
            w--;
        }

        ...

        while (w >= n_pixels)
        {
            code (n_pixels, vector_size, intermediate_format, op,
                  src, mask, dest);
            w--;
        }
    }

ie.

    [while]
        for (n_pixels = 1; n_pixels <= max_pixels; n_pixels *= 2)
        {
            [while]
            [end while]
        }
    [end while]

Here is code to compute the various properties:

    static void
    compute_stuff (int                  op,
                   pixman_format_code_t src,
                   pixman_format_code_t mask,
                   pixman_format_code_t dest)
    {
        const char *intermediate;
        int pref_vs;
        int intermediate_size;
        int mult;
        int n_pixels;

        if (op == PIXMAN_OP_ADD     &&
            mask == PIXMAN_null     &&
            src == PIXMAN_a8        &&
            dest == PIXMAN_a8)
        {
            intermediate = "a8";
            intermediate_size = 1;
        }
        else if (PIXMAN_FORMAT_16BPC (src)  ||
                 PIXMAN_FORMAT_16BPC (mask) ||
                 PIXMAN_FORMAT_16BPC (dest))
        {
            intermediate = "a16r16g16b16";
            intermediate_size = 8;
        }
        else
        {
            intermediate = "a8r8g8b8";
            intermediate_size = 4;
        }

        if (strcmp (intermediate, "af8rf8gf8bf8") == 0)
        {
            mult = 1;
        }
        else
        {
            mult = 2;

            if (mask == PIXMAN_null)
            {
                if ((op == PIXMAN_OP_ADD)                         ||
                    (op == PIXMAN_OP_OVER && PIXMAN_A (src) == 0) ||
                    (op == PIXMAN_OP_SRC))
                {
                    mult = 1;
                }
            }

            for (pref_vs = 4; pref_vs <= 16; pref_vs *= 2)
            {
                int vector_size;

                if (pref_vs < mult * intermediate_size)
                    n_pixels = 1;
                else
                    n_pixels = pref_vs / intermediate_size;

                vector_size = MAX (n_pixels * intermediate_size,
                                   mult * intermediate_size);

                printf ("%2d vector type: <%d x %s> (%d pixels)\n",
                        pref_vs, vector_size / intermediate_size,
                        intermediate, n_pixels);
            }
        }
    }

Larrabee:

  Larrabee does not fit this pattern because it can't do arithmetic on 8
  bit channels; it always expands to at least 32 bits. On the other hand,
  it has 32 registers, so we could simply fake the un8 type with extra
  registers. A better approach is to have the backend give the number of
  pixels it can deal with for a given type.
The only problem with that is that the backend will have to artificially
limit the number of pixels per register since it has to be able to do
multiplications. Hmm, on the other hand, the front end will need to write
out registers to memory, so the packed representation is necessary. But
maybe only for output purposes? If the intermediate format has
Larrabee-like instructions, maybe the backends can optimize themselves?

We do need the typed vectors so that Larrabee can use the generous
representation internally for things like un8. (Ie., we can't just use 16
bit integers there and do the shift tricks in the frontend, because
Larrabee would prefer to do things differently).

For un8 types, maybe there should actually be two types: un8 and un8m,
where multiplications are only allowed on the 'm' variation. Another
possibility is to forget about the multiplication optimization and always
reserve enough space. For the vast majority of operations, we do need
multiplications, and for those where we don't we could just add special
fast paths.

This seems like the correct approach:

- Vector variables are typed; for example they could be un8 or u16, or
  f16, f64 etc.

- Conceptually, and as far as the frontend is concerned, the vectors are
  fully packed, so if you write them to memory, they will be stored in a
  packed format. Ie., the front end will consider a vector of 16 un8s to
  be 16 bytes wide.

- Internally, the backend may choose a different representation (and
  almost all backends will need to use 16 bits for the un8 type).

- When vectors are written to memory, the backends will need to do
  whatever packing and unpacking is required to undo their internal
  representation. The SSE2 backend would want to optimize two memory
  stores into one pack + move, rather than two pack-with-0-then-store.

One issue with this is what to do about several multiplications in a row.
These can be optimized because there is no need to divide by 255 until all
the adds have taken place. Of course, if a backend wishes to, it may
consider the internal type "un88", even when the external type is un8.
That just means that in addition to packing, it first has to divide by 255
(a scalar sketch of that division is given below). And to do this it has
to prove that the intermediate results cannot overflow.

Another issue is what to do with formats like 565 or a1 or weirder. And is
it possible to multiply two vectors together with different types? If it
is, what is the resulting type? The answer is likely that you can't
multiply vectors with different types, so the front end will need to
decide on an intermediate format, and generate code to load it. This means
we will need to be able to convert from n of one type to n of another.

r5g6b5 should probably be treated specially because the channels have
different widths. Maybe we just need a convert-r5g6b5-to-8888 instruction.
With typed vectors, it is quite difficult to generate code to convert it.

What about a8?

- find out how many un8s we can deal with
- divide by 4
- load that many u8s
- convert to u32
- shift/mask/etc.

Algorithm:

- Determine the intermediate format (max of all the involved formats,
  though at least un8).

- Given the intermediate format, determine n_pixels for source, mask and
  destination. Take the minimum.

- src  = load (4 * n_pixel, format)
  mask = load (4 * n_pixel, format)
  dest = load (4 * n_pixel, format)

It is the backend's problem how it wants to represent those internally.
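For reference, a scalar sketch of what a single un8 multiplication looks
like once a backend has widened the channel to 16 bits - the usual
pixman-style approximate division by 255. It is shown only to illustrate
why 16 bit lanes are enough headroom and why the division can be deferred
(as long as the deferred sums provably still fit, per the overflow remark
above):

    #include <stdint.h>

    /* un8 * un8 with rounding: t = a*b + 0x80; result = (t + (t >> 8)) >> 8.
     * The intermediate t never exceeds 16 bits, so a backend can keep an
     * un8 channel in a 16 bit lane and postpone the >>8 steps until after
     * a series of additions (the "un88" idea above). */
    static inline uint8_t
    mul_un8 (uint8_t a, uint8_t b)
    {
        uint16_t t = (uint16_t)a * b + 0x80;

        return (uint8_t)((t + (t >> 8)) >> 8);
    }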
For example, if there is no mask and we are compositing a8r8g8b8 x
a8r8g8b8, then the SSE2 backend might report that it can deal with 4
pixels. Internally, one 4 pixel IR register would be stored in 2 xmm
registers, unpacked. (Problem: a simple-minded liveness analysis will not
know that the two xmm registers could die separately). Though there could
be a "split" pass that would split such variables into two. Or there could
be IR instructions that would load two registers. Though, how would the
frontend know to do that? And how would the split pass deal with loads?

The intermediate format is probably always given as "un16" or "f32" or
somesuch (as opposed to a16r16g16b16 etc.). All the backends need to do
then is to provide a backend_get_n_pixels() that returns the number of
pixels it can handle for a given type. Or the number of lanes, really.

It is likely to be useful to allow backends to do their own rewriting
before register allocation. For example, the arm backend will probably
want to replace a shuffle-multiply with something that it can use to
generate normal multiplication code with. Similarly, the mmx backend can
only do word-level shuffling, so it will want to replace

    shuffle ()
    expand ()

with

    expand
    shuffle ()

These types of things should be done on the dependency graph of the
intermediate code. (And should therefore be restricted to things that are
semantically *identical*, not just equivalent).

    t = shuffle_byte (s, ...);
    v = expand (t)

can be transformed into

    t = expand (s)
    v = shuffle (t)

but only if there are no other dependencies on t. Alternatively, it could
be turned into

    t = shuffle_byte (s, ...);
    tt = expand (s);
    v = shuffle (tt);

with the hope that the initial shuffle_byte would then be dead. The first
one seems simpler:

- no termination arguments necessary
- no dead code elimination
- potentially less powerful, but meh.

Intermediate format

  Benefits:

  - Can be interpreted. Eventually only the jit code should be necessary
  - Multiple backends with different size registers
  - Register allocation
  - Optimization?
  - Typed variables

  Should there be a 0.8 type? The intel (and other) backend can do the
  expanding itself, and ARM can do more efficient multiplication.

  intel:

      src, mask loaded
      nmask = xor (mask, 0xff)
      s = src * mask
          expand (src) -> src1, src2
          expand (mask) -> mask1, mask2
          pixmul
          pixmul
          pack -> s
      d = dest * nmask
          expand (dest) -> ...
          pack -> d
      nmask = shuffle (src,
      s = d (+) s
      store s in dest

  which is just as good as before. In fact it's a little better because we
  save a negation instruction. Actually, we should keep the computed alpha
  around, and then negate afterwards, and then multiply. So there will be
  extra shuffling going on with this scheme.

  Arm:

      src, mask loaded
      nmask = xor (mask, 0xff)
      s = src * mask:
          uxtb16   s_low, src
          uxtb16   s_hi, src, ror 8
          mla      s_low, s_low, mask, 0x80
          mla      s_hi, s_hi, mask, 0x80
          uxtab16  s_low, s_low, s_low, rot 8
          uxtab16  s_hi, s_hi, s_hi, rot 8
          and      s_hi, 0xff00ff00
          uxtab16  s_low, s_hi, s_low, rot 8

One difference is that Arm wants the mask as a single byte whereas intel
wants it duplicated. The correct solution may be to simply read it as an
ARGB; then intel can shuffle and Arm can do its own multiply with its
builtin shifting.

Another possibility is to add an IR instruction to multiply a vector with
one number. On intel this would result in a pshuf + multiply. On ARM it
could be done with just two multiplications. On intel we would actually
need to multiply with a number given in a field of another vector.
Unfortunately, this can't be implemented on ARM without assuming that
there is room for overflow. An intelligent instruction selector on ARM
would probably be a big benefit.
Or just assume that NEON is the future. It seems saner than v6 SIMD.

- Three-register instructions

  It is what future x86s will have, it is what ARM wants, it is much
  easier to generate code for, and it is easy to turn three-register code
  into two-register code, but not the other way around.

Older notes:

- The generated ops should have a simpler prototype than the normal one.
  Maybe something like:

      void (* CompositeOp) (uint32_t *src_start,
                            uint32_t  src_skip,   /* = src_stride - (width * src_bpp) / 8 */
                            uint32_t *mask_start,
                            uint32_t  mask_skip,
                            uint32_t *dest_start,
                            uint32_t  dest_skip,
                            uint16_t  width,
                            uint16_t  height);

  The amount of setup in generated code should be minimized, both to
  simplify code generation and to reduce the memory overhead of the code
  generation.

  For ops where source or mask is solid, src/mask_start should point to an
  8888 pixel arranged similarly to the dest format. Ie., unpacking should
  happen before the op is called.

  If we add transformations and filters, they can be added at the end of
  the argument list - that way the code won't have to change too much. The
  problem with this though is that we need essentially random access to do
  transformations. It also doesn't work for gradients.

- It should be figured out how to deal with alignment restrictions. There
  is movdqa, which requires 16 byte alignment, and there is movdqu, which
  doesn't. Agner says movdqu is slow and may better be replaced with two
  movq's. Maybe it's best to just always use correctly aligned reads:

  - Generate a version for every combination of alignments?
  - Dynamically determine the instruction to use?
  - This is not necessarily that slow since branch prediction will usually
    get it right.

  In all cases we will need to be able to handle single pixels. So maybe
  start out just generating those.

  The optimal thing to do would be (a C sketch of the resulting loop
  structure follows at the end of this section):

  - Compute n_pixels (sse, op, src_format, mask_format, dest_format), the
    number of pixels that fits in a register.

  - For each line compute the initial unaligned strip of pixels. If this
    is the same for all three, then set sw to that number, otherwise set
    it to the full width.

        inner_loop:
            tmp_wid = width;        (with tmp_wid on the stack)
            w = that_number;
            tmp_wid -= sw;

  - while (sw--) { do_one_pixel; }

    Then compute the number of aligned pixels that can be done, and

        while (aw--) { do_full_size (); }

    Finally

        while (w--) { do_one_pixel; }

  The "do_one_pixel" loop can be in a separate function to avoid
  duplicating the code. In fact, both the one-pixel and the full-size
  version could be the same code with a parameter to determine the
  difference. Branch prediction would make this relatively inexpensive.
  Although, the full size code is quite different since it has to deal
  with packed pixels.

  Also we need the images to be identically aligned - it's not good enough
  to only align say the source to 16 bytes. It may be worth having an
  initial pass that deals with any leading and trailing unaligned data.

Dealing with alignment

  - Size of operation: can be 1 or 2 bytes (depending on whether it's
    multiplication or just addition)

  - available_channels = vector_size / op_size;
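  As promised above, a plain C model of the per-scanline structure being
  described: leading pixels one at a time until aligned, then n_pixels at a
  time, then the trailing remainder. do_one_pixel() and do_full_size()
  stand in for the generated code, and this sketch only checks the
  destination's alignment - whether the sources need the same treatment is
  the open question discussed here:

      #include <stdint.h>

      static void
      composite_line (uint8_t *dest, int width, int dest_bpp, int n_pixels)
      {
          int w = width;

          /* leading pixels, one at a time, until dest is 16 byte aligned */
          while (w > 0 && ((uintptr_t)dest & 15) != 0)
          {
              do_one_pixel ();
              dest += dest_bpp;
              w--;
          }

          /* aligned body, n_pixels at a time */
          while (w >= n_pixels)
          {
              do_full_size ();
              dest += n_pixels * dest_bpp;
              w -= n_pixels;
          }

          /* trailing pixels */
          while (w > 0)
          {
              do_one_pixel ();
              dest += dest_bpp;
              w--;
          }
      }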
  In principle we could always read in as many pixels as possible from the
  shortest input. Ie., if we have an a8 mask, composite 16 pixels at a
  time, unrolling as necessary. The register pressure is likely to make
  this a loss though:

      1 register set to 0
      1 register to hold the 16 mask pixels
      1 register to hold the 4 src pixels
      1 register to hold the 4 dest pixels
      1 register to hold two expanded src pixels
      1 register to hold two expanded dest pixels
      1 register to hold two expanded alpha pixels

  That's 8 registers already, which is all we have got unless we are in 64
  bit mode. We could make use of mmx registers too - the currently
  inactive pixels could be held there. Ie.,

      movdqa   *src, xmm0
      movdq2q  xmm0, mm0
      psrldq   xmm0, 64
      movdq2q  xmm0, mm1

      movdqa   *mask, xmm0
      movdq2q  xmm0, mm2
      psrldq   xmm0, 64
      movdq2q  xmm0, mm3

      movdqa   *mask, xmm0
      movdq2q  xmm0, mm2
      psrldq   xmm0, 64
      movdq2q  xmm0, mm3

A simple, but probably pretty good scheme:

- Generate two versions of each op, one where everything is aligned, and
  one where alignment is detected on the fly. In both cases n_pixels is
  computed, the number of pixels to handle per iteration. In the aligned
  case, we then just read in that many pixels as efficiently as possible.
  For the first iteration, that probably means 2 pixels in many cases, but
  eventually it would be nice to unroll once to get to four pixels.

  So both versions have a preamble that reads the source, mask and
  destination into sse registers. Then afterwards, the computations are
  the same, then finally the two different versions generate the final
  write to the destination.

Another possibility:

- Deal with unaligned or short lines in separate loops before and after
  the aligned loop. Could be in a function.

Current thinking:

- Just ensure the destination is aligned, and use movdqu for sources.

Computing the rendering:

- Reserve three registers for 0x00.., 0x0080.. and 0x00ff

- Compute (src IN mask) and (srca), where (src IN mask) is returned as a
  packed vector, and srca1 and srca2 are expanded. (srca is only computed
  if actually needed).

- Read dest into one of the remaining two registers

- In the last register expand the low part of d and combine it onto src1

- In the last register expand the high part of d and combine it onto src2

  At this point src1 and src2 represent the 'fd' that will get added onto s

- Pack src1 and src2 together in src1 (which is now the fd we will add to s)
- Unpack the low part of s into src2 (mov s, s2; unpack z, s2)
- Unpack the low part of d into the last register
- Shuffle the last register to become dest alpha
- Combine onto src2
- Unpack the high part of s into the last register
- Unpack the high part of d into itself
- Combine onto s
- Pack s and src2 into src2
- Add src1 and src2
- Store the result.

Unfortunately, we are one register short of being able to do F_a's and
F_b's where both src_a and dst_a are involved.

Note: there is no integer division in SSE, so this would possibly have to
be done with floating point. Ie., using cvtdq2ps.

- It is important that the register allocator is not too dumb

  - EAX is the only register we can use for multiplication
  - There are not a lot of registers.
  - Need to do something sensible when we run out of registers. Spilling
    to memory is a possibility - just giving up is another.

We need the ability to unallocate a register, and then use it in a
following instruction. That way a register can be reused in the middle of
an instruction. (Moves from a register to itself should be culled). This
probably also implies that we need the ability to ref count registers, and
that callees will often have to take ownership of a passed-in register (a
small sketch of what that could look like follows below).
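A minimal sketch of what ref counted register handles might look like. All
names here (reg_t, reg_ref, reg_unref, free_register) are hypothetical;
nothing like this exists in the assembler yet:

    /* Hypothetical ref counted register handle. A callee that wants to
     * keep a register beyond the current instruction takes a reference;
     * when the count drops to zero the register becomes available again. */
    typedef struct
    {
        int regno;       /* hardware register number */
        int ref_count;
    } reg_t;

    static reg_t *
    reg_ref (reg_t *reg)
    {
        reg->ref_count++;
        return reg;
    }

    static void
    reg_unref (reg_t *reg)
    {
        if (--reg->ref_count == 0)
            free_register (reg->regno);   /* hypothetical allocator hook */
    }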
Pixels are always 1, 2, 4, or 8 bytes wide, which means we can make good
use of the x86 addressing mode: only one register is necessary for the X
coordinate, and we can then index into the images using displacements and
shifts:

    while (h--)
    {
        s_line = src_pixels + h * s_stride;
        m_line = mask_pixels + h * m_stride;
        d_line = dest_pixels + h * d_stride;

        while (w--)
        {
            s = s_line + w * s_bpp;
            m = m_line + w * m_bpp;
            d = d_line + w * d_bpp;
        }
    }

    Registers: h, w, s_line, m_line, d_line

Or:

    s = src_pixels;
    m = mask_pixels;
    d = dest_pixels;

    while (h--)
    {
        while (w--)
        {
            [s + w * bpp]
            [d + w * bpp]
            [m + w * bpp]
        }

        s += s_stride;
        m += m_stride;
        d += d_stride;
    }

    Registers: h, w, s, d, m, s_stride, m_stride, d_stride.

- Optimistic Register Allocator

  Simply hand out registers as code asks for them. If we run out of
  registers, then spill a register somewhere on the stack. (At emit time
  we will make room for spills). When a spilled variable is being used,
  some random other variable is stored on the stack, and the used variable
  gets the register. Possibly using an LRU algorithm to decide the
  eviction.

  Before jumps and before labels, a 'normalization' is run where things
  are put into the locations they are supposed to be in. Ie., all
  variables have two fields: where *are* they, and where do they *belong*.
  The invariant is that at entry to any basic block, variables are in
  their assigned location. Combined with not register allocating
  constants, this really should be good enough.

- Code generator / runtime assembler

  Eventually, the x86_codegen.h file should be folded into codex86.c.
  Things like membase emission can be done straight from the ops.

- Need to check malloc() returns.

- An SSE/MMX register allocator is probably required. It would be nice if
  we could figure out up front how many registers we are going to need so
  that constants can be register allocated and hoisted out of the loop
  when possible. Have a "dry run" mode that just finds out which registers
  are used?

- Should probably swap the arguments to memindex and membase to make them
  look more like at&t syntax. Also find out where to put immediate
  operands. Maybe I need to face the fact that at&t syntax is crack.

- The code could be made smaller and faster if op_t were made smaller by
  compressing fields into uint8_t's. This would cause gdb to not show
  registers as enums though.

- Decide on syntax: AT&T vs Intel? Right now we are mostly at&t, but
  that's kind of weird for things like test and cmp.

- Public API: pixman-sse-jit.h:

      pixman_sse_jit_t *pixman_sse_jit_new ();

      pixman_sse_code_t *pixman_sse_jit_get_op (pixman_sse_jit_t *jit,
                                                pixman_op_t op,
                                                const pixman_sse_image_info *src,
                                                const pixman_sse_image_info *mask,
                                                const pixman_sse_image_info *dest);

      void pixman_sse_code_run (pixman_sse_code_t *code,
                                pixman_op_t op,
                                pixman_image_t *src,
                                pixman_image_t *mask,
                                pixman_image_t *dest,
                                int x_src, int y_src,
                                int x_mask, int y_mask,
                                int x_dest, int y_dest,
                                uint16_t width, uint16_t height);

  where pixman_sse_image_info would contain information about the format,
  whether the image is solid, and whether the image is there at all (for
  the mask).
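  A sketch of how a caller might use that proposed API. Nothing here
  exists yet - the function names come from the prototypes above, and
  src_info/mask_info/dest_info and the image variables are placeholders:

      pixman_sse_jit_t  *jit;
      pixman_sse_code_t *code;

      jit = pixman_sse_jit_new ();

      /* src_info, mask_info and dest_info describe format, solidity and
       * whether the image is present at all */
      code = pixman_sse_jit_get_op (jit, PIXMAN_OP_OVER,
                                    &src_info, &mask_info, &dest_info);

      if (code)
      {
          pixman_sse_code_run (code, PIXMAN_OP_OVER,
                               src_image, mask_image, dest_image,
                               src_x, src_y, mask_x, mask_y,
                               dest_x, dest_y, width, height);
      }
      else
      {
          /* fall back to the existing non-jitted paths */
      }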
DONE:

- Pixman CPU detection should be generated dynamically. That will get rid
  of the annoying #ifdefs and getisax() stuff.

- Backwards vs. forwards

  The code currently in testjit iterates backwards over each line. It may
  be a little better to go forward. This could be done by

  - initializing the line to (line + w * bpp)
  - initializing w to -width
  - not having a displacement:

        movq (line, w, bpp), xmm0

    and

        add 2, width

  Current code iterates forwards.

- The memindex/membase should take ops, not reg numbers.

- There should only be REG and MEM in the ops. emit_memindex() can handle
  everything.

- Assembler needs to deal with labels. API:

      x86_jz (a, "no_cpuid");

      asm_label (a, "no_cpuid");

  The assembler would just store a big list of labels/positions, then
  patch up at asm_emit() time (a rough sketch of that bookkeeping follows
  at the end of these notes).

- REG ops can contain any register whatsoever (xmm, mmx, eax, al etc.).
  There should be a separate op_get_regno() function to extract the
  register number from an op.

- There should be eax() type functions.

- Renaming mov to movl was a mistake. The assembler should be able to
  figure out the size of the operation by itself. (By looking at the
  involved registers).
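  The label bookkeeping sketch mentioned above - the structs and the
  patching function are hypothetical, shown only to illustrate the
  store-then-patch idea (and it assumes 4 byte rel32 jump offsets):

      #include <stdint.h>
      #include <string.h>

      typedef struct
      {
          const char *name;
          int         offset;        /* byte offset of the label */
      } label_def_t;

      typedef struct
      {
          const char *name;
          int         patch_offset;  /* where the rel32 field starts */
      } label_use_t;

      static void
      patch_labels (uint8_t *code,
                    label_def_t *defs, int n_defs,
                    label_use_t *uses, int n_uses)
      {
          int i, j;

          for (i = 0; i < n_uses; ++i)
          {
              for (j = 0; j < n_defs; ++j)
              {
                  if (strcmp (uses[i].name, defs[j].name) == 0)
                  {
                      /* rel32 is relative to the end of the 4 byte field */
                      int32_t rel = defs[j].offset - (uses[i].patch_offset + 4);

                      memcpy (code + uses[i].patch_offset, &rel, sizeof rel);
                      break;
                  }
              }
          }
      }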