- AVX 512

  - More registers, optional operands, opmasks. Potential solutions:

    * Encode all the additional stuff into one new operand:

          OPTION (k0, z, broadcast)

    * Add pseudo-registers k0m, k1m, k2m, ..., k7m and k0z, k1z, k2z, ...,
      k7z that correspond to opmasks with zeroing/masking. This assumes
      that opmasks are only used when zeroing/masking is also allowed,
      which may not be true.

    * Just add new mandatory ops:

          vaddps, zmm, k*, m/z, zmm, zmm, option

      and require the user to specify, even when they want the default.

    * Support optional operands. Is this doable? It will require some
      intelligence in variant selection. But the optional args will have
      recognizable types, so it may not be impossible.

    * It looks like the optional operands are generally considered
      decorations on another existing operand.

  - New disp8 addressing scheme. Requires emit_reg_regm to deal with it and
    produce additional information to be stored in EVEX.

  - There may (or may not) be additional XMM and YMM registers, so we now
    need the ability to choose EVEX encoding on the fly. The way to do this
    is probably to introduce new optypes XMM_LO, YMM_LO, ZMM_LO(?), and
    then change XMM to be XMM_LO | XMM_HI. Ie., similar to the existing
    cases where some instructions can be encoded more efficiently when the
    register is ax, in this case some instructions can be encoded with VEX
    when the registers are *mm0-*mm15.

  - A problem is that some of the new instructions have the same name as
    existing ones, but a different number of arguments. For example

        VPGATHERDD zmm1 {k1}, vm32z          in AVX-512
    vs.
        VPGATHERDD xmm1, vm32x, xmm2         in AVX2.

    There are also things like

        VADDPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er}
    vs.
        VADDPS ymm1, ymm2, ymm3/m256

    A possibility is to just rename the AVX-512 instructions to use a vz
    prefix instead of v.

  - Size directives

    There are a number of cases where the op size can't be inferred from
    the instruction. For example,

        movzx eax, PTR (ebx)

    How many bytes should be zero-extended? In this case we have coped by
    simply adding movzx.8/16/32 variants. But there are other cases, like

        add PTR (ebx), imm8

    which can have four sizes too. That goes for all the ALU ops in fact.

    Proposed solution: add new OP_MEM8/16/32/64 and RIP_REL8/16/32/64 types
    that are designed such that they can be generated with

        BYTE_PTR  WORD_PTR  DWORD_PTR  QWORD_PTR

    macros that are appended to the INDEX/BASE/PTR/RIPREL macros.

    - The A_MEM will include the new OP_MEM so that you can give the size
      directive if you want to. And compute_op_size() will check that if
      you do, it is correct.

    - Some instructions will only allow the size-directed variants.

  - The pextrw instruction got a new variant in SSE 4.1 where the
    destination can now be either a memory location or a register. In
    earlier versions it could only be a register. That means this code

        pextrw eax, xmm1, IMM (7)

    should be encoded with the earlier version in preference to the newer
    one.

  - Consider changing labels to be integers instead of strings. This will
    make it possible in many cases for code to be stored in static const
    arrays.

  - Storing the encoding in the info field. Maybe rename info to encoding
    or details.

Feature checking

  A block of assembly should have a declared set of features that will be
  used. Then,

  - the assembler will check that unsupported instructions are not
    inadvertently used

  - at runtime, if the feature set is not supported by the CPU, it will
    report the error gracefully instead of SIGILL.

  - For code that is doing its own checking, escape hatches will be
    necessary.
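  A sketch of what the block-level declaration could look like. All of the
  names here (assembler_t, asm_begin_features, asm_end_features,
  asm_cpu_supports, the ASM_FEATURE_* flags) are hypothetical; only
  asm_emit() is mentioned elsewhere in these notes, and its return type is
  assumed:

      /* Hypothetical only: declare up front which ISA extensions the block
       * may use; the assembler flags anything outside the set, and the
       * caller checks the same mask at run time before executing. */
      #define BLOCK_FEATURES (ASM_FEATURE_SSE2 | ASM_FEATURE_SSSE3)

      static void *
      generate_fast_path (assembler_t *a)
      {
          if (!asm_cpu_supports (BLOCK_FEATURES))
              return NULL;        /* report gracefully instead of SIGILL */

          asm_begin_features (a, BLOCK_FEATURES);

          /* ... emit instructions; using, say, an AVX instruction here
           * would be reported as an error at assembly time ... */

          asm_end_features (a);

          return asm_emit (a);
      }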
A potential issue with this is that we may want to specialize at the
individual instruction level rather than at the full assembly block level.

Test suite:

- Register allocation braindump:

  The register allocator has these methods:

  - begin_spill_context (reg1, reg2, ...)
  - end_spill_context (reg1, reg2, ...)

  When you enter a spill context, all registers not mentioned in the
  arguments become eligible for spilling. When new registers are allocated,
  an eligible one may be spilled. When leaving a spill context, all
  registers allocated in the meantime (except those mentioned in the
  argument) are deallocated, and the registers that were spilled are moved
  back in from memory. If there are conflicts between registers that are
  preserved and those that are spilled, the preserved ones will be given
  new locations. Everything can fail, and if it does, the whole thing is
  aborted.

- Generate an intermediate array of code with temp variables, then run the
  register allocator on that?

  - Ie., an array of { char *mnemonic, uint64_t ops[4] }

  - Extend the instruction table to contain information about what is
    written and read. And what is clobbered.

  - Would also allow primitive optimizations such as eliminating
    mov-to-the-same-register.

  - Also would allow the possibility of automatically generating 2-operand
    instructions from 3-operand input to deal with both AVX and non-AVX.

  - Also instruction scheduling.

  - Some of this could be portable between architectures.

  Alternatively, we could make the ops 32 bit, and then pass immediates and
  labels as multiple ops. An issue with this is that RIP_REL can be used as
  a memory reference, which would mean memory references can be both one
  and two ops. If labels were just numbers, this problem would be easy
  enough. A sick hack would be to just compute a hash, and then have a
  debug mode where an error is printed if two labels actually collide. The
  problem though is that memory indices also need 64 bits.

  Even with 64 bit ops, there might be architectures where we can't rely on
  the pointers having room for the OP_ tag. Probably strings and immediates
  should just be interned to 24 bit numbers. They could just be stored in a
  table in the assembler.

done:
-=-=-=-=-=-=-=-

- CRC32 checksum of the generated machine code

Moving to array of uint64 instead of varargs

  - Avoids the issue of enums and 64 bit
  - Can still be formatted as if it's assembly language
  - If we also change labels to be integers, a lot of the time code can be
    statically generated and stored in static const arrays.
  - Opens the door to more complex code analysis, perhaps more advanced
    register allocation.

Alignment and values

  - Sketch of algorithm is described in a FIXME
  - Annotations should have a beginning and end
  - An annotation can just be 'code'
  - At emit time, the last annotation is closed
  - Every instruction in principle results in an annotation, but if both
    this and the previous annotation were just 'code', they are compressed.
  - This means we can now patch up the code by simply walking the list of
    annotations (a C sketch of this loop follows the list):

    - maintain a displacement
    - switch (annotation type) {
          code:  just copy
          jump:  expand if necessary, add to displacement
          align: has an associated 'skip'. Adjust this skip such that the
                 code will align, and change the displacement accordingly
          label: do nothing
      }
      add the displacement to both begin and end.
    - do this until nothing changes.
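    A C sketch of that walk, assuming a hypothetical annotation_t with a
    type, begin/end offsets and (for align annotations) a skip field. None
    of this is existing code - it only restates the list above, and
    jump_must_expand()/expand_jump() are placeholders for the real checks:

        typedef enum { ANN_CODE, ANN_JUMP, ANN_ALIGN, ANN_LABEL } ann_type_t;

        typedef struct
        {
            ann_type_t type;
            int        begin, end;  /* offsets into the code buffer */
            int        skip;        /* ANN_ALIGN only: bytes of padding */
            int        alignment;   /* ANN_ALIGN only */
        } annotation_t;

        static void
        patch_up (annotation_t *ann, int n_ann)
        {
            int changed;

            do
            {
                int displacement = 0;
                int i;

                changed = 0;

                for (i = 0; i < n_ann; ++i)
                {
                    /* this annotation's start shifts by whatever
                     * accumulated before it */
                    ann[i].begin += displacement;

                    switch (ann[i].type)
                    {
                    case ANN_CODE:
                    case ANN_LABEL:
                        break;                       /* just copy / nothing */

                    case ANN_JUMP:
                        if (jump_must_expand (&ann[i]))       /* placeholder */
                        {
                            displacement += expand_jump (&ann[i]);
                            changed = 1;
                        }
                        break;

                    case ANN_ALIGN:
                    {
                        /* adjust the skip so the code after the padding
                         * stays aligned at its new position */
                        int skip = (ann[i].alignment -
                                    ann[i].begin % ann[i].alignment) %
                                   ann[i].alignment;

                        displacement += skip - ann[i].skip;
                        changed |= (skip != ann[i].skip);
                        ann[i].skip = skip;
                        break;
                    }
                    }

                    /* its end additionally shifts by whatever the
                     * annotation itself grew */
                    ann[i].end += displacement;
                }
            } while (changed);     /* until nothing changes */
        }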
  - The stuff about values below is still good. If the user adds a value we
    have seen before, we should generally reuse it. However, if we haven't,
    it's probably better to keep it close to the actual code than storing
    it somewhere else. Maybe. Or maybe not.

- The assembler right now supports alignment, but it's a pain to keep this
  working when jumps can be converted to 8 bits.

- Also there is no support for reusing the same values in more than one
  generated fast path.

- So the assembler should just support ".var" pseudoinstructions that will
  be used to define values:

      .var32 "asdf"  0x00000000ffffffff0f0f0f0f0f000000000000000ffffffff0f0f0f0f0f000000
      .var16 "name1" 0xff00ff00ff00ff00ff00ff00ff00ff00
      .var8  "name2" 0xffff0000ffff0000
      .var4  "name3" 0xdeadbeef

  and so on. These are made available as labels, but there is no guarantee
  that they will be stored anywhere near where they are defined. (Except
  that they won't be more than 2GB away, so that they can be used with RIP
  relative addresses).

  Note that it has to be done this way. Just putting the values in some
  table in pixman itself won't necessarily ensure that the values are close
  enough for RIP relative addressing.

  The assembler will make sure only one copy of each value is stored per
  file in the code manager. The code and the data should probably grow from
  opposite ends of the file.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

New deal:

* Code is generated in a single-lane intermediate language with typed
  variables. Available types:

      un8  un8x4  pointer

* Each backend must decide how to represent those types
* Each backend will vectorize the types as best it can
* There is a "loop" instruction that takes a variable and a bound

For 8888x8x8888, the code looks like this:

    loop ("outer", h, height);
        pointer src_row = src + h * stride;
        pointer mask_row = mask + h * stride;
        pointer dest_row = dest + h * stride;

        loop ("inner", w, width);
            read_u8 (m, mask, 1, w);
            read_4_x_u8 (s, src, 4, w);
            broadcast4 (mmmm, m);
            read_4_x_u8 (d, dest, 4, w);
            mul (s, s, m);
            shuffle (sa, s, SHUFFLE_AAAA);
            xor (sa, sa, 0xff);
            mul_add (d, sa, s);
            write (d, dest);

AVX (and SSSE3) has palignr - maybe add something like that to the
intermediate format and emulate it on older instruction sets. Using it
requires an extra register per unaligned input though.

Useful optimizations:

- Peephole optimizations to eliminate redundant shuffles etc.

- Dead code elimination - in some cases we will likely end up computing
  stuff that is not used.

- Move-from-dead-register. Basically, x = y where y is dead should be
  eliminated. We are going to generate a number of these.

- Constant propagation could make generation of intermediate code simpler.

- Invariant code motion. Solids could then be generated in the loop itself,
  rather than being special cased.

Component alpha:

  Normal and component alpha can be treated largely the same way by having
  the combiner function take (src, alpha, dest), and generating alpha
  differently in the two cases:

      if (component_alpha)
      {
          alpha = src_alpha * mask;
          src = src * mask;
      }
      else
      {
          alpha = src_alpha x 4;
          src = src;
      }

The vector size for an operation is determined by

- the intermediate format

      a8  a8r8g8b8  a16r16g16b16  a16?

  and later on

      a32f r32f g32f b32f

- whether multiplications are involved, in which case we may need twice the
  room in the vector.

- how many pixels at a time.

We will never be able to support everything on everything - some things
will just have to fall back to interpretation. Floating point on an
MMX-only CPU for example. (I am not generating x87 code). It is also not
clear that we should care deeply about ARM < v7. If the generated code is
not super optimal, big deal.
So basically, a backend must be allowed to bail out. If it does, the
intermediate code will be interpreted instead.

Another case where bailing out may be required is MMX + 16 bit
intermediate. But an option here is for the back end to allocate two mmx
registers as one 16 byte register.

It is tempting to load the mask as a8a8a8a8, but it is probably better to
keep it simple and load it as a8x8x8x8, shuffle as necessary, then peephole
optimize later. For example

    m = m << 24
    m = shuffle (m, 3, 3, 3, 3);

can be peepholed into

    m = shuffle (m, 0, 0, 0, 0);

The required vector size is

    MAX (n_pixels * intermediate_size, 2 * intermediate_size)

The 2 * intermediate_size comes from the fact that the frontend won't deal
with shuffling across different registers. Ie., you can't put a 4 channel
pixel into two separate registers. The backend will have to fake a big
enough register if it doesn't want to bail. So while backends should report
their preferred vector size to allow the front end to determine a good
number of pixels per operation, they *will* be faced with 16 byte types,
including 4 * float and 4 * u32.

The max number of pixels is determined by

    if (pref_vs < 2 * intermediate_size)
        n_pixels = 1;
    else
        n_pixels = pref_vs / intermediate_size;

pref_vs has to be a multiple of 4. And the vector size that we load stuff
into is given by

    vector_size = MAX (n_pixels * intermediate_size, 2 * intermediate_size)

(And the "2" is only there if the op/intermediate combination requires
extra space for multiplication). Ie., the vector type to be used is
(vector_size / intermediate_size, intermediate_type).

The code generated is then

    while (w)
    {
        while (w >= 1 && !aligned_to_two_times_intermediate (dest))
        {
            w--;
        }
        while (w >= 2 && !aligned_to_four_times_intermediate (dest))
        {
            w--;
        }

        ...

        while (w >= n_pixels)
        {
            code (n_pixels, vector_size, intermediate_format, op,
                  src, mask, dest);
            w--;
        }
    }

ie.

    [while]
        for (n_pixels = 1; n_pixels <= max_pixels; n_pixels *= 2)
        {
            [while]
            [end while]
        }
    [end while]

Here is code to compute the various properties:

    static void
    compute_stuff (int                  op,
                   pixman_format_code_t src,
                   pixman_format_code_t mask,
                   pixman_format_code_t dest)
    {
        const char *intermediate;
        int pref_vs;
        int intermediate_size;
        int mult;
        int n_pixels;

        if (op == PIXMAN_OP_ADD     &&
            mask == PIXMAN_null     &&
            src == PIXMAN_a8        &&
            dest == PIXMAN_a8)
        {
            intermediate = "a8";
            intermediate_size = 1;
        }
        else if (PIXMAN_FORMAT_16BPC (src)  ||
                 PIXMAN_FORMAT_16BPC (mask) ||
                 PIXMAN_FORMAT_16BPC (dest))
        {
            intermediate = "a16r16g16b16";
            intermediate_size = 8;
        }
        else
        {
            intermediate = "a8r8g8b8";
            intermediate_size = 4;
        }

        if (strcmp (intermediate, "af8rf8gf8bf8") == 0)
        {
            mult = 1;
        }
        else
        {
            mult = 2;

            if (mask == PIXMAN_null)
            {
                if ((op == PIXMAN_OP_ADD)                         ||
                    (op == PIXMAN_OP_OVER && PIXMAN_A (src) == 0) ||
                    (op == PIXMAN_OP_SRC))
                {
                    mult = 1;
                }
            }

            for (pref_vs = 4; pref_vs <= 16; pref_vs *= 2)
            {
                int vector_size;

                if (pref_vs < mult * intermediate_size)
                    n_pixels = 1;
                else
                    n_pixels = pref_vs / intermediate_size;

                vector_size = MAX (n_pixels * intermediate_size,
                                   mult * intermediate_size);

                printf ("%2d vector type: <%d x %s> (%d pixels)\n",
                        pref_vs, vector_size / intermediate_size,
                        intermediate, n_pixels);
            }
        }
    }

Larrabee:

  Larrabee does not fit this pattern because it can't do arithmetic on 8
  bit channels; it always expands to at least 32 bits. On the other hand,
  it has 32 registers, so we could simply fake the un8 type with extra
  registers. A better approach is to have the backend give the number of
  pixels it can deal with for a given type.
The only problem with that is that the backend will have to artificially
limit the number of pixels per register since it has to be able to do
multiplications. Hmm, on the other hand, the front end will need to write
out registers to memory, so the packed representation is necessary. But
maybe only for output purposes? If the intermediate format has
Larrabee-like instructions, maybe the backends can optimize themselves?

We do need the typed vectors so that Larrabee can use the generous
representation internally for things like un8. (Ie., we can't just use 16
bit integers there and do the shift tricks in the frontend, because
Larrabee would prefer to do things differently).

For un8 types, maybe there should actually be two types: un8 and un8m,
where multiplications are only allowed on the 'm' variation. Another
possibility is to forget about the multiplication optimization and always
reserve enough space. For the vast majority of operations, we do need
multiplications, and for those where we don't we could just add special
fast paths.

This seems like the correct approach:

- Vector variables are typed; for example they could be un8 or u16, or
  f16, f64 etc.

- Conceptually, and as far as the frontend is concerned, the vectors are
  fully packed, so if you write them to memory, they will be stored in a
  packed format. Ie., the front end will consider a vector of 16 un8s to
  be 16 bytes wide.

- Internally, the backend may choose a different representation (and
  almost all backends will need to use 16 bits for the un8 type).

- When vectors are written to memory, the backends will need to do
  whatever packing and unpacking is required to undo their internal
  representation. The SSE2 backend would want to optimize two memory
  stores into one pack + move, rather than two pack-with-0-then-store.

One issue with this is what to do about several multiplications in a row.
These can be optimized because there is no need to divide by 255 until all
the adds have taken place. Of course, if a backend wishes to, it may
consider the internal type "un88", even when the external type is un8.
That just means that in addition to packing, it first has to divide by 255
(a scalar sketch of that division is given below). And to do this it has
to prove that the intermediate results cannot overflow.

Another issue is what to do with formats like 565 or a1 or weirder. And is
it possible to multiply two vectors together with different types? If it
is, what is the resulting type? The answer is likely that you can't
multiply vectors with different types, so the front end will need to
decide on an intermediate format, and generate code to load it. This means
we will need to be able to convert from n of one type to n of another.

r5g6b5 should probably be treated specially because the channels have
different widths. Maybe we just need a convert-r5g6b5-to-8888 instruction.
With typed vectors, it is quite difficult to generate code to convert it.

What about a8?

- find out how many un8s we can deal with
- divide by 4
- load that many u8s
- convert to u32
- shift/mask/etc.

Algorithm:

- Determine the intermediate format (max of all the involved formats,
  though at least un8).

- Given the intermediate format, determine n_pixels for source, mask and
  destination. Take the minimum.

- src  = load (4 * n_pixel, format)
  mask = load (4 * n_pixel, format)
  dest = load (4 * n_pixel, format)

It is the backend's problem how it wants to represent those internally.
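For reference, a scalar sketch of what a single un8 multiplication looks
like once a backend has widened the channel to 16 bits - the usual
pixman-style approximate division by 255. It is shown only to illustrate
why 16 bit lanes are enough headroom and why the division can be deferred
(as long as the deferred sums provably still fit, per the overflow remark
above):

    #include <stdint.h>

    /* un8 * un8 with rounding: t = a*b + 0x80; result = (t + (t >> 8)) >> 8.
     * The intermediate t never exceeds 16 bits, so a backend can keep an
     * un8 channel in a 16 bit lane and postpone the >>8 steps until after
     * a series of additions (the "un88" idea above). */
    static inline uint8_t
    mul_un8 (uint8_t a, uint8_t b)
    {
        uint16_t t = (uint16_t)a * b + 0x80;

        return (uint8_t)((t + (t >> 8)) >> 8);
    }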
For example, if there is no mask and we are compositing a8r8g8b8 x
a8r8g8b8, then the SSE2 backend might report that it can deal with 4
pixels. Internally, one 4 pixel IR register would be stored in 2 xmm
registers, unpacked. (Problem: a simple-minded liveness analysis will not
know that the two xmm registers could die separately). Though there could
be a "split" pass that would split such variables into two. Or there could
be IR instructions that would load two registers. Though, how would the
frontend know to do that? And how would the split pass deal with loads?

The intermediate format is probably always given as "un16" or "f32" or
somesuch (as opposed to a16r16g16b16 etc.). All the backends need to do
then is to provide a backend_get_n_pixels() that returns the number of
pixels it can handle for a given type. Or the number of lanes, really.

It is likely to be useful to allow backends to do their own rewriting
before register allocation. For example, the arm backend will probably
want to replace a shuffle-multiply with something that it can use to
generate normal multiplication code with. Similarly, the mmx backend can
only do word-level shuffling, so it will want to replace

    shuffle ()
    expand ()

with

    expand
    shuffle ()

These types of things should be done on the dependency graph of the
intermediate code. (And should therefore be restricted to things that are
semantically *identical*, not just equivalent).

    t = shuffle_byte (s, ...);
    v = expand (t)

can be transformed into

    t = expand (s)
    v = shuffle (t)

but only if there are no other dependencies on t. Alternatively, it could
be turned into

    t = shuffle_byte (s, ...);
    tt = expand (s);
    v = shuffle (tt);

with the hope that the initial shuffle_byte would then be dead. The first
one seems simpler:

- no termination arguments necessary
- no dead code elimination
- potentially less powerful, but meh.

Intermediate format

  Benefits:

  - Can be interpreted. Eventually only the jit code should be necessary
  - Multiple backends with different size registers
  - Register allocation
  - Optimization?
  - Typed variables

  Should there be a 0.8 type? The intel (and other) backend can do the
  expanding itself, and ARM can do more efficient multiplication.

  intel:

      src, mask loaded
      nmask = xor (mask, 0xff)
      s = src * mask
          expand (src) -> src1, src2
          expand (mask) -> mask1, mask2
          pixmul
          pixmul
          pack -> s
      d = dest * nmask
          expand (dest) -> ...
          pack -> d
      nmask = shuffle (src,
      s = d (+) s
      store s in dest

  which is just as good as before. In fact it's a little better because we
  save a negation instruction. Actually, we should keep the computed alpha
  around, and then negate afterwards, and then multiply. So there will be
  extra shuffling going on with this scheme.

  Arm:

      src, mask loaded
      nmask = xor (mask, 0xff)
      s = src * mask:
          uxtb16   s_low, src
          uxtb16   s_hi, src, ror 8
          mla      s_low, s_low, mask, 0x80
          mla      s_hi, s_hi, mask, 0x80
          uxtab16  s_low, s_low, s_low, rot 8
          uxtab16  s_hi, s_hi, s_hi, rot 8
          and      s_hi, 0xff00ff00
          uxtab16  s_low, s_hi, s_low, rot 8

One difference is that Arm wants the mask as a single byte whereas intel
wants it duplicated. The correct solution may be to simply read it as an
ARGB; then intel can shuffle and Arm can do its own multiply with its
builtin shifting.

Another possibility is to add an IR instruction to multiply a vector with
one number. On intel this would result in a pshuf + multiply. On ARM it
could be done with just two multiplications. On intel we would actually
need to multiply with a number given in a field of another vector.
Unfortunately, this can't be implemented on ARM without assuming that
there is room for overflow. An intelligent instruction selector on ARM
would probably be a big benefit.
Or just assume that NEON is the future. It seems saner than v6 SIMD.

- Three-register instructions

  It is what future x86s will have, it is what ARM wants, it is much
  easier to generate code for, and it is easy to turn three-register code
  into two-register code, but not the other way around.

Older notes:

- The generated ops should have a simpler prototype than the normal one.
  Maybe something like:

      void (* CompositeOp) (uint32_t *src_start,
                            uint32_t  src_skip,   /* = src_stride - (width * src_bpp) / 8 */
                            uint32_t *mask_start,
                            uint32_t  mask_skip,
                            uint32_t *dest_start,
                            uint32_t  dest_skip,
                            uint16_t  width,
                            uint16_t  height);

  The amount of setup in generated code should be minimized, both to
  simplify code generation and to reduce the memory overhead of the code
  generation.

  For ops where source or mask is solid, src/mask_start should point to an
  8888 pixel arranged similarly to the dest format. Ie., unpacking should
  happen before the op is called.

  If we add transformations and filters, they can be added at the end of
  the argument list - that way the code won't have to change too much. The
  problem with this though is that we need essentially random access to do
  transformations. It also doesn't work for gradients.

- It should be figured out how to deal with alignment restrictions. There
  is movdqa, which requires 16 byte alignment, and there is movdqu, which
  doesn't. Agner says movdqu is slow and may better be replaced with two
  movq's. Maybe it's best to just always use correctly aligned reads:

  - Generate a version for every combination of alignments?
  - Dynamically determine the instruction to use?
  - This is not necessarily that slow since branch prediction will usually
    get it right.

  In all cases we will need to be able to handle single pixels. So maybe
  start out just generating those.

  The optimal thing to do would be (a C sketch of the resulting loop
  structure follows at the end of this section):

  - Compute n_pixels (sse, op, src_format, mask_format, dest_format), the
    number of pixels that fits in a register.

  - For each line compute the initial unaligned strip of pixels. If this
    is the same for all three, then set sw to that number, otherwise set
    it to the full width.

        inner_loop:
            tmp_wid = width;        (with tmp_wid on the stack)
            w = that_number;
            tmp_wid -= sw;

  - while (sw--) { do_one_pixel; }

    Then compute the number of aligned pixels that can be done, and

        while (aw--) { do_full_size (); }

    Finally

        while (w--) { do_one_pixel; }

  The "do_one_pixel" loop can be in a separate function to avoid
  duplicating the code. In fact, both the one-pixel and the full-size
  version could be the same code with a parameter to determine the
  difference. Branch prediction would make this relatively inexpensive.
  Although, the full size code is quite different since it has to deal
  with packed pixels.

  Also we need the images to be identically aligned - it's not good enough
  to only align say the source to 16 bytes. It may be worth having an
  initial pass that deals with any leading and trailing unaligned data.

Dealing with alignment

  - Size of operation: can be 1 or 2 bytes (depending on whether it's
    multiplication or just addition)

  - available_channels = vector_size / op_size;
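  As promised above, a plain C model of the per-scanline structure being
  described: leading pixels one at a time until aligned, then n_pixels at a
  time, then the trailing remainder. do_one_pixel() and do_full_size()
  stand in for the generated code, and this sketch only checks the
  destination's alignment - whether the sources need the same treatment is
  the open question discussed here:

      #include <stdint.h>

      static void
      composite_line (uint8_t *dest, int width, int dest_bpp, int n_pixels)
      {
          int w = width;

          /* leading pixels, one at a time, until dest is 16 byte aligned */
          while (w > 0 && ((uintptr_t)dest & 15) != 0)
          {
              do_one_pixel ();
              dest += dest_bpp;
              w--;
          }

          /* aligned body, n_pixels at a time */
          while (w >= n_pixels)
          {
              do_full_size ();
              dest += n_pixels * dest_bpp;
              w -= n_pixels;
          }

          /* trailing pixels */
          while (w > 0)
          {
              do_one_pixel ();
              dest += dest_bpp;
              w--;
          }
      }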
  In principle we could always read in as many pixels as possible from the
  shortest input. Ie., if we have an a8 mask, composite 16 pixels at a
  time, unrolling as necessary. The register pressure is likely to make
  this a loss though:

      1 register set to 0
      1 register to hold the 16 mask pixels
      1 register to hold the 4 src pixels
      1 register to hold the 4 dest pixels
      1 register to hold two expanded src pixels
      1 register to hold two expanded dest pixels
      1 register to hold two expanded alpha pixels

  That's 8 registers already, which is all we have got unless we are in 64
  bit mode. We could make use of mmx registers too - the currently
  inactive pixels could be held there. Ie.,

      movdqa   *src, xmm0
      movdq2q  xmm0, mm0
      psrldq   xmm0, 64
      movdq2q  xmm0, mm1

      movdqa   *mask, xmm0
      movdq2q  xmm0, mm2
      psrldq   xmm0, 64
      movdq2q  xmm0, mm3

      movdqa   *mask, xmm0
      movdq2q  xmm0, mm2
      psrldq   xmm0, 64
      movdq2q  xmm0, mm3

A simple, but probably pretty good scheme:

- Generate two versions of each op, one where everything is aligned, and
  one where alignment is detected on the fly. In both cases n_pixels is
  computed, the number of pixels to handle per iteration. In the aligned
  case, we then just read in that many pixels as efficiently as possible.
  For the first iteration, that probably means 2 pixels in many cases, but
  eventually it would be nice to unroll once to get to four pixels.

  So both versions have a preamble that reads the source, mask and
  destination into sse registers. Then afterwards, the computations are
  the same, then finally the two different versions generate the final
  write to the destination.

Another possibility:

- Deal with unaligned or short lines in separate loops before and after
  the aligned loop. Could be in a function.

Current thinking:

- Just ensure the destination is aligned, and use movdqu for sources.

Computing the rendering:

- Reserve three registers for 0x00.., 0x0080.. and 0x00ff

- Compute (src IN mask) and (srca), where (src IN mask) is returned as a
  packed vector, and srca1 and srca2 are expanded. (srca is only computed
  if actually needed).

- Read dest into one of the remaining two registers

- In the last register expand the low part of d and combine it onto src1

- In the last register expand the high part of d and combine it onto src2

  At this point src1 and src2 represent the 'fd' that will get added onto s

- Pack src1 and src2 together in src1 (which is now the fd we will add to s)
- Unpack the low part of s into src2 (mov s, s2; unpack z, s2)
- Unpack the low part of d into the last register
- Shuffle the last register to become dest alpha
- Combine onto src2
- Unpack the high part of s into the last register
- Unpack the high part of d into itself
- Combine onto s
- Pack s and src2 into src2
- Add src1 and src2
- Store the result.

Unfortunately, we are one register short of being able to do F_a's and
F_b's where both src_a and dst_a are involved.

Note: there is no integer division in SSE, so this would possibly have to
be done with floating point. Ie., using cvtdq2ps.

- It is important that the register allocator is not too dumb

  - EAX is the only register we can use for multiplication
  - There are not a lot of registers.
  - Need to do something sensible when we run out of registers. Spilling
    to memory is a possibility - just giving up is another.

We need the ability to unallocate a register, and then use it in a
following instruction. That way a register can be reused in the middle of
an instruction. (Moves from a register to itself should be culled). This
probably also implies that we need the ability to ref count registers, and
that callees will often have to take ownership of a passed-in register (a
small sketch of what that could look like follows below).
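A minimal sketch of what ref counted register handles might look like. All
names here (reg_t, reg_ref, reg_unref, free_register) are hypothetical;
nothing like this exists in the assembler yet:

    /* Hypothetical ref counted register handle. A callee that wants to
     * keep a register beyond the current instruction takes a reference;
     * when the count drops to zero the register becomes available again. */
    typedef struct
    {
        int regno;       /* hardware register number */
        int ref_count;
    } reg_t;

    static reg_t *
    reg_ref (reg_t *reg)
    {
        reg->ref_count++;
        return reg;
    }

    static void
    reg_unref (reg_t *reg)
    {
        if (--reg->ref_count == 0)
            free_register (reg->regno);   /* hypothetical allocator hook */
    }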
Pixels are always 1, 2, 4, or 8 bytes wide, which means we can make good
use of the x86 addressing mode: only one register is necessary for the X
coordinate, and we can then index into the images using displacements and
shifts:

    while (h--)
    {
        s_line = src_pixels + h * s_stride;
        m_line = mask_pixels + h * m_stride;
        d_line = dest_pixels + h * d_stride;

        while (w--)
        {
            s = s_line + w * s_bpp;
            m = m_line + w * m_bpp;
            d = d_line + w * d_bpp;
        }
    }

    Registers: h, w, s_line, m_line, d_line

Or:

    s = src_pixels;
    m = mask_pixels;
    d = dest_pixels;

    while (h--)
    {
        while (w--)
        {
            [s + w * bpp]
            [d + w * bpp]
            [m + w * bpp]
        }

        s += s_stride;
        m += m_stride;
        d += d_stride;
    }

    Registers: h, w, s, d, m, s_stride, m_stride, d_stride.

- Optimistic Register Allocator

  Simply hand out registers as code asks for them. If we run out of
  registers, then spill a register somewhere on the stack. (At emit time
  we will make room for spills). When a spilled variable is being used,
  some random other variable is stored on the stack, and the used variable
  gets the register. Possibly using an LRU algorithm to decide the
  eviction.

  Before jumps and before labels, a 'normalization' is run where things
  are put into the locations they are supposed to be in. Ie., all
  variables have two fields: where *are* they, and where do they *belong*.
  The invariant is that at entry to any basic block, variables are in
  their assigned location. Combined with not register allocating
  constants, this really should be good enough.

- Code generator / runtime assembler

  Eventually, the x86_codegen.h file should be folded into codex86.c.
  Things like membase emission can be done straight from the ops.

- Need to check malloc() returns.

- An SSE/MMX register allocator is probably required. It would be nice if
  we could figure out up front how many registers we are going to need so
  that constants can be register allocated and hoisted out of the loop
  when possible. Have a "dry run" mode that just finds out which registers
  are used?

- Should probably swap the arguments to memindex and membase to make them
  look more like at&t syntax. Also find out where to put immediate
  operands. Maybe I need to face the fact that at&t syntax is crack.

- The code could be made smaller and faster if op_t were made smaller by
  compressing fields into uint8_t's. This would cause gdb to not show
  registers as enums though.

- Decide on syntax: AT&T vs Intel? Right now we are mostly at&t, but
  that's kind of weird for things like test and cmp.

- Public API: pixman-sse-jit.h:

      pixman_sse_jit_t *pixman_sse_jit_new ();

      pixman_sse_code_t *pixman_sse_jit_get_op (pixman_sse_jit_t *jit,
                                                pixman_op_t op,
                                                const pixman_sse_image_info *src,
                                                const pixman_sse_image_info *mask,
                                                const pixman_sse_image_info *dest);

      void pixman_sse_code_run (pixman_sse_code_t *code,
                                pixman_op_t op,
                                pixman_image_t *src,
                                pixman_image_t *mask,
                                pixman_image_t *dest,
                                int x_src, int y_src,
                                int x_mask, int y_mask,
                                int x_dest, int y_dest,
                                uint16_t width, uint16_t height);

  where pixman_sse_image_info would contain information about the format,
  whether the image is solid, and whether the image is there at all (for
  the mask).
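  A sketch of how a caller might use that proposed API. Nothing here
  exists yet - the function names come from the prototypes above, and
  src_info/mask_info/dest_info and the image variables are placeholders:

      pixman_sse_jit_t  *jit;
      pixman_sse_code_t *code;

      jit = pixman_sse_jit_new ();

      /* src_info, mask_info and dest_info describe format, solidity and
       * whether the image is present at all */
      code = pixman_sse_jit_get_op (jit, PIXMAN_OP_OVER,
                                    &src_info, &mask_info, &dest_info);

      if (code)
      {
          pixman_sse_code_run (code, PIXMAN_OP_OVER,
                               src_image, mask_image, dest_image,
                               src_x, src_y, mask_x, mask_y,
                               dest_x, dest_y, width, height);
      }
      else
      {
          /* fall back to the existing non-jitted paths */
      }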
DONE:

- Pixman CPU detection should be generated dynamically. That will get rid
  of the annoying #ifdefs and getisax() stuff.

- Backwards vs. forwards

  The code currently in testjit iterates backwards over each line. It may
  be a little better to go forward. This could be done by

  - initializing the line to (line + w * bpp)
  - initializing w to -width
  - not having a displacement:

        movq (line, w, bpp), xmm0

    and

        add 2, width

  Current code iterates forwards.

- The memindex/membase should take ops, not reg numbers.

- There should only be REG and MEM in the ops. emit_memindex() can handle
  everything.

- Assembler needs to deal with labels. API:

      x86_jz (a, "no_cpuid");

      asm_label (a, "no_cpuid");

  The assembler would just store a big list of labels/positions, then
  patch up at asm_emit() time (a rough sketch of that bookkeeping follows
  at the end of these notes).

- REG ops can contain any register whatsoever (xmm, mmx, eax, al etc.).
  There should be a separate op_get_regno() function to extract the
  register number from an op.

- There should be eax() type functions.

- Renaming mov to movl was a mistake. The assembler should be able to
  figure out the size of the operation by itself. (By looking at the
  involved registers).
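  The label bookkeeping sketch mentioned above - the structs and the
  patching function are hypothetical, shown only to illustrate the
  store-then-patch idea (and it assumes 4 byte rel32 jump offsets):

      #include <stdint.h>
      #include <string.h>

      typedef struct
      {
          const char *name;
          int         offset;        /* byte offset of the label */
      } label_def_t;

      typedef struct
      {
          const char *name;
          int         patch_offset;  /* where the rel32 field starts */
      } label_use_t;

      static void
      patch_labels (uint8_t *code,
                    label_def_t *defs, int n_defs,
                    label_use_t *uses, int n_uses)
      {
          int i, j;

          for (i = 0; i < n_uses; ++i)
          {
              for (j = 0; j < n_defs; ++j)
              {
                  if (strcmp (uses[i].name, defs[j].name) == 0)
                  {
                      /* rel32 is relative to the end of the 4 byte field */
                      int32_t rel = defs[j].offset - (uses[i].patch_offset + 4);

                      memcpy (code + uses[i].patch_offset, &rel, sizeof rel);
                      break;
                  }
              }
          }
      }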