Unpremultipliers galore!
------------------------

Cairo and many other graphics libraries treat the colour components
of a pixel in memory as being premultiplied by alpha, while many
other systems such as SDL and OpenGL treat the components separately.
This impedance mismatch is mostly annoying since it requires
translating pixel buffers between representations when moving data
between systems, but it can also be a performance bottleneck: going
from the premultiplied to the normal representation requires dividing
every colour component by alpha to get the "unpremultiplied"
components.

unpremultiply-div.c:    The reference unpremultiplier.  This one does
                        three divisions per pixel.

unpremultiply-inv.c:    Uses a 1 KB table of reciprocals.

unpremultiply-lut.c:    Uses a full 64 KB lookup table.  Does free
                        saturation of superluminant pixels.

unpremultiply-sse2.S:   A blocked version using 3 KB of reciprocal
                        tables.  For AMD64/SSE2.

Since lots of images have runs of constant or solid pixels we should
optimise for that case.  At a small to medium cost (~10-20%) for the
varying pixel case we can get a big speedup for the boring areas of
the image.

unpremultiply-invb.c:   ...-inv.c with fast paths for boring bits.

unpremultiply-lutb.c:   ...-lut.c with fast paths for boring bits.

There's no one true unpremultiplier among these as the relative
timings depend a lot on the type of image to unpremultiply, the
memory subsystem of the machine (most notably cache sizes vs. image
size vs. lookup table size), and whether superluminant input pixels
are saturated on the output.  The space used by the different tables
may also be an issue for some programs.

Optimising for size is easy:

-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
-rw------- 1 rowan rowan  4224 2009-01-14 20:30 unpremultiply-sse2.o
-rw------- 1 rowan rowan  3480 2009-01-14 20:30 unpremultiply-invb.o
-rw------- 1 rowan rowan  2400 2009-01-14 20:30 unpremultiply-inv.o
-rw------- 1 rowan rowan  2000 2009-01-14 20:30 unpremultiply-div.o

When optimising for speed you can compare the versions by the
relative time taken to unpremultiply variously sized buffers of data,
where the baseline is the time taken to memcpy() the buffer.  A short
C sketch of the scalar approaches follows, and then the timing
tables.
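To make the table columns concrete, here is a rough sketch of the two
scalar approaches, assuming 32-bit ARGB pixels with alpha in the most
significant byte.  The function names, the 16.16 fixed-point
reciprocal format and the rounding details are illustrative only;
they are not copied from unpremultiply-div.c or unpremultiply-inv.c.

    #include <stddef.h>
    #include <stdint.h>

    /* Reference approach (cf. unpremultiply-div.c): three integer
     * divisions per pixel, recovering C = c*255/a for each component. */
    static void
    unpremultiply_div(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t p = src[i];
            uint32_t a = p >> 24;
            if (a == 0) {
                dst[i] = 0;
                continue;
            }
            uint32_t r = ((p >> 16) & 0xff) * 255 / a;
            uint32_t g = ((p >>  8) & 0xff) * 255 / a;
            uint32_t b = ( p        & 0xff) * 255 / a;
            dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
        }
    }

    /* Reciprocal-table approach (cf. unpremultiply-inv.c): trade the
     * divisions for one table lookup and three multiplications.
     * 256 32-bit entries = 1 KB; entry a holds roughly 255/a in
     * 16.16 fixed point. */
    static uint32_t reciprocal[256];

    static void
    init_reciprocal(void)
    {
        reciprocal[0] = 0;
        for (uint32_t a = 1; a < 256; a++)
            reciprocal[a] = (255u * 65536 + a - 1) / a;
    }

    static void
    unpremultiply_inv(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t p = src[i];
            uint32_t a = p >> 24;
            uint32_t inv = reciprocal[a];
            /* Note: superluminant inputs (component > alpha) can
             * overflow 8 bits here, as discussed below. */
            uint32_t r = (((p >> 16) & 0xff) * inv) >> 16;
            uint32_t g = (((p >>  8) & 0xff) * inv) >> 16;
            uint32_t b = (( p        & 0xff) * inv) >> 16;
            dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
        }
    }

The lut variants presumably go one step further and index a
65536-entry byte table directly by alpha and component value, which
is where the 64 KB figure and the free saturation of superluminant
pixels would come from.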
Data   |  Pixels | copy  | sse2  |  invb  |  lutb  |  lut   |  inv   |   div
       |         |       |       |        |        |        |        |
random |     512 | 1.000 | 3.602 | 13.493 | 12.029 |  9.974 | 12.980 | 124.729
random |    4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
random |   32768 | 1.000 | 1.478 |  6.945 |  6.544 |  5.454 |  6.605 |  62.213
random |  262144 | 1.000 | 0.959 |  3.355 |  3.128 |  2.653 |  3.193 |  29.377
random | 2097152 | 1.000 | 1.012 |  3.538 |  3.297 |  2.799 |  3.365 |  30.819
       |         |       |       |        |        |        |        |
solid  |     512 | 1.000 | 3.589 |  2.628 |  2.624 |  9.829 | 12.978 | 124.918
solid  |    4096 | 1.000 | 3.088 |  2.625 |  2.625 | 10.166 | 13.432 | 129.526
solid  |   32768 | 1.000 | 1.114 |  1.199 |  1.313 |  3.935 |  4.927 |  46.612
solid  |  262144 | 1.000 | 0.882 |  1.502 |  1.529 |  2.466 |  3.016 |  27.826
solid  | 2097152 | 1.000 | 0.911 |  1.544 |  1.545 |  2.516 |  3.091 |  28.389
       |         |       |       |        |        |        |        |
clear  |     512 | 1.000 | 3.591 |  1.620 |  1.611 |  9.730 | 12.965 |   3.392
clear  |    4096 | 1.000 | 3.098 |  1.571 |  1.570 | 10.077 | 13.434 |   3.487
clear  |   32768 | 1.000 | 1.127 |  1.018 |  1.062 |  3.927 |  4.988 |   1.679
clear  |  262144 | 1.000 | 0.876 |  1.424 |  1.429 |  2.369 |  2.936 |   1.508
clear  | 2097152 | 1.000 | 0.985 |  1.662 |  1.652 |  2.723 |  3.373 |   1.688

For busy images without large solid areas the best portable choice is
lut, if you can take the 64 KB of table baggage it comes with.

The inv routines don't deal correctly with superluminant pixels that
overflow the 8-bit components of the destination.  If you #define
DO_CLAMP_INPUT to 1 then the troublesome pixel components will be
saturated instead of overflowing.  There's a small penalty of about
10% for that feature.

The SSE2 version should be preferred for large images.  The reason
it's not so great for small images is that it doesn't have the
data-specific fast paths and it uses nontemporal writes to store into
the destination buffer.  The net effect is that the destination
buffer is pushed out of the caches and through into main memory
regardless of whether it could comfortably sit in cache all day long.
For comparison the table below shows the numbers when forcing SSE2 to
use movdqa instead of movntdq to write to the destination buffer:

Data   |  Pixels | copy  | sse2  |  invb  |  lutb
       |         |       |       |        |
random |     512 | 1.000 | 2.797 | 12.040 | 11.873
random |    4096 | 1.000 | 2.764 | 12.460 | 12.763
random |   32768 | 1.000 | 1.577 |  6.033 |  6.218
random |  262144 | 1.000 | 1.598 |  3.074 |  3.168
random | 2097152 | 1.000 | 1.597 |  3.170 |  3.272
       |         |       |       |        |
solid  |     512 | 1.000 | 2.791 |  2.626 |  2.620
solid  |    4096 | 1.000 | 2.763 |  2.628 |  2.627
solid  |   32768 | 1.000 | 1.687 |  1.633 |  1.644
solid  |  262144 | 1.000 | 1.523 |  1.649 |  1.636
solid  | 2097152 | 1.000 | 1.575 |  1.721 |  1.665
       |         |       |       |        |
clear  |     512 | 1.000 | 2.800 |  1.614 |  1.610
clear  |    4096 | 1.000 | 2.763 |  1.571 |  1.571
clear  |   32768 | 1.000 | 1.707 |  1.549 |  1.833
clear  |  262144 | 1.000 | 1.508 |  1.546 |  1.534
clear  | 2097152 | 1.000 | 1.558 |  1.652 |  1.673
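For reference, the difference between those two store instructions
can also be shown from C with SSE2 intrinsics.  This is only a sketch
of the store-instruction choice, with made-up function and variable
names; the real unpremultiply-sse2.S is hand-written assembly that
also does the blocked reciprocal-table arithmetic.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Write nvec 16-byte vectors from a 16-byte-aligned source to a
     * 16-byte-aligned destination.  _mm_stream_si128 compiles to
     * movntdq: the stores bypass the cache, which helps when the
     * destination is large and won't be read again soon, but means a
     * small destination that could have stayed cache-hot ends up in
     * main memory instead.  _mm_store_si128 compiles to movdqa, a
     * normal cached store. */
    static void
    store_vectors(__m128i *dst, const __m128i *src, size_t nvec,
                  int nontemporal)
    {
        for (size_t i = 0; i < nvec; i++) {
            __m128i v = _mm_load_si128(&src[i]);
            if (nontemporal)
                _mm_stream_si128(&dst[i], v);   /* movntdq */
            else
                _mm_store_si128(&dst[i], v);    /* movdqa  */
        }
        if (nontemporal)
            _mm_sfence();   /* fence the streaming stores */
    }

Choosing between the two at run time based on the buffer size
relative to the cache would be one way to get the better column from
each of the two tables above.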