Unpremultipliers galore!
------------------------

Cairo and many other graphics libraries treat the colour components
of a pixel in memory as being premultiplied by alpha, while many
other systems such as SDL and OpenGL treat the components separately.
This impedance mismatch is mostly annoying since it requires
translating pixel buffers between representations when moving data
between systems, but it can also be a performance bottleneck: going
from the premultiplied to the normal representation requires dividing
every colour component by alpha to get the "unpremultiplied"
components.

unpremultiply-div.c:    The reference unpremultiplier.  This one does
                        three divisions per pixel.

unpremultiply-inv.c:    Uses a 1 KB table of reciprocals.

unpremultiply-lut.c:    Uses a full 64 KB lookup table.  Does free
                        saturation of superluminant pixels.

unpremultiply-sse2.S:   A blocked version using 3 KB of reciprocal
                        tables.  For AMD64/SSE2.

Since lots of images have runs of constant or solid pixels we should
optimise for that case.  At a small to medium cost (~10-20%) for the
varying pixel case we can get a big speedup for the boring areas of
the image.

unpremultiply-invb.c:   ...-inv.c with fast paths for boring bits.

unpremultiply-lutb.c:   ...-lut.c with fast paths for boring bits.

There's no one true unpremultiplier among these as the relative
timings depend a lot on the type of image to unpremultiply, the
memory subsystem of the machine (most notably cache sizes vs. image
size vs. lookup table size), and whether superluminant input pixels
are saturated on the output.  The space used by the different tables
may also be an issue for some programs.

Optimising for size is easy:

-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
-rw------- 1 rowan rowan  4224 2009-01-14 20:30 unpremultiply-sse2.o
-rw------- 1 rowan rowan  3480 2009-01-14 20:30 unpremultiply-invb.o
-rw------- 1 rowan rowan  2400 2009-01-14 20:30 unpremultiply-inv.o
-rw------- 1 rowan rowan  2000 2009-01-14 20:30 unpremultiply-div.o

When optimising for speed you can compare the versions by the
relative time taken to unpremultiply variously sized buffers of data,
where the baseline is the time taken to memcpy() the buffer.  A short
C sketch of the scalar approaches follows, and then the timing
tables.
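To make the table columns concrete, here is a rough sketch of the two
scalar approaches, assuming 32-bit ARGB pixels with alpha in the most
significant byte.  The function names, the 16.16 fixed-point
reciprocal format and the rounding details are illustrative only;
they are not copied from unpremultiply-div.c or unpremultiply-inv.c.

    #include <stddef.h>
    #include <stdint.h>

    /* Reference approach (cf. unpremultiply-div.c): three integer
     * divisions per pixel, recovering C = c*255/a for each component. */
    static void
    unpremultiply_div(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t p = src[i];
            uint32_t a = p >> 24;
            if (a == 0) {
                dst[i] = 0;
                continue;
            }
            uint32_t r = ((p >> 16) & 0xff) * 255 / a;
            uint32_t g = ((p >>  8) & 0xff) * 255 / a;
            uint32_t b = ( p        & 0xff) * 255 / a;
            dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
        }
    }

    /* Reciprocal-table approach (cf. unpremultiply-inv.c): trade the
     * divisions for one table lookup and three multiplications.
     * 256 32-bit entries = 1 KB; entry a holds roughly 255/a in
     * 16.16 fixed point. */
    static uint32_t reciprocal[256];

    static void
    init_reciprocal(void)
    {
        reciprocal[0] = 0;
        for (uint32_t a = 1; a < 256; a++)
            reciprocal[a] = (255u * 65536 + a - 1) / a;
    }

    static void
    unpremultiply_inv(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t p = src[i];
            uint32_t a = p >> 24;
            uint32_t inv = reciprocal[a];
            /* Note: superluminant inputs (component > alpha) can
             * overflow 8 bits here, as discussed below. */
            uint32_t r = (((p >> 16) & 0xff) * inv) >> 16;
            uint32_t g = (((p >>  8) & 0xff) * inv) >> 16;
            uint32_t b = (( p        & 0xff) * inv) >> 16;
            dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
        }
    }

The lut variants presumably go one step further and index a
65536-entry byte table directly by alpha and component value, which
is where the 64 KB figure and the free saturation of superluminant
pixels would come from.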
Data   |  Pixels | copy  | sse2  |  invb  |  lutb  |  lut   |  inv   |   div
       |         |       |       |        |        |        |        |
random |     512 | 1.000 | 3.602 | 13.493 | 12.029 |  9.974 | 12.980 | 124.729
random |    4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
random |   32768 | 1.000 | 1.478 |  6.945 |  6.544 |  5.454 |  6.605 |  62.213
random |  262144 | 1.000 | 0.959 |  3.355 |  3.128 |  2.653 |  3.193 |  29.377
random | 2097152 | 1.000 | 1.012 |  3.538 |  3.297 |  2.799 |  3.365 |  30.819
       |         |       |       |        |        |        |        |
solid  |     512 | 1.000 | 3.589 |  2.628 |  2.624 |  9.829 | 12.978 | 124.918
solid  |    4096 | 1.000 | 3.088 |  2.625 |  2.625 | 10.166 | 13.432 | 129.526
solid  |   32768 | 1.000 | 1.114 |  1.199 |  1.313 |  3.935 |  4.927 |  46.612
solid  |  262144 | 1.000 | 0.882 |  1.502 |  1.529 |  2.466 |  3.016 |  27.826
solid  | 2097152 | 1.000 | 0.911 |  1.544 |  1.545 |  2.516 |  3.091 |  28.389
       |         |       |       |        |        |        |        |
clear  |     512 | 1.000 | 3.591 |  1.620 |  1.611 |  9.730 | 12.965 |   3.392
clear  |    4096 | 1.000 | 3.098 |  1.571 |  1.570 | 10.077 | 13.434 |   3.487
clear  |   32768 | 1.000 | 1.127 |  1.018 |  1.062 |  3.927 |  4.988 |   1.679
clear  |  262144 | 1.000 | 0.876 |  1.424 |  1.429 |  2.369 |  2.936 |   1.508
clear  | 2097152 | 1.000 | 0.985 |  1.662 |  1.652 |  2.723 |  3.373 |   1.688

For busy images without large solid areas the best portable choice is
lut, if you can take the 64 KB of table baggage it comes with.

The inv routines don't deal correctly with superluminant pixels that
overflow the 8-bit components of the destination.  If you #define
DO_CLAMP_INPUT to 1 then the troublesome pixel components will be
saturated instead of overflowing.  There's a small penalty of about
10% for that feature.

The SSE2 version should be preferred for large images.  The reason
it's not so great for small images is that it doesn't have the
data-specific fast paths and it uses nontemporal writes to store into
the destination buffer.  The net effect is that the destination
buffer is pushed out of the caches and through into main memory
regardless of whether it could comfortably sit in cache all day long.
For comparison the table below shows the numbers when forcing SSE2 to
use movdqa instead of movntdq to write to the destination buffer:

Data   |  Pixels | copy  | sse2  |  invb  |  lutb
       |         |       |       |        |
random |     512 | 1.000 | 2.797 | 12.040 | 11.873
random |    4096 | 1.000 | 2.764 | 12.460 | 12.763
random |   32768 | 1.000 | 1.577 |  6.033 |  6.218
random |  262144 | 1.000 | 1.598 |  3.074 |  3.168
random | 2097152 | 1.000 | 1.597 |  3.170 |  3.272
       |         |       |       |        |
solid  |     512 | 1.000 | 2.791 |  2.626 |  2.620
solid  |    4096 | 1.000 | 2.763 |  2.628 |  2.627
solid  |   32768 | 1.000 | 1.687 |  1.633 |  1.644
solid  |  262144 | 1.000 | 1.523 |  1.649 |  1.636
solid  | 2097152 | 1.000 | 1.575 |  1.721 |  1.665
       |         |       |       |        |
clear  |     512 | 1.000 | 2.800 |  1.614 |  1.610
clear  |    4096 | 1.000 | 2.763 |  1.571 |  1.571
clear  |   32768 | 1.000 | 1.707 |  1.549 |  1.833
clear  |  262144 | 1.000 | 1.508 |  1.546 |  1.534
clear  | 2097152 | 1.000 | 1.558 |  1.652 |  1.673
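For reference, the difference between those two store instructions
can also be shown from C with SSE2 intrinsics.  This is only a sketch
of the store-instruction choice, with made-up function and variable
names; the real unpremultiply-sse2.S is hand-written assembly that
also does the blocked reciprocal-table arithmetic.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Write nvec 16-byte vectors from a 16-byte-aligned source to a
     * 16-byte-aligned destination.  _mm_stream_si128 compiles to
     * movntdq: the stores bypass the cache, which helps when the
     * destination is large and won't be read again soon, but means a
     * small destination that could have stayed cache-hot ends up in
     * main memory instead.  _mm_store_si128 compiles to movdqa, a
     * normal cached store. */
    static void
    store_vectors(__m128i *dst, const __m128i *src, size_t nvec,
                  int nontemporal)
    {
        for (size_t i = 0; i < nvec; i++) {
            __m128i v = _mm_load_si128(&src[i]);
            if (nontemporal)
                _mm_stream_si128(&dst[i], v);   /* movntdq */
            else
                _mm_store_si128(&dst[i], v);    /* movdqa  */
        }
        if (nontemporal)
            _mm_sfence();   /* fence the streaming stores */
    }

Choosing between the two at run time based on the buffer size
relative to the cache would be one way to get the better column from
each of the two tables above.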