author | M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> | 2009-01-14 23:37:08 +0200 |
---|---|---|
committer | M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> | 2009-01-14 23:37:08 +0200 |
commit | 2336c8788dea7209e4bc582f0f7c8cf163edf589 (patch) | |
tree | fdb45c0e08a91d4664415679bbbcb3822afe1621 | |
parent | 4306a288dd0721604bc0b74b5a1deb959b4500f0 (diff) |
Added README
-rw-r--r-- | README | 109 |
1 files changed, 109 insertions, 0 deletions
@@ -0,0 +1,109 @@
+Unpremultipliers galore!
+------------------------
+
+Cairo and many other graphics libraries treat the colour components of
+a pixel in memory as being premultiplied by alpha, while many other
+systems such as SDL and OpenGL treat the components separately.  This
+impedance mismatch is mostly annoying since it requires translating
+pixel buffers between representations when moving data between
+systems, but it can also be a performance bottleneck.  Namely going
+from premultiplied to normal representation requires computing a
+division by alpha of every colour component to get the
+"unpremultiplied" components.
+
+    unpremultiply-div.c:   The reference unpremultiplier.
+                           This one does three divisions per pixel.
+
+    unpremultiply-inv.c:   Uses a 1 KB table of reciprocals.
+
+    unpremultiply-lut.c:   Uses a full 64 KB lookup table.  Does
+                           free saturation of superluminant pixels.
+
+    unpremultiply-sse2.S:  A blocked version using 3 KB of reciprocal
+                           tables.  For AMD64/SSE2.
+
+Since lots of images have a lot of runs of constant or solid pixels we
+should optimise for that case.  At a small to medium cost (~10-20%)
+for the varying pixel case we can get a big speedup for the boring
+areas of the image.
+
+    unpremultiply-invb.c:  ...-inv.c with fast paths for boring bits.
+
+    unpremultiply-lutb.c:  ...-lut.c with fast paths for boring bits.
+
+There's no one true unpremultiplier among these ones as the relative
+timings depend a lot on the type of image to unpremultiply, the
+memory subsystem of the machine (most notably cache sizes vs. image
+size vs. lookup table size), and whether superluminant input pixels
+are saturated on the output.  Also space used by the different
+tables may be an issue for some programs.
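The approaches listed above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the pixel layout (alpha in the top byte of a 32-bit word), the truncating rounding, the 16.16 fixed-point reciprocals and every function name are assumptions. The sketch also assumes valid premultiplied input, where no colour component exceeds alpha.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 32-bit pixel, alpha in bits 24-31, then R, G, B. */

/* Reference approach: one truncating division per colour component. */
static uint32_t unpremultiply_div(uint32_t p)
{
    uint32_t a = p >> 24;
    if (a == 0)
        return 0;                       /* clear pixel: nothing to recover */
    uint32_t r = ((p >> 16) & 0xff) * 255 / a;
    uint32_t g = ((p >> 8) & 0xff) * 255 / a;
    uint32_t b = (p & 0xff) * 255 / a;
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Reciprocal table: 256 entries of 4 bytes each, i.e. 1 KB. */
static uint32_t recip[256];

static void init_recip(void)
{
    recip[0] = 0;                       /* alpha 0 maps every component to 0 */
    for (uint32_t a = 1; a < 256; a++)
        recip[a] = (255u * 65536u + a - 1) / a;   /* ceil(255 * 2^16 / a) */
}

/* Table approach: a multiply and a shift replace each division. */
static uint32_t unpremultiply_inv(uint32_t p)
{
    uint32_t a = p >> 24;
    uint32_t k = recip[a];
    uint32_t r = (((p >> 16) & 0xff) * k) >> 16;
    uint32_t g = (((p >> 8) & 0xff) * k) >> 16;
    uint32_t b = ((p & 0xff) * k) >> 16;
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* One possible fast path for runs: reuse the previous result while the
 * input pixel repeats, and pass fully opaque pixels through unchanged. */
static void unpremultiply_run(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t last_in = 0, last_out = 0; /* a clear pixel maps to itself */
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        if (p != last_in) {
            last_in = p;
            last_out = (p >> 24) == 0xff ? p : unpremultiply_inv(p);
        }
        dst[i] = last_out;
    }
}
```

Call `init_recip()` once before using `unpremultiply_inv()` or `unpremultiply_run()`. With this choice of rounding the table version agrees with the division version for every valid input, which may not hold for the rounding used in the actual sources.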
+
+Optimising for size is easy:
+
+-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
+-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
+-rw------- 1 rowan rowan  4224 2009-01-14 20:30 unpremultiply-sse2.o
+-rw------- 1 rowan rowan  3480 2009-01-14 20:30 unpremultiply-invb.o
+-rw------- 1 rowan rowan  2400 2009-01-14 20:30 unpremultiply-inv.o
+-rw------- 1 rowan rowan  2000 2009-01-14 20:30 unpremultiply-div.o
+
+When optimising for speed you can compare the versions from the
+relative time taken to unpremultiply variously sized buffers of data,
+where the baseline is the time taken to memcpy() the buffer.
+
+  Data |  Pixels |  copy |  sse2 |   invb |   lutb |    lut |    inv |     div
+       |         |       |       |        |        |        |        |
+random |     512 | 1.000 | 3.602 | 13.493 | 12.029 |  9.974 | 12.980 | 124.729
+random |    4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
+random |   32768 | 1.000 | 1.478 |  6.945 |  6.544 |  5.454 |  6.605 |  62.213
+random |  262144 | 1.000 | 0.959 |  3.355 |  3.128 |  2.653 |  3.193 |  29.377
+random | 2097152 | 1.000 | 1.012 |  3.538 |  3.297 |  2.799 |  3.365 |  30.819
+       |         |       |       |        |        |        |        |
+ solid |     512 | 1.000 | 3.589 |  2.628 |  2.624 |  9.829 | 12.978 | 124.918
+ solid |    4096 | 1.000 | 3.088 |  2.625 |  2.625 | 10.166 | 13.432 | 129.526
+ solid |   32768 | 1.000 | 1.114 |  1.199 |  1.313 |  3.935 |  4.927 |  46.612
+ solid |  262144 | 1.000 | 0.882 |  1.502 |  1.529 |  2.466 |  3.016 |  27.826
+ solid | 2097152 | 1.000 | 0.911 |  1.544 |  1.545 |  2.516 |  3.091 |  28.389
+       |         |       |       |        |        |        |        |
+ clear |     512 | 1.000 | 3.591 |  1.620 |  1.611 |  9.730 | 12.965 |   3.392
+ clear |    4096 | 1.000 | 3.098 |  1.571 |  1.570 | 10.077 | 13.434 |   3.487
+ clear |   32768 | 1.000 | 1.127 |  1.018 |  1.062 |  3.927 |  4.988 |   1.679
+ clear |  262144 | 1.000 | 0.876 |  1.424 |  1.429 |  2.369 |  2.936 |   1.508
+ clear | 2097152 | 1.000 | 0.985 |  1.662 |  1.652 |  2.723 |  3.373 |   1.688
+
+For busy images without large solid areas the best portable code
+would be lut if you can take the 64 KB table of baggage it
+comes with.  The inv routines don't deal correctly with superluminant
+pixels that overflow 8-bit components of the destination.  If you
+#define DO_CLAMP_INPUT to 1 then the troublesome pixel components will
+be saturated instead of overflowing.  There's a small penalty of about
+10% for that feature.
+
+The SSE2 version should be preferred for large images.  The reason
+it's not so great for small images is that it doesn't have the data
+specific fast paths and it uses nontemporal writes to store into the
+destination buffer.  The net effect of this is that the destination
+buffer is pushed out of the caches and through into main memory
+regardless of whether it could comfortably sit in cache all day long.
+For comparison the table below shows the numbers when forcing SSE2 to
+use movdqa instead of movntdq to write to the destination buffer:
+
+  Data |  Pixels |  copy |  sse2 |   invb |   lutb
+       |         |       |       |        |
+random |     512 | 1.000 | 2.797 | 12.040 | 11.873
+random |    4096 | 1.000 | 2.764 | 12.460 | 12.763
+random |   32768 | 1.000 | 1.577 |  6.033 |  6.218
+random |  262144 | 1.000 | 1.598 |  3.074 |  3.168
+random | 2097152 | 1.000 | 1.597 |  3.170 |  3.272
+       |         |       |       |        |
+ solid |     512 | 1.000 | 2.791 |  2.626 |  2.620
+ solid |    4096 | 1.000 | 2.763 |  2.628 |  2.627
+ solid |   32768 | 1.000 | 1.687 |  1.633 |  1.644
+ solid |  262144 | 1.000 | 1.523 |  1.649 |  1.636
+ solid | 2097152 | 1.000 | 1.575 |  1.721 |  1.665
+       |         |       |       |        |
+ clear |     512 | 1.000 | 2.800 |  1.614 |  1.610
+ clear |    4096 | 1.000 | 2.763 |  1.571 |  1.571
+ clear |   32768 | 1.000 | 1.707 |  1.549 |  1.833
+ clear |  262144 | 1.000 | 1.508 |  1.546 |  1.534
+ clear | 2097152 | 1.000 | 1.558 |  1.652 |  1.673
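The saturation that the README attributes to DO_CLAMP_INPUT could look something like the sketch below. Only the macro name comes from the README. The helper names, the pixel layout (alpha in the top byte) and the choice to clamp each component to alpha before scaling are assumptions made for illustration, and the real sources may saturate differently.

```c
#include <stdint.h>

#ifndef DO_CLAMP_INPUT
#define DO_CLAMP_INPUT 1        /* macro name taken from the README */
#endif

/* Hypothetical helper: unpremultiply one 8-bit component c by alpha a. */
static uint32_t unpremultiply_component(uint32_t c, uint32_t a)
{
    if (a == 0)
        return 0;               /* clear pixel */
#if DO_CLAMP_INPUT
    if (c > a)
        c = a;                  /* superluminant input saturates to 255 */
#endif
    return c * 255 / a;         /* may exceed 255 when clamping is off */
}

/* Hypothetical helper; assumed layout: alpha in bits 24-31, then R, G, B. */
static uint32_t unpremultiply_pixel(uint32_t p)
{
    uint32_t a = p >> 24;
    uint32_t r = unpremultiply_component((p >> 16) & 0xff, a);
    uint32_t g = unpremultiply_component((p >> 8) & 0xff, a);
    uint32_t b = unpremultiply_component(p & 0xff, a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

With alpha 0x40 and a red component of 0x80, the unclamped quotient would be 510, which does not fit in 8 bits and would spill into the neighbouring byte of the packed result. The clamp trades one extra comparison per component for a result that always fits.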