author M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> 2009-01-14 23:37:08 +0200
committer M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> 2009-01-14 23:37:08 +0200
commit 2336c8788dea7209e4bc582f0f7c8cf163edf589
tree fdb45c0e08a91d4664415679bbbcb3822afe1621
parent 4306a288dd0721604bc0b74b5a1deb959b4500f0
Added README
-rw-r--r-- README 109
1 files changed, 109 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..122621c
--- /dev/null
+++ b/README
@@ -0,0 +1,109 @@
+Unpremultipliers galore!
+------------------------
+
+Cairo and many other graphics libraries treat the colour components of
+a pixel in memory as being premultiplied by alpha, while many other
+systems such as SDL and OpenGL treat the components separately. This
+impedance mismatch is mostly annoying since it requires translating
+pixel buffers between representations when moving data between
+systems, but it can also be a performance bottleneck. Namely, going
+from premultiplied to normal representation requires dividing
+every colour component by alpha to get the
+"unpremultiplied" components.
+
+ unpremultiply-div.c: The reference unpremultiplier.
+ This one does three divisions per pixel.
+
+ unpremultiply-inv.c: Uses a 1 KB table of reciprocals.
+
+ unpremultiply-lut.c: Uses a full 64 KB lookup table. Does
+ free saturation of superluminant pixels.
+
+ unpremultiply-sse2.S: A blocked version using 3 KB of reciprocal
+ tables. For AMD64/SSE2.
+
+Since lots of images have a lot of runs of constant or solid pixels we
+should optimise for that case. At a small to medium cost (~10-20%)
+for the varying pixel case we can get a big speedup for the boring
+areas of the image.
+
+ unpremultiply-invb.c: ...-inv.c with fast paths for boring bits.
+
+ unpremultiply-lutb.c: ...-lut.c with fast paths for boring bits.
+
+There's no one true unpremultiplier among these, as the relative
+timings depend a lot on the type of image to unpremultiply, the
+memory subsystem of the machine (most notably cache sizes vs. image
+size vs. lookup table size), and whether superluminant input pixels
+are saturated on the output. Also, the space used by the different
+tables may be an issue for some programs.
+
+Optimising for size is easy:
+
+-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
+-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
+-rw------- 1 rowan rowan 4224 2009-01-14 20:30 unpremultiply-sse2.o
+-rw------- 1 rowan rowan 3480 2009-01-14 20:30 unpremultiply-invb.o
+-rw------- 1 rowan rowan 2400 2009-01-14 20:30 unpremultiply-inv.o
+-rw------- 1 rowan rowan 2000 2009-01-14 20:30 unpremultiply-div.o
+
+When optimising for speed you can compare the versions by the
+relative time taken to unpremultiply buffers of various sizes,
+where the baseline is the time taken to memcpy() the buffer.
+
+ Data | Pixels | copy | sse2 | invb | lutb | lut | inv | div
+ | | | | | | | |
+random | 512 | 1.000 | 3.602 | 13.493 | 12.029 | 9.974 | 12.980 | 124.729
+random | 4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
+random | 32768 | 1.000 | 1.478 | 6.945 | 6.544 | 5.454 | 6.605 | 62.213
+random | 262144 | 1.000 | 0.959 | 3.355 | 3.128 | 2.653 | 3.193 | 29.377
+random | 2097152 | 1.000 | 1.012 | 3.538 | 3.297 | 2.799 | 3.365 | 30.819
+ | | | | | | | |
+ solid | 512 | 1.000 | 3.589 | 2.628 | 2.624 | 9.829 | 12.978 | 124.918
+ solid | 4096 | 1.000 | 3.088 | 2.625 | 2.625 | 10.166 | 13.432 | 129.526
+ solid | 32768 | 1.000 | 1.114 | 1.199 | 1.313 | 3.935 | 4.927 | 46.612
+ solid | 262144 | 1.000 | 0.882 | 1.502 | 1.529 | 2.466 | 3.016 | 27.826
+ solid | 2097152 | 1.000 | 0.911 | 1.544 | 1.545 | 2.516 | 3.091 | 28.389
+ | | | | | | | |
+ clear | 512 | 1.000 | 3.591 | 1.620 | 1.611 | 9.730 | 12.965 | 3.392
+ clear | 4096 | 1.000 | 3.098 | 1.571 | 1.570 | 10.077 | 13.434 | 3.487
+ clear | 32768 | 1.000 | 1.127 | 1.018 | 1.062 | 3.927 | 4.988 | 1.679
+ clear | 262144 | 1.000 | 0.876 | 1.424 | 1.429 | 2.369 | 2.936 | 1.508
+ clear | 2097152 | 1.000 | 0.985 | 1.662 | 1.652 | 2.723 | 3.373 | 1.688
+
+For busy images without large solid areas the best portable code
+would be lut if you can take the 64 KB table of baggage it
+comes with. The inv routines don't deal correctly with superluminant
+pixels that overflow 8 bit components of the destination. If you
+#define DO_CLAMP_INPUT to 1 then the troublesome pixel components will
+be saturated instead of overflowing. There's a small penalty of about
+10% for that feature.
+
+The SSE2 version should be preferred for large images. The reason
+it's not so great for small images is that it doesn't have the data
+specific fast paths and it uses nontemporal writes to store into the
+destination buffer. The net effect of this is that the destination
+buffer is pushed out of the caches and through into main memory
+regardless of whether it could comfortably sit in cache all day long.
+For comparison the table below shows the numbers when forcing SSE2 to
+use movdqa instead of movntdq to write to the destination buffer:
+
+ Data | Pixels | copy | sse2 | invb | lutb
+ | | | | |
+random | 512 | 1.000 | 2.797 | 12.040 | 11.873
+random | 4096 | 1.000 | 2.764 | 12.460 | 12.763
+random | 32768 | 1.000 | 1.577 | 6.033 | 6.218
+random | 262144 | 1.000 | 1.598 | 3.074 | 3.168
+random | 2097152 | 1.000 | 1.597 | 3.170 | 3.272
+ | | | | |
+ solid | 512 | 1.000 | 2.791 | 2.626 | 2.620
+ solid | 4096 | 1.000 | 2.763 | 2.628 | 2.627
+ solid | 32768 | 1.000 | 1.687 | 1.633 | 1.644
+ solid | 262144 | 1.000 | 1.523 | 1.649 | 1.636
+ solid | 2097152 | 1.000 | 1.575 | 1.721 | 1.665
+ | | | | |
+ clear | 512 | 1.000 | 2.800 | 1.614 | 1.610
+ clear | 4096 | 1.000 | 2.763 | 1.571 | 1.571
+ clear | 32768 | 1.000 | 1.707 | 1.549 | 1.833
+ clear | 262144 | 1.000 | 1.508 | 1.546 | 1.534
+ clear | 2097152 | 1.000 | 1.558 | 1.652 | 1.673
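
To make the README's description concrete, here is a minimal sketch in C of three of the ideas it mentions: a division-based reference, a 1 KB reciprocal table, and a fast path for runs of identical pixels. It is not the code from this commit. The 0xAARRGGBB pixel layout, the 16.16 fixed-point table format, the rounding, the always-on clamping, and every function name below are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed pixel layout: 0xAARRGGBB with R, G, B premultiplied by A. */

/* 256 entries * 4 bytes = 1 KB of reciprocals (hypothetical format:
 * 255/a in 16.16 fixed point, rounded to nearest). */
static uint32_t recip[256];

static void init_recip(void)
{
    recip[0] = 0;                       /* alpha 0 maps everything to 0 */
    for (unsigned a = 1; a < 256; a++)
        recip[a] = (255u * 65536u + a / 2) / a;
}

/* Saturate superluminant results (component > alpha) to 255. */
static unsigned clamp255(unsigned v)
{
    return v > 255 ? 255 : v;
}

/* Reference version: three divisions per pixel. */
static uint32_t unpremul_div(uint32_t p)
{
    unsigned a = p >> 24;
    if (a == 0)
        return 0;
    unsigned r = clamp255((((p >> 16) & 0xff) * 255 + a / 2) / a);
    unsigned g = clamp255((((p >> 8) & 0xff) * 255 + a / 2) / a);
    unsigned b = clamp255(((p & 0xff) * 255 + a / 2) / a);
    return ((uint32_t)a << 24) | (r << 16) | (g << 8) | b;
}

/* Table version: one lookup and three multiplies per pixel.
 * The largest product, 255 * recip[1] + 0x8000, still fits in 32 bits. */
static uint32_t unpremul_inv(uint32_t p)
{
    unsigned a = p >> 24;
    uint32_t k = recip[a];
    unsigned r = clamp255((((p >> 16) & 0xff) * k + 0x8000) >> 16);
    unsigned g = clamp255((((p >> 8) & 0xff) * k + 0x8000) >> 16);
    unsigned b = clamp255(((p & 0xff) * k + 0x8000) >> 16);
    return ((uint32_t)a << 24) | (r << 16) | (g << 8) | b;
}

/* Buffer loop with a simple fast path: a pixel equal to the previous
 * one reuses the previous result, so constant runs skip the arithmetic. */
static void unpremul_buffer(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t last_in = 0, last_out = 0; /* a clear pixel stays clear */
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        if (p != last_in) {
            last_in = p;
            last_out = unpremul_inv(p);
        }
        dst[i] = last_out;
    }
}
```

Call init_recip() once before unpremul_inv() or unpremul_buffer(). The two per-pixel routines round slightly differently and are not guaranteed to agree bit for bit on every input, and the fast path shown is only one simple way to exploit constant runs; the invb and lutb files may well do something different.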