Added README

author: M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> 2009-01-14 23:37:08 +0200
committer: M Joonas Pihlaja <jpihlaja@cc.helsinki.fi> 2009-01-14 23:37:08 +0200
commit: 2336c8788dea7209e4bc582f0f7c8cf163edf589 (patch)
tree: fdb45c0e08a91d4664415679bbbcb3822afe1621
parent: 4306a288dd0721604bc0b74b5a1deb959b4500f0 (diff)
1 files changed, 109 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..122621c
--- /dev/null
+++ b/README
@@ -0,0 +1,109 @@
+Unpremultipliers galore!
+------------------------
+
+Cairo and many other graphics libraries treat the colour components of
+a pixel in memory as being premultiplied by alpha, while many other
+systems such as SDL and OpenGL treat the components separately.  This
+impedance mismatch is mostly annoying since it requires translating
+pixel buffers between representations when moving data between
+systems, but it can also be a performance bottleneck.  Namely, going
+from the premultiplied to the normal representation requires dividing
+every colour component by alpha to get the "unpremultiplied"
+components.
+
+ unpremultiply-div.c: The reference unpremultiplier.
+		      This one does three divisions per pixel.
+
+ unpremultiply-inv.c: Uses a 1 KB table of reciprocals.
+
+ unpremultiply-lut.c: Uses a full 64 KB lookup table.  Does
+		      free saturation of superluminant pixels.
+
+ unpremultiply-sse2.S: A blocked version using 3 KB of reciprocal
+		       tables. For AMD64/SSE2.
+
+Since lots of images have a lot of runs of constant or solid pixels we
+should optimise for that case.  At a small to medium cost (~10-20%)
+for the varying pixel case we can get a big speedup for the boring
+areas of the image.
+
+ unpremultiply-invb.c: ...-inv.c with fast paths for boring bits.
+
+ unpremultiply-lutb.c: ...-lut.c with fast paths for boring bits.
+
+There's no one true unpremultiplier among these, as the relative
+timings depend a lot on the type of image to unpremultiply, the
+memory subsystem of the machine (most notably cache sizes vs. image
+size vs. lookup table size), and whether superluminant input pixels
+are saturated on the output.  Also space used by the different
+tables may be an issue for some programs.
+
+Optimising for size is easy:
+
+-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
+-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
+-rw------- 1 rowan rowan  4224 2009-01-14 20:30 unpremultiply-sse2.o
+-rw------- 1 rowan rowan  3480 2009-01-14 20:30 unpremultiply-invb.o
+-rw------- 1 rowan rowan  2400 2009-01-14 20:30 unpremultiply-inv.o
+-rw------- 1 rowan rowan  2000 2009-01-14 20:30 unpremultiply-div.o
+
+When optimising for speed you can compare the versions by the
+relative time taken to unpremultiply buffers of various sizes,
+where the baseline is the time taken to memcpy() the buffer.
+
+  Data |  Pixels |  copy |  sse2 |   invb |   lutb |    lut |    inv |     div
+       |         |       |       |        |        |        |        |        
+random |     512 | 1.000 | 3.602 | 13.493 | 12.029 |  9.974 | 12.980 | 124.729
+random |    4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
+random |   32768 | 1.000 | 1.478 |  6.945 |  6.544 |  5.454 |  6.605 |  62.213
+random |  262144 | 1.000 | 0.959 |  3.355 |  3.128 |  2.653 |  3.193 |  29.377
+random | 2097152 | 1.000 | 1.012 |  3.538 |  3.297 |  2.799 |  3.365 |  30.819
+       |         |       |       |        |        |        |        |        
+ solid |     512 | 1.000 | 3.589 |  2.628 |  2.624 |  9.829 | 12.978 | 124.918
+ solid |    4096 | 1.000 | 3.088 |  2.625 |  2.625 | 10.166 | 13.432 | 129.526
+ solid |   32768 | 1.000 | 1.114 |  1.199 |  1.313 |  3.935 |  4.927 |  46.612
+ solid |  262144 | 1.000 | 0.882 |  1.502 |  1.529 |  2.466 |  3.016 |  27.826
+ solid | 2097152 | 1.000 | 0.911 |  1.544 |  1.545 |  2.516 |  3.091 |  28.389
+       |         |       |       |        |        |        |        |        
+ clear |     512 | 1.000 | 3.591 |  1.620 |  1.611 |  9.730 | 12.965 |   3.392
+ clear |    4096 | 1.000 | 3.098 |  1.571 |  1.570 | 10.077 | 13.434 |   3.487
+ clear |   32768 | 1.000 | 1.127 |  1.018 |  1.062 |  3.927 |  4.988 |   1.679
+ clear |  262144 | 1.000 | 0.876 |  1.424 |  1.429 |  2.369 |  2.936 |   1.508
+ clear | 2097152 | 1.000 | 0.985 |  1.662 |  1.652 |  2.723 |  3.373 |   1.688
+
+For busy images without large solid areas the best portable code
+would be lut, if you can take the 64 KB table of baggage it
+comes with.  The inv routines don't deal correctly with superluminant
+pixels that overflow 8-bit components of the destination.  If you
+#define DO_CLAMP_INPUT to 1 then the troublesome pixel components will
+be saturated instead of overflowing.  There's a small penalty of about
+10% for that feature.
+
+The SSE2 version should be preferred for large images.  The reason
+it's not so great for small images is that it lacks the data-specific
+fast paths and it uses nontemporal writes to store into the
+destination buffer.  The net effect of this is that the destination
+buffer is pushed out of the caches and through into main memory
+regardless of whether it could comfortably sit in cache all day long.
+For comparison the table below shows the numbers when forcing SSE2 to
+use movdqa instead of movntdq to write to the destination buffer:
+
+  Data |  Pixels |  copy |  sse2 |   invb |   lutb
+       |         |       |       |        |       
+random |     512 | 1.000 | 2.797 | 12.040 | 11.873
+random |    4096 | 1.000 | 2.764 | 12.460 | 12.763
+random |   32768 | 1.000 | 1.577 |  6.033 |  6.218
+random |  262144 | 1.000 | 1.598 |  3.074 |  3.168
+random | 2097152 | 1.000 | 1.597 |  3.170 |  3.272
+       |         |       |       |        |       
+ solid |     512 | 1.000 | 2.791 |  2.626 |  2.620
+ solid |    4096 | 1.000 | 2.763 |  2.628 |  2.627
+ solid |   32768 | 1.000 | 1.687 |  1.633 |  1.644
+ solid |  262144 | 1.000 | 1.523 |  1.649 |  1.636
+ solid | 2097152 | 1.000 | 1.575 |  1.721 |  1.665
+       |         |       |       |        |       
+ clear |     512 | 1.000 | 2.800 |  1.614 |  1.610
+ clear |    4096 | 1.000 | 2.763 |  1.571 |  1.571
+ clear |   32768 | 1.000 | 1.707 |  1.549 |  1.833
+ clear |  262144 | 1.000 | 1.508 |  1.546 |  1.534
+ clear | 2097152 | 1.000 | 1.558 |  1.652 |  1.673
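
Illustrative sketch (not part of the commit): the README above contrasts a
reference unpremultiplier that divides by alpha with one that multiplies by
a 1 KB table of reciprocals, and mentions fast paths for runs of identical
pixels.  The C below is a minimal sketch of those three ideas and is not
taken from unpremultiply-div.c, unpremultiply-inv.c or the "b" variants.
Everything in it is an assumption made for illustration: the function and
table names, the 32-bit ARGB layout with alpha in the top byte, the
round-to-nearest rule, the 16.16 fixed-point reciprocals, clamping of
superluminant components to 255, and the "remember the last pixel" fast
path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed pixel layout: 0xAARRGGBB, colour components premultiplied by alpha. */

static uint8_t clamp_u8(uint32_t v)
{
    return v > 255 ? 255 : (uint8_t)v;  /* saturate superluminant results */
}

/* Reference approach: one rounded division per colour component. */
uint32_t unpremultiply_pixel_div(uint32_t p)
{
    uint32_t a = p >> 24;
    if (a == 0)
        return 0;                       /* fully transparent: nothing to recover */
    uint32_t r = clamp_u8((((p >> 16) & 0xff) * 255 + a / 2) / a);
    uint32_t g = clamp_u8((((p >>  8) & 0xff) * 255 + a / 2) / a);
    uint32_t b = clamp_u8((( p        & 0xff) * 255 + a / 2) / a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Reciprocal table: 256 entries * 4 bytes = 1 KB, holding 255/a in 16.16
 * fixed point.  Entry 0 is 0 so that transparent pixels map to 0. */
static uint32_t recip[256];
static int recip_ready;                 /* lazy init; not thread-safe */

static void init_recip(void)
{
    recip[0] = 0;
    for (uint32_t a = 1; a < 256; a++)
        recip[a] = (255u * 65536u + a / 2) / a;
    recip_ready = 1;
}

/* Table approach: a multiply and a shift replace each division.
 * The largest product, 255 * recip[1] + 0x8000, still fits in 32 bits. */
uint32_t unpremultiply_pixel_inv(uint32_t p)
{
    if (!recip_ready)
        init_recip();
    uint32_t a = p >> 24;
    uint32_t k = recip[a];
    uint32_t r = clamp_u8((((p >> 16) & 0xff) * k + 0x8000) >> 16);
    uint32_t g = clamp_u8((((p >>  8) & 0xff) * k + 0x8000) >> 16);
    uint32_t b = clamp_u8((( p        & 0xff) * k + 0x8000) >> 16);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Buffer loop with a simple fast path: while the source pixel repeats,
 * reuse the previous result instead of converting again.  The initial
 * cache entry is valid because both converters map 0 to 0. */
void unpremultiply_buffer(uint32_t *dst, const uint32_t *src, size_t n,
                          uint32_t (*pixel_fn)(uint32_t))
{
    assert(dst != NULL && src != NULL);
    uint32_t last_in = 0, last_out = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] != last_in) {
            last_in = src[i];
            last_out = pixel_fn(last_in);
        }
        dst[i] = last_out;
    }
}
```

Because the table entries are rounded, the table version can differ from the
division version by one in the last bit for some inputs.  With the rounding
chosen here, alpha 14 and colour 7 gives 128 by division and 127 via the
table.  Whether that matters depends on the application; the files in this
commit may well round differently.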