Unpremultipliers galore!
------------------------
Cairo and many other graphics libraries treat the colour components of
a pixel in memory as being premultiplied by alpha, while many other
systems such as SDL and OpenGL treat the components separately. This
impedance mismatch is mostly annoying since it requires translating
pixel buffers between representations when moving data between
systems, but it can also be a performance bottleneck. Namely, going
from the premultiplied to the normal representation requires dividing
every colour component by alpha to get the "unpremultiplied"
components.

unpremultiply-div.c:    The reference unpremultiplier.
                        This one does three divisions per pixel.

unpremultiply-inv.c:    Uses a 1 KB table of reciprocals.

unpremultiply-lut.c:    Uses a full 64 KB lookup table. Does
                        free saturation of superluminant pixels.

unpremultiply-sse2.S:   A blocked version using 3 KB of reciprocal
                        tables. For AMD64/SSE2.
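
As a rough illustration of the first two approaches, here is a minimal
sketch of a division-based and a reciprocal-table unpremultiplier. It
is not the code from the files listed above: the 0xAARRGGBB pixel
layout, the 16.16 fixed-point reciprocals, the function names and the
unconditional saturation of overflowing components are all assumptions
made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed pixel layout for this sketch: 0xAARRGGBB in one 32-bit word. */

/* 256 entries * 4 bytes = 1 KB: recip[a] ~ 255/a in 16.16 fixed point. */
static uint32_t recip[256];

/* Must be called once before unpremultiply_inv(). */
void unpremultiply_inv_init(void)
{
    recip[0] = 0;                      /* alpha 0: every component -> 0 */
    for (uint32_t a = 1; a < 256; a++)
        recip[a] = (255u * 65536u + a / 2u) / a;
}

static uint32_t saturate(uint32_t v)
{
    return v > 255u ? 255u : v;
}

/* Reference flavour: three divisions per pixel. */
void unpremultiply_div(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i], a = p >> 24;
        if (a == 0) {                  /* avoid dividing by zero */
            dst[i] = 0;
            continue;
        }
        uint32_t r = saturate((((p >> 16) & 0xFFu) * 255u + a / 2u) / a);
        uint32_t g = saturate((((p >> 8) & 0xFFu) * 255u + a / 2u) / a);
        uint32_t b = saturate(((p & 0xFFu) * 255u + a / 2u) / a);
        dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}

/* Reciprocal flavour: one table load, then a multiply and a shift
 * per component instead of a division. */
void unpremultiply_inv(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i], a = p >> 24, k = recip[a];
        uint32_t r = saturate((((p >> 16) & 0xFFu) * k + 0x8000u) >> 16);
        uint32_t g = saturate((((p >> 8) & 0xFFu) * k + 0x8000u) >> 16);
        uint32_t b = saturate(((p & 0xFFu) * k + 0x8000u) >> 16);
        dst[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

Because each reciprocal is itself rounded, the table flavour of this
sketch can differ from the division by one for some inputs: a component
of 7 with alpha 14 comes out as 127 rather than 128.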

Since many images contain long runs of constant or solid pixels, we
should optimise for that case. At a small to medium cost (~10-20%)
for the varying pixel case we can get a big speedup for the boring
areas of the image.

unpremultiply-invb.c:   ...-inv.c with fast paths for boring bits.
unpremultiply-lutb.c:   ...-lut.c with fast paths for boring bits.
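
A fast path for boring pixels could look something like the sketch
below. It assumes that "boring" means fully opaque, fully transparent
or repeated pixels, and it uses a division for the slow path just to
stay self-contained; the real -invb and -lutb files may detect runs
quite differently. The 0xAARRGGBB layout and the function names are
assumptions too.

```c
#include <stddef.h>
#include <stdint.h>

/* Slow path: unpremultiply one 0xAARRGGBB pixel with alpha != 0. */
static uint32_t unpremultiply_pixel(uint32_t p)
{
    uint32_t a = p >> 24;
    uint32_t out = a << 24;
    for (int shift = 16; shift >= 0; shift -= 8) {
        uint32_t c = (p >> shift) & 0xFFu;
        uint32_t v = (c * 255u + a / 2u) / a;
        if (v > 255u)
            v = 255u;                  /* saturate superluminant input */
        out |= v << shift;
    }
    return out;
}

void unpremultiply_runs(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint32_t p = src[i];
        uint32_t a = p >> 24;
        uint32_t out;

        if (a == 255u)
            out = p;                   /* opaque: nothing to do */
        else if (a == 0u)
            out = 0;                   /* clear: no colour to recover */
        else
            out = unpremultiply_pixel(p);

        /* Reuse the result for every following identical pixel. */
        do {
            dst[i++] = out;
        } while (i < n && src[i] == p);
    }
}
```

The extra comparison per pixel is the price paid on varying data; on a
run, the conversion is done once and the rest is a compare and a store.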

There's no one true unpremultiplier among these, as the relative
timings depend a lot on the type of image to unpremultiply, the
memory subsystem of the machine (most notably cache sizes vs. image
size vs. lookup table size), and whether superluminant input pixels
are saturated on the output. The space used by the different tables
may also be an issue for some programs.

Optimising for size is easy:

-rw------- 1 rowan rowan 67960 2009-01-14 20:30 unpremultiply-lutb.o
-rw------- 1 rowan rowan 66888 2009-01-14 20:30 unpremultiply-lut.o
-rw------- 1 rowan rowan  4224 2009-01-14 20:30 unpremultiply-sse2.o
-rw------- 1 rowan rowan  3480 2009-01-14 20:30 unpremultiply-invb.o
-rw------- 1 rowan rowan  2400 2009-01-14 20:30 unpremultiply-inv.o
-rw------- 1 rowan rowan  2000 2009-01-14 20:30 unpremultiply-div.o

When optimising for speed you can compare the versions by the
relative time taken to unpremultiply buffers of various sizes, where
the baseline is the time taken to memcpy() the buffer.

Data   |  Pixels |  copy |  sse2 |   invb |   lutb |    lut |    inv |     div
       |         |       |       |        |        |        |        |
random |     512 | 1.000 | 3.602 | 13.493 | 12.029 |  9.974 | 12.980 | 124.729
random |    4096 | 1.000 | 3.111 | 13.949 | 12.727 | 10.672 | 13.438 | 129.371
random |   32768 | 1.000 | 1.478 |  6.945 |  6.544 |  5.454 |  6.605 |  62.213
random |  262144 | 1.000 | 0.959 |  3.355 |  3.128 |  2.653 |  3.193 |  29.377
random | 2097152 | 1.000 | 1.012 |  3.538 |  3.297 |  2.799 |  3.365 |  30.819
       |         |       |       |        |        |        |        |
solid  |     512 | 1.000 | 3.589 |  2.628 |  2.624 |  9.829 | 12.978 | 124.918
solid  |    4096 | 1.000 | 3.088 |  2.625 |  2.625 | 10.166 | 13.432 | 129.526
solid  |   32768 | 1.000 | 1.114 |  1.199 |  1.313 |  3.935 |  4.927 |  46.612
solid  |  262144 | 1.000 | 0.882 |  1.502 |  1.529 |  2.466 |  3.016 |  27.826
solid  | 2097152 | 1.000 | 0.911 |  1.544 |  1.545 |  2.516 |  3.091 |  28.389
       |         |       |       |        |        |        |        |
clear  |     512 | 1.000 | 3.591 |  1.620 |  1.611 |  9.730 | 12.965 |   3.392
clear  |    4096 | 1.000 | 3.098 |  1.571 |  1.570 | 10.077 | 13.434 |   3.487
clear  |   32768 | 1.000 | 1.127 |  1.018 |  1.062 |  3.927 |  4.988 |   1.679
clear  |  262144 | 1.000 | 0.876 |  1.424 |  1.429 |  2.369 |  2.936 |   1.508
clear  | 2097152 | 1.000 | 0.985 |  1.662 |  1.652 |  2.723 |  3.373 |   1.688

For busy images without large solid areas, the best portable code
would be lut, if you can take the 64 KB table of baggage it comes
with. The inv routines don't deal correctly with superluminant pixels
that overflow the 8 bit components of the destination. If you
#define DO_CLAMP_INPUT to 1 then the troublesome pixel components will
be saturated instead of overflowing. There's a small penalty of about
10% for that feature.
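
The reason saturation can be free with a full table is that every
entry can be clamped once, when the table is built, so the per-pixel
loop never has to check for overflow. A minimal sketch of that idea
follows; the 0xAARRGGBB layout, the rounding and the function names are
assumptions for the example, not taken from unpremultiply-lut.c.

```c
#include <stddef.h>
#include <stdint.h>

/* 256 * 256 bytes = 64 KB: lut[a][c] = min(255, round(255 * c / a)).
 * Row 0 stays all zero, so fully transparent pixels come out as 0. */
static uint8_t lut[256][256];

/* Must be called once before unpremultiply_lut(). */
void unpremultiply_lut_init(void)
{
    for (uint32_t a = 1; a < 256; a++) {
        for (uint32_t c = 0; c < 256; c++) {
            uint32_t v = (c * 255u + a / 2u) / a;
            lut[a][c] = (uint8_t)(v > 255u ? 255u : v); /* clamp here */
        }
    }
}

/* Three table lookups per pixel; no division and no overflow check. */
void unpremultiply_lut(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i], a = p >> 24;
        const uint8_t *row = lut[a];
        dst[i] = (a << 24)
               | ((uint32_t)row[(p >> 16) & 0xFFu] << 16)
               | ((uint32_t)row[(p >> 8) & 0xFFu] << 8)
               | (uint32_t)row[p & 0xFFu];
    }
}
```

With this layout a superluminant pixel such as 0x40FF8020 (red and
green larger than alpha) comes out as 0x40FFFF80.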

The SSE2 version should be preferred for large images. The reason
it's not so great for small images is that it doesn't have the
data-specific fast paths and it uses nontemporal writes to store into
the destination buffer. The net effect is that the destination buffer
is pushed out of the caches and through into main memory regardless
of whether it could comfortably sit in cache all day long.
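
To make the two kinds of store concrete, here is a small sketch using
the SSE2 intrinsics from <emmintrin.h> rather than assembly. It only
copies pixels, it does not unpremultiply them; the function name and
the flag are made up for the example, and the destination is assumed to
be 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n pixels into dst, which must be 16-byte aligned. */
void store_pixels(uint32_t *dst, const uint32_t *src, size_t n,
                  int nontemporal)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        if (nontemporal)
            /* streaming store (movntdq): bypasses the caches */
            _mm_stream_si128((__m128i *)(dst + i), v);
        else
            /* ordinary aligned store: goes through the caches */
            _mm_store_si128((__m128i *)(dst + i), v);
    }
    if (nontemporal)
        _mm_sfence();   /* order the streaming stores before later stores */
#else
    (void)nontemporal;  /* no SSE2: fall through to the scalar loop */
#endif
    for (; i < n; i++)  /* scalar tail, or everything without SSE2 */
        dst[i] = src[i];
}
```

Streaming stores pay off when the destination is too large to stay in
cache anyway; for a small buffer that is read again soon afterwards,
the ordinary store keeps the data close to hand.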

For comparison, the table below shows the numbers when the SSE2
version is forced to use movdqa instead of movntdq to write to the
destination buffer:

Data   |  Pixels |  copy |  sse2 |   invb |   lutb
       |         |       |       |        |
random |     512 | 1.000 | 2.797 | 12.040 | 11.873
random |    4096 | 1.000 | 2.764 | 12.460 | 12.763
random |   32768 | 1.000 | 1.577 |  6.033 |  6.218
random |  262144 | 1.000 | 1.598 |  3.074 |  3.168
random | 2097152 | 1.000 | 1.597 |  3.170 |  3.272
       |         |       |       |        |
solid  |     512 | 1.000 | 2.791 |  2.626 |  2.620
solid  |    4096 | 1.000 | 2.763 |  2.628 |  2.627
solid  |   32768 | 1.000 | 1.687 |  1.633 |  1.644
solid  |  262144 | 1.000 | 1.523 |  1.649 |  1.636
solid  | 2097152 | 1.000 | 1.575 |  1.721 |  1.665
       |         |       |       |        |
clear  |     512 | 1.000 | 2.800 |  1.614 |  1.610
clear  |    4096 | 1.000 | 2.763 |  1.571 |  1.571
clear  |   32768 | 1.000 | 1.707 |  1.549 |  1.833
clear  |  262144 | 1.000 | 1.508 |  1.546 |  1.534
clear  | 2097152 | 1.000 | 1.558 |  1.652 |  1.673