|
This reverts commit 70680f7c4f07e1b2c96a28be1f03be6a447d0b60.
Setting the environment variable PIXMAN_DISABLE=wholeops still allows
the separable bilinear scaling iterators to be used in benchmarks.
|
|
This patch adds the 'v' and 'h' modifiers to the command line
parsing logic, which can be used together with the '-b' option.
They enforce vertical-only or horizontal-only special cases of
interpolation when running the bilinear scaling benchmark.
The optimized implementations may have special shortcuts for
doing only vertical or only horizontal scaling. This change
allows these code paths to be benchmarked.
Also, instead of just a minimal nudge to the x-axis scaling
coefficient, apply a more sizeable nudge to the x- or y-axis
translation coefficients. With the older matrix variant, a
clever hack in the optimized code could deduce that the
matrix is in fact indistinguishable from the identity
transformation, which would be an undesired effect and an
opportunity to 'rig' benchmark scores.
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
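The effect of a translation nudge can be illustrated with a minimal sketch.
It assumes 16.16 fixed-point coordinates, a unit-scale transform and sampling
at pixel centres; the names fixed_16_16, FIXED_1 and sample_fraction are
hypothetical and are not taken from the benchmark source. Under those
assumptions the fractional part of the sampled coordinate, which determines
the interpolation weight, depends only on the translation, so nudging one
axis forces interpolation on that axis alone.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16;              /* hypothetical 16.16 fixed-point type */
#define FIXED_1 ((fixed_16_16) 1 << 16)   /* 1.0 in 16.16 */

/* Fractional part of the source coordinate sampled for destination pixel
 * index i, for a unit-scale transform with translation t.  Sampling at the
 * pixel centre and removing the half-pixel offset again leaves
 * i * FIXED_1 + t, so the fraction depends only on t. */
static uint32_t sample_fraction(int i, fixed_16_16 t)
{
    assert(i >= 0 && i < 16384);          /* keep i * FIXED_1 + t in range */
    return (uint32_t) (i * FIXED_1 + t) & 0xffff;
}
```

With a half-pixel nudge on x and none on y, every pixel gets a non-zero
horizontal weight and a zero vertical weight, so a matrix built this way
cannot be mistaken for the identity.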
|
|
Similar in concept to fast_composite_tiled_repeat(), this breaks up any
unscaled composites, where source/mask areas outside the bitmap grid are
not clipped, into a series of simpler composites (either bitmap to bitmap
or solid to bitmap). These simpler composites are likely to match
existing fast path implementations, and so should benefit all platforms.
This produces some significant speedups for some cairo-perf-trace tests.
For example, timings on ARMv6 (using Siarhei's trimmed traces) are:
Before:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.29.3
[ 0] image t-firefox-chalkboard 35.715 35.736 0.03% 6/6
After:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.29.3
[ 0] image t-firefox-chalkboard 9.254 9.261 0.15% 6/6
That's a speedup of 3.86x.
Also added a simple test program to check different repeat types.
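The splitting idea can be sketched in one dimension. This is an illustration
only, not the code from the patch: run_t, add_run and split_span are
hypothetical names, and a real implementation works on 2D rectangles and has
to take the repeat type into account. The sketch cuts a span that may extend
past the source bitmap into at most three runs, each of which lies wholly
inside or wholly outside the bitmap.

```c
#include <assert.h>

/* One run of a scanline; 'inside' is 1 where the source bitmap has pixels. */
typedef struct { int start; int length; int inside; } run_t;

static int add_run(run_t *runs, int n, int start, int stop, int inside)
{
    runs[n].start = start;
    runs[n].length = stop - start;
    runs[n].inside = inside;
    return n + 1;
}

/* Split the span [x, x + w) against a source covering [0, src_w).
 * Returns the number of runs written (0 to 3). */
static int split_span(int x, int w, int src_w, run_t runs[3])
{
    int end = x + w;
    int n = 0;
    int stop;

    assert(w >= 0 && src_w > 0);

    if (x < end && x < 0) {               /* part left of the bitmap */
        stop = end < 0 ? end : 0;
        n = add_run(runs, n, x, stop, 0);
        x = stop;
    }
    if (x < end && x < src_w) {           /* part overlapping the bitmap */
        stop = end < src_w ? end : src_w;
        n = add_run(runs, n, x, stop, 1);
        x = stop;
    }
    if (x < end)                          /* part right of the bitmap */
        n = add_run(runs, n, x, end, 0);
    return n;
}
```

The inside run corresponds to a bitmap-to-bitmap composite and the outside
runs to solid-to-bitmap composites, which is what lets each piece match a
simpler fast path.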
|
|
The previous implementations of DIV and MOD relied upon the built-in / and %
operators rounding toward zero. C99 guarantees this, but in C89 the rounding
direction is implementation-defined when the divisor and/or dividend is
negative, and I believe Pixman is still supposed to support C89.
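One way to avoid the dependency is to apply / and % to non-negative operands
only, where C89 and C99 agree. The following is a minimal sketch of that
technique, assuming a positive divisor; floor_div and floor_mod are
hypothetical names and this is not the macro text from the patch.

```c
#include <assert.h>

/* Division rounding toward negative infinity, for b > 0.
 * / is only ever applied to non-negative operands. */
static int floor_div(int a, int b)
{
    assert(b > 0);
    if (a >= 0)
        return a / b;
    /* -(a + 1) is non-negative and cannot overflow, even for INT_MIN */
    return -1 - (-(a + 1)) / b;
}

/* Remainder matching floor_div, always in the range [0, b). */
static int floor_mod(int a, int b)
{
    assert(b > 0);
    if (a >= 0)
        return a % b;
    return b - 1 - (-(a + 1)) % b;
}
```

For every a and positive b, a == floor_div(a, b) * b + floor_mod(a, b) holds,
regardless of how the compiler rounds negative quotients.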
|
|
The new system of bilinear fetchers outperforms the old fast paths by a
significant margin in nearly all cases. Since general_composite_rect() is
capable of synthesising any fast path using fetchers, it is advantageous
to simply remove all the old fast path code. This also simplifies the code
base considerably.
Benchmarks on Cortex-A7 (lowlevel-blt-bench -b) for the whole set of
affected operations follow:
src_8888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 35.9 0.03 65.2 1.43 100.00% +81.6%
L2 35.6 0.17 62.7 0.65 100.00% +76.3%
M 34.7 0.02 53.3 0.29 100.00% +53.7%
HT 26.5 0.14 33.1 0.21 100.00% +24.7%
VT 24.8 0.13 26.6 0.38 100.00% +7.6%
R 21.9 0.12 23.7 0.25 100.00% +8.3%
RT 12.0 0.21 9.2 0.08 100.00% -23.1%
src_8888_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 33.5 0.17 54.3 0.67 100.00% +62.0%
L2 33.3 0.18 52.8 1.04 100.00% +58.6%
M 32.9 0.00 44.9 0.52 100.00% +36.4%
HT 24.3 0.02 26.7 0.30 100.00% +9.8%
VT 23.1 0.28 20.9 0.24 100.00% -9.6%
R 19.6 0.24 18.9 0.20 100.00% -3.6%
RT 10.6 0.19 7.0 0.11 100.00% -34.1%
src_0565_x888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 25.3 0.14 50.6 0.77 100.00% +99.7%
L2 25.4 0.01 49.5 0.32 100.00% +95.0%
M 25.0 0.00 43.5 0.39 100.00% +73.9%
HT 21.7 0.22 29.7 0.32 100.00% +36.8%
VT 19.7 0.20 24.5 0.29 100.00% +24.4%
R 19.2 0.21 21.9 0.13 100.00% +13.9%
RT 11.9 0.20 9.2 0.16 100.00% -22.9%
src_0565_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 21.7 0.11 43.8 0.41 100.00% +102.5%
L2 21.5 0.13 43.3 0.39 100.00% +101.1%
M 21.3 0.15 38.4 0.34 100.00% +80.3%
HT 18.6 0.16 24.7 0.06 100.00% +32.6%
VT 17.4 0.17 20.4 0.24 100.00% +17.3%
R 16.9 0.18 18.5 0.33 100.00% +9.9%
RT 10.8 0.17 7.2 0.16 100.00% -33.3%
over_8888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 26.6 0.14 42.9 0.32 100.00% +61.5%
L2 26.4 0.12 41.7 0.32 100.00% +57.9%
M 24.7 0.20 31.8 0.27 100.00% +28.8%
HT 18.9 0.20 23.8 0.26 100.00% +25.7%
VT 16.5 0.18 17.9 0.21 100.00% +8.7%
R 15.0 0.17 16.5 0.19 100.00% +9.8%
RT 7.7 0.12 7.1 0.10 100.00% -7.9%
add_8888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 28.3 0.02 62.2 0.79 100.00% +119.6%
L2 28.1 0.02 58.9 0.42 100.00% +109.3%
M 26.2 0.01 43.4 0.38 100.00% +65.6%
HT 20.2 0.01 30.3 0.18 100.00% +50.2%
VT 18.3 0.01 21.1 0.12 100.00% +15.1%
R 16.6 0.01 19.4 0.09 100.00% +16.4%
RT 8.9 0.06 7.9 0.14 100.00% -10.6%
src_8888_8_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 21.9 0.10 34.9 0.26 100.00% +59.8%
L2 21.8 0.09 34.8 0.35 100.00% +59.8%
M 21.0 0.18 32.2 0.27 100.00% +53.1%
HT 17.2 0.01 21.7 0.22 100.00% +26.3%
VT 15.1 0.16 17.4 0.19 100.00% +14.6%
R 14.1 0.16 16.6 0.18 100.00% +17.5%
RT 7.9 0.12 6.9 0.09 100.00% -13.3%
src_8888_8_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 18.8 0.09 31.3 0.28 100.00% +66.4%
L2 18.7 0.09 31.1 0.24 100.00% +66.2%
M 18.3 0.14 29.1 0.15 100.00% +58.4%
HT 15.3 0.15 18.3 0.21 100.00% +19.3%
VT 13.9 0.14 14.9 0.13 100.00% +6.9%
R 12.8 0.13 13.9 0.06 100.00% +8.5%
RT 7.3 0.11 5.6 0.09 100.00% -23.2%
src_0565_8_x888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 19.7 0.10 30.3 0.20 100.00% +53.9%
L2 19.6 0.10 30.3 0.17 100.00% +54.4%
M 19.0 0.15 28.2 0.26 100.00% +48.9%
HT 16.1 0.17 20.1 0.24 100.00% +24.5%
VT 14.5 0.15 16.4 0.22 100.00% +13.6%
R 13.9 0.16 15.7 0.21 100.00% +13.1%
RT 8.0 0.13 6.8 0.12 100.00% -15.3%
src_0565_8_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 17.2 0.09 27.5 0.23 100.00% +59.8%
L2 17.1 0.10 27.3 0.18 100.00% +59.7%
M 16.6 0.15 25.8 0.14 100.00% +55.2%
HT 14.2 0.21 17.1 0.06 100.00% +20.4%
VT 13.0 0.24 14.4 0.18 100.00% +10.7%
R 12.4 0.24 13.4 0.16 100.00% +7.9%
RT 7.3 0.21 5.5 0.04 100.00% -23.8%
over_8888_8_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 25.5 0.10 30.3 0.37 100.00% +18.6%
L2 25.3 0.02 29.9 0.17 100.00% +18.0%
M 21.5 0.00 22.3 0.21 100.00% +4.0%
HT 16.6 0.01 17.2 0.17 100.00% +3.4%
VT 14.4 0.01 13.5 0.14 100.00% -6.6%
R 12.5 0.01 12.5 0.13 71.35% +0.3% (insignificant)
RT 5.9 0.05 5.4 0.08 100.00% -9.3%
add_8888_8_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 21.0 0.08 36.9 0.53 100.00% +75.6%
L2 20.9 0.01 36.1 0.28 100.00% +72.6%
M 18.9 0.00 25.4 0.21 100.00% +34.8%
HT 14.6 0.01 19.6 0.23 100.00% +34.9%
VT 12.8 0.01 14.9 0.18 100.00% +16.3%
R 11.7 0.01 13.8 0.16 100.00% +17.4%
RT 6.0 0.04 5.7 0.09 100.00% -4.4%
|
|
Adds a more sophisticated algorithm suitable for horizontal scale factors
less than 1 (i.e. enlargements) where blocks of contiguous source pixels
are loaded and format-converted prior to being picked. This produces a
significant speed boost where the pixel conversion overhead outweighs the
branch prediction penalties; in practice this appears to hold for the r5g6b5
fetcher only.
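The convert-then-pick idea can be sketched in plain C. This is an
illustration under stated assumptions rather than the assembly from the
patch: the block size of 8, the function names and the src_len parameter are
hypothetical, coordinates are 16.16 fixed point, and the caller is assumed to
keep every sample inside the source scanline.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 8   /* hypothetical number of source pixels converted at once */

/* Expand one r5g6b5 pixel to a8r8g8b8 by replicating the top bits. */
static uint32_t convert_0565_to_8888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

/* Nearest-neighbour fetch for an increment ux <= 1.0 (enlargement):
 * convert a contiguous block of source pixels first, then pick from it. */
static void fetch_nearest_0565(uint32_t *dst, int width,
                               const uint16_t *src, int src_len,
                               uint32_t x, uint32_t ux)
{
    uint32_t block[BLOCK];

    assert(ux > 0 && ux <= 0x10000);
    while (width > 0) {
        int base = (int) (x >> 16);
        int n = src_len - base < BLOCK ? src_len - base : BLOCK;
        int i;

        assert(n > 0);                    /* samples must stay inside src */
        for (i = 0; i < n; i++)           /* one conversion per source pixel */
            block[i] = convert_0565_to_8888(src[base + i]);
        while (width > 0 && (int) (x >> 16) - base < n) {
            *dst++ = block[(x >> 16) - base];
            x += ux;
            width--;
        }
    }
}
```

Because each source pixel is converted once however many times it is picked,
the saving grows with the enlargement factor and with the cost of the format
conversion.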
|
|
Introduces the infrastructure that will be used by ARMv7 bilinear scaled
fetcher routines, but doesn't actually utilise it yet. For simplicity, only
includes the macros where source pixels are picked before being
format-converted.
|
|
This is constrained to support X increments in the positive X direction only.
It also doesn't attempt to support any form of image repeat.
Here are some affine-bench results for a variety of horizontal and vertical
scaling factors.
Before:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 6.2 6.2 6.2 6.1 6.0
0.75 6.2 6.1 6.1 6.0 5.9
1.0 6.2 6.1 - 5.9 5.8
1.5 6.1 6.0 5.9 5.8 5.6
2.0 6.1 6.0 5.9 5.7 5.5
After:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 22.2 21.2 19.7 21.0 20.4
0.75 19.4 18.3 16.7 18.2 17.4
1.0 24.7 22.3 - 22.1 20.4
1.5 14.2 13.0 11.5 12.9 12.1
2.0 12.0 10.9 9.5 10.8 10.0
Improvement:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 +256.4% +242.8% +219.6% +246.6% +241.3%
0.75 +212.9% +197.8% +173.7% +203.4% +195.1%
1.0 +300.2% +265.9% - +273.4% +251.2%
1.5 +131.8% +115.6% +93.1% +123.2% +114.0%
2.0 +97.7% +82.9% +62.8% +91.0% +82.9%
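For reference, the per-pixel work of a bilinear fetcher can be sketched as
follows. This is a plain C illustration, not the optimised routine: the
function name and the 8-bit weights in the range 0..256 are assumptions made
for the example, and the weight precision actually used by the library may
differ.

```c
#include <assert.h>
#include <stdint.h>

/* Blend four neighbouring a8r8g8b8 pixels.  wx is the weight of the right
 * column and wy the weight of the bottom row, both in the range 0..256. */
static uint32_t bilinear_8888(uint32_t tl, uint32_t tr,
                              uint32_t bl, uint32_t br,
                              int wx, int wy)
{
    uint32_t result = 0;
    int shift;

    assert(wx >= 0 && wx <= 256 && wy >= 0 && wy <= 256);
    for (shift = 0; shift < 32; shift += 8) {
        /* horizontal pass on the top and bottom rows */
        uint32_t top = ((tl >> shift) & 0xff) * (uint32_t) (256 - wx)
                     + ((tr >> shift) & 0xff) * (uint32_t) wx;
        uint32_t bot = ((bl >> shift) & 0xff) * (uint32_t) (256 - wx)
                     + ((br >> shift) & 0xff) * (uint32_t) wx;
        /* vertical pass, then drop the 16 fractional bits */
        uint32_t v = (top * (uint32_t) (256 - wy) + bot * (uint32_t) wy) >> 16;
        result |= v << shift;
    }
    return result;
}
```

When wy is 0 only the top row contributes, and when wx is 0 only the left
column does, which is where horizontal-only and vertical-only shortcuts come
from.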
|
|
This is constrained to support X increments in the positive X direction only.
It also doesn't attempt to support any form of image repeat.
Here are some affine-bench results for a variety of horizontal and vertical
scaling factors.
Before:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 3.0 2.9 2.9 2.9 2.8
0.75 2.9 2.9 2.9 2.8 2.8
1.0 2.9 2.9 - 2.8 2.8
1.5 2.9 2.9 2.8 2.8 2.7
2.0 2.9 2.8 2.8 2.8 2.7
After:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 20.2 18.5 19.1 17.6 16.3
0.75 17.1 15.4 16.0 14.5 13.2
1.0 20.1 17.1 - 15.6 13.6
1.5 11.9 10.3 10.8 9.5 8.4
2.0 9.9 8.4 8.9 7.7 6.8
Improvement:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 +582.2% +530.7% +554.4% +514.7% +477.1%
0.75 +481.5% +427.7% +451.2% +410.3% +371.4%
1.0 +583.9% +486.9% - +453.3% +392.7%
1.5 +308.1% +258.7% +281.0% +240.5% +208.1%
2.0 +241.4% +196.9% +217.8% +179.9% +152.4%
|
|
This is constrained to support X increments in the positive X direction only.
It also doesn't attempt to support any form of image repeat.
Here are some affine-bench results for a variety of horizontal and vertical
scaling factors.
Before:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 3.9 3.8 3.7 3.6 3.4
0.75 3.8 3.8 3.7 3.5 3.3
1.0 3.8 3.7 - 3.5 3.3
1.5 3.7 3.6 3.5 3.3 3.1
2.0 3.6 3.5 3.4 3.2 3.0
After:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 20.8 19.3 18.8 19.6 18.4
0.75 17.8 16.2 15.7 16.6 15.2
1.0 21.3 18.4 - 18.9 16.9
1.5 12.5 11.1 10.5 11.4 10.2
2.0 10.5 9.1 8.7 9.4 8.3
Improvement:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 +436.3% +406.2% +402.8% +448.2% +437.9%
0.75 +364.2% +330.4% +324.7% +372.3% +354.5%
1.0 +461.5% +392.4% - +446.9% +415.9%
1.5 +236.3% +204.8% +198.8% +242.2% +224.1%
2.0 +187.0% +156.5% +154.3% +193.3% +177.1%
|
|
This is constrained to support X increments in the positive X direction only.
It also doesn't attempt to support any form of image repeat.
Here are some affine-bench results for a variety of horizontal and vertical
scaling factors.
Before:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 7.1 6.9 6.8 6.6 6.3
0.75 6.4 6.2 6.1 5.8 5.5
1.0 5.9 5.7 - 5.2 4.9
1.5 5.0 4.8 4.6 4.3 4.0
2.0 4.4 4.2 4.0 3.7 3.4
After:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 21.0 19.6 19.2 20.2 18.9
0.75 18.0 16.6 16.1 17.1 15.9
1.0 21.8 18.9 - 19.9 17.7
1.5 12.8 11.3 10.9 11.8 10.7
2.0 10.7 9.3 8.9 9.8 8.8
Improvement:
x increment 0.5 0.75 1.0 1.5 2.0
y increment
0.5 +196.7% +183.6% +181.8% +206.6% +198.4%
0.75 +182.2% +166.2% +164.0% +194.8% +185.8%
1.0 +271.7% +234.4% - +282.7% +257.9%
1.5 +154.6% +135.3% +134.3% +173.3% +164.8%
2.0 +144.1% +124.2% +123.3% +165.6% +155.5%
|
|
These complete the set of fast paths where currently pixman-fast-path.c
provides versions that get selected in preference to the armv6-optimised
scanline fetchers/combiners/writeback routines.
Because generation of these fast paths is macroised, the patch required
to add them is fairly simple.
lowlevel-blt-bench -n over_8888_8888:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 13.8 0.0 26.5 0.2 100.0% +91.7%
L2 9.4 0.2 22.9 0.4 100.0% +142.6%
M 8.6 0.0 23.8 0.0 100.0% +176.1%
HT 7.4 0.0 14.1 0.1 100.0% +91.2%
VT 7.3 0.0 13.4 0.1 100.0% +84.1%
R 7.0 0.0 13.0 0.1 100.0% +85.9%
RT 4.5 0.1 6.2 0.1 100.0% +36.6%
affine-bench * 0 0 1 over a8r8g8b8 a8r8g8b8:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 9.4 0.0 28.0 0.0 100.0% +197.4%
0.75 9.0 0.0 26.1 0.0 100.0% +190.2%
1.0 8.6 0.0 24.4 0.0 100.0% +184.6%
1.5 7.9 0.0 21.7 0.0 100.0% +173.4%
2.0 7.3 0.0 19.6 0.0 100.0% +166.6%
lowlevel-blt-bench -n src_x888_8888:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 108.6 2.0 66.3 0.9 100.0% -39.0%
L2 32.4 1.5 44.3 2.1 100.0% +36.8%
M 27.5 0.1 62.0 0.1 100.0% +125.6%
HT 20.3 0.1 28.7 0.2 100.0% +41.2%
VT 19.9 0.1 26.7 0.1 100.0% +34.4%
R 18.6 0.1 25.3 0.2 100.0% +36.3%
RT 8.7 0.1 9.8 0.2 100.0% +12.6%
affine-bench * 0 0 1 src x8r8g8b8 a8r8g8b8:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 45.2 0.0 97.2 0.1 100.0% +115.1%
0.75 35.9 0.1 76.7 0.1 100.0% +113.9%
1.0 29.6 0.1 61.1 0.1 100.0% +106.4%
1.5 21.4 0.0 52.7 0.1 100.0% +145.9%
2.0 16.7 0.0 43.0 0.1 100.0% +156.9%
lowlevel-blt-bench -n src_8888_0565:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 57.2 0.7 43.1 0.4 100.0% -24.7%
L2 23.0 1.0 32.8 1.0 100.0% +42.5%
M 24.8 0.0 42.2 0.0 100.0% +70.0%
HT 18.0 0.1 22.1 0.1 100.0% +22.5%
VT 17.1 0.1 21.0 0.1 100.0% +22.5%
R 16.5 0.1 20.0 0.1 100.0% +21.4%
RT 8.3 0.2 8.4 0.1 95.0% +1.0% (insignificant)
affine-bench * 0 0 1 src a8r8g8b8 r5g6b5:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 34.9 0.0 55.3 0.0 100.0% +58.7%
0.75 29.3 0.0 49.1 0.0 100.0% +67.4%
1.0 24.8 0.0 42.6 0.1 100.0% +71.6%
1.5 19.0 0.0 38.2 0.1 100.0% +100.7%
2.0 15.4 0.0 31.8 0.0 100.0% +107.1%
lowlevel-blt-bench -n over_8888_0565:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 9.8 0.0 15.3 0.1 100.0% +56.6%
L2 7.4 0.0 14.3 0.2 100.0% +91.7%
M 7.5 0.0 15.4 0.0 100.0% +106.0%
HT 6.5 0.0 10.1 0.0 100.0% +54.5%
VT 6.4 0.0 9.9 0.0 100.0% +54.6%
R 6.2 0.0 9.5 0.0 100.0% +52.1%
RT 4.2 0.0 4.6 0.1 100.0% +9.8%
affine-bench * 0 0 1 over a8r8g8b8 r5g6b5:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 8.0 0.0 17.3 0.0 100.0% +116.1%
0.75 7.8 0.0 16.5 0.0 100.0% +112.9%
1.0 7.5 0.0 15.7 0.0 100.0% +110.5%
1.5 7.0 0.0 14.8 0.0 100.0% +112.8%
2.0 6.5 0.0 13.7 0.0 100.0% +111.4%
|
|
This is adapted from the nearest scaled cover scanline fetcher, modified to
pack output data in 16-bit units.
lowlevel-blt-bench -n src_0565_0565:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 119.6 4.1 72.5 1.1 100.0% -39.4%
L2 45.2 1.4 55.4 2.0 100.0% +22.5%
M 47.1 0.1 71.3 0.1 100.0% +51.4%
HT 26.4 0.2 31.8 0.3 100.0% +20.3%
VT 25.0 0.2 30.0 0.3 100.0% +20.3%
R 22.6 0.2 27.6 0.2 100.0% +22.0%
RT 9.7 0.2 10.3 0.2 100.0% +5.6%
affine-bench * 0 0 1 src r5g6b5 r5g6b5:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 59.6 0.1 129.6 0.1 100.0% +117.2%
0.75 52.0 0.1 106.3 0.1 100.0% +104.6%
1.0 47.2 0.1 71.7 0.0 100.0% +52.0%
1.5 39.1 0.1 68.1 0.1 100.0% +74.2%
2.0 37.7 0.1 68.7 0.1 100.0% +82.2%
|
|
Without this patch, any such operations are matched against the fast path
implementation in pixman-fast-path.c before general_composite_rect(), so
we never get to use the armv6-optimised assembly fetcher routines.
This patch adds a C wrapper to the same assembly routine used for the
nearest-scaled-cover fetcher, adapted to perform a 2D plot rather than a
single scanline. The C is macroised so that later patches can use the same
approach to build more complex fast paths from combinations of armv6
fetcher/combiner/writeback routines in a similar manner to
general_composite_rect().
lowlevel-blt-bench -n src_8888_8888:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 117.2 1.6 79.2 1.1 100.0% -32.4%
L2 44.1 3.1 49.9 2.4 100.0% +13.2%
M 40.0 0.1 72.5 0.1 100.0% +81.4%
HT 20.1 0.1 29.5 0.3 100.0% +46.5%
VT 19.4 0.1 27.7 0.2 100.0% +42.7%
R 18.2 0.1 26.2 0.2 100.0% +44.1%
RT 8.7 0.2 10.0 0.2 100.0% +15.8%
affine-bench * 0 0 1 src a8r8g8b8 a8r8g8b8:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 46.6 0.1 110.5 0.1 100.0% +137.2%
0.75 39.1 0.1 88.5 0.1 100.0% +126.1%
1.0 36.3 0.2 71.7 0.1 100.0% +97.7%
1.5 26.7 0.1 55.3 0.1 100.0% +106.8%
2.0 19.9 0.0 43.5 0.0 100.0% +119.2%
|
|
This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are
benchmarks for src_8_8888, which uses it.
lowlevel-blt-bench -n :
Before After
Mean StdDev Mean StdDev Confidence Change
L1 15.1 0.1 55.5 0.3 100.0% +267.2%
L2 13.7 0.1 45.3 0.8 100.0% +230.0%
M 14.5 0.0 53.9 0.1 100.0% +272.5%
HT 8.3 0.0 21.2 0.2 100.0% +154.6%
VT 8.3 0.0 20.1 0.3 100.0% +141.7%
R 8.0 0.0 19.2 0.3 100.0% +140.5%
RT 3.6 0.0 6.8 0.1 100.0% +88.4%
affine-bench:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 17.2 0.0 76.5 0.1 100.0% +344.4%
0.75 16.7 0.0 67.1 0.1 100.0% +300.8%
1.0 16.4 0.0 54.3 0.1 100.0% +232.2%
1.5 15.7 0.0 52.4 0.1 100.0% +234.6%
2.0 14.8 0.0 50.8 0.1 100.0% +243.9%
|
|
This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are
benchmarks for add_x888_8888, which uses it.
lowlevel-blt-bench -n :
Before After
Mean StdDev Mean StdDev Confidence Change
L1 12.0 0.0 45.0 0.5 100.0% +275.3%
L2 9.2 0.1 30.4 1.2 100.0% +231.6%
M 8.6 0.0 27.8 0.1 100.0% +224.0%
HT 6.0 0.0 15.4 0.1 100.0% +158.5%
VT 5.9 0.0 14.5 0.1 100.0% +146.2%
R 5.7 0.0 14.1 0.1 100.0% +145.8%
RT 2.9 0.0 5.6 0.1 100.0% +91.4%
affine-bench:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 12.1 0.0 32.5 0.1 100.0% +169.6%
0.75 11.3 0.0 30.0 0.0 100.0% +165.1%
1.0 10.7 0.0 27.1 0.0 100.0% +153.7%
1.5 9.6 0.0 24.1 0.0 100.0% +151.6%
2.0 8.8 0.0 21.5 0.0 100.0% +145.1%
|
|
This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are
benchmarks for src_0565_8888, which uses it.
lowlevel-blt-bench -n :
Before After
Mean StdDev Mean StdDev Confidence Change
L1 9.0 0.0 34.4 0.3 100.0% +284.7%
L2 8.1 0.1 29.0 0.6 100.0% +258.7%
M 8.4 0.0 33.2 0.1 100.0% +297.6%
HT 5.8 0.0 16.5 0.3 100.0% +183.6%
VT 5.8 0.0 16.0 0.3 100.0% +175.6%
R 5.6 0.0 15.6 0.1 100.0% +175.5%
RT 3.0 0.0 6.0 0.2 100.0% +98.7%
affine-bench:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 11.2 0.0 52.0 0.1 100.0% +363.2%
0.75 10.9 0.0 41.3 0.1 100.0% +279.3%
1.0 10.6 0.0 33.4 0.1 100.0% +216.7%
1.5 10.0 0.0 32.3 0.1 100.0% +221.8%
2.0 9.4 0.0 31.7 0.0 100.0% +236.0%
|
|
This is constrained to support X increments in the positive X direction only,
so this means scaled images (except those reflected in the Y axis) plus
parallelogram transformations which preserve the direction of the X axis.
It also doesn't attempt to support any form of image repeat.
With this optimisation, some operations constructed from fetcher and combiner
calls using general_composite_rect() now outperform the versions constructed
from FAST_NEAREST macros in pixman-fast-path.c, but unfortunately the
FAST_NEAREST ones have higher priority in fast path lookup. Here are some
benchmarks for the in_reverse_8888_8888 operation, which is not affected by
this priority issue:
lowlevel-blt-bench -n :
Before After
Mean StdDev Mean StdDev Confidence Change
L1 10.2 0.0 27.1 0.2 100.0% +164.8%
L2 8.2 0.1 23.0 0.4 100.0% +179.2%
M 8.3 0.0 24.8 0.0 100.0% +200.3%
HT 5.5 0.0 12.7 0.0 100.0% +129.9%
VT 5.4 0.0 12.1 0.0 100.0% +123.2%
R 5.4 0.0 11.9 0.1 100.0% +122.7%
RT 2.8 0.0 5.4 0.1 100.0% +91.9%
affine-bench for 5 different scaling factors:
Before After
Mean StdDev Mean StdDev Confidence Change
0.5 11.1 0.0 28.3 0.0 100.0% +155.1%
0.75 10.5 0.0 26.4 0.0 100.0% +152.2%
1.0 9.9 0.0 24.6 0.0 100.0% +147.5%
1.5 9.0 0.0 21.8 0.0 100.0% +141.4%
2.0 8.3 0.0 19.7 0.0 100.0% +138.4%
|
|
There is a group of combiner types - SRC, OVER_REVERSE, IN, OUT and ADD -
where the source alpha affects only the destination alpha component. This
means that any fast path with a8r8g8b8 source and destination can also be
applied to an equivalent operation with x8r8g8b8 source and destination
just by updating the fast path table, and likewise with a8b8g8r8 and
x8b8g8r8. The following operations are affected:
add_x888_8_x888 (and bilinear scaled version of same)
add_x888_8888_x888
add_x888_n_x888
add_x888_x888 (and bilinear scaled version of same)
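The property can be checked on a single pixel with a minimal sketch of the
ADD combiner. This is plain C written for illustration, with hypothetical
function names; it is not the fast path code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add of two 8-bit channel values. */
static uint32_t add_un8(uint32_t a, uint32_t b)
{
    uint32_t sum;

    assert(a <= 0xff && b <= 0xff);
    sum = a + b;
    return sum > 0xff ? 0xff : sum;
}

/* ADD combiner on one a8r8g8b8 pixel: each channel is added independently. */
static uint32_t add_8888(uint32_t src, uint32_t dst)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
        result |= add_un8((src >> shift) & 0xff, (dst >> shift) & 0xff) << shift;
    return result;
}
```

Whatever sits in the top byte of the source and destination, the low 24 bits
of the result are the same, so the routine written for a8r8g8b8 gives correct
colour values when both images are x8r8g8b8 and the top byte is ignored.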
|
|
There is a group of combiner types - SRC, OVER, IN_REVERSE, OUT_REVERSE
and ADD - where the destination alpha component is only used (if at all) to
determine the destination alpha component. This means that any such fast
paths with an a8r8g8b8 destination can also be applied to an x8r8g8b8
destination just by updating the fast path table, and likewise with
a8b8g8r8 and x8b8g8r8. The following operations are affected:
over_8888_8888_x888
add_n_8_x888
add_8888_8_x888
add_8888_8888_x888
add_8888_n_x888
add_8888_x888
out_reverse_8_x888
|
|
This is tuned for Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 104.6 0.5 119.4 0.1 100.0% +14.1%
L2 106.8 0.6 121.4 0.1 100.0% +13.6%
M 100.3 1.3 116.4 0.0 100.0% +16.0%
HT 64.5 1.0 70.8 0.1 100.0% +9.8%
VT 56.0 0.8 62.2 0.1 100.0% +11.1%
R 54.1 0.9 55.2 0.0 100.0% +1.9%
RT 24.6 0.5 26.6 0.0 100.0% +8.3%
|
|
lowlevel-blt-bench results:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 18.8 0.1 63.9 0.9 100.0% +239.0%
L2 16.0 0.4 58.5 1.3 100.0% +265.8%
M 13.1 0.0 56.8 0.1 100.0% +332.6%
HT 11.6 0.0 31.3 0.3 100.0% +169.6%
VT 11.4 0.0 27.2 0.2 100.0% +139.2%
R 11.0 0.1 28.2 0.2 100.0% +156.1%
RT 6.8 0.1 12.9 0.2 100.0% +89.0%
|
|
lowlevel-blt-bench results:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 5.7 0.0 20.6 0.1 100.0% +263.8%
L2 4.9 0.0 17.4 0.3 100.0% +254.0%
M 4.8 0.0 19.9 0.0 100.0% +312.9%
HT 4.5 0.0 12.4 0.1 100.0% +175.4%
VT 4.5 0.0 12.0 0.0 100.0% +168.9%
R 4.3 0.0 11.4 0.1 100.0% +163.3%
RT 2.9 0.0 6.0 0.1 100.0% +106.9%
|
|
lowlevel-blt-bench results:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 5.2 0.0 20.0 0.2 100.0% +281.7%
L2 4.5 0.0 16.2 0.2 100.0% +256.9%
M 4.5 0.0 18.8 0.0 100.0% +321.1%
HT 3.9 0.0 10.9 0.0 100.0% +177.6%
VT 3.9 0.0 10.6 0.0 100.0% +171.5%
R 3.8 0.0 10.0 0.0 100.0% +165.1%
RT 2.3 0.0 4.9 0.1 100.0% +107.7%
|
|
This is tuned for the Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 146.0 0.7 231.4 1.2 100.0% +58.5%
L2 143.1 0.9 222.1 1.7 100.0% +55.3%
M 110.9 0.0 129.0 0.5 100.0% +16.3%
HT 57.3 0.6 73.0 0.3 100.0% +27.4%
VT 46.6 0.5 61.6 0.4 100.0% +32.3%
R 42.3 0.2 51.7 0.2 100.0% +22.2%
RT 19.1 0.1 21.0 0.1 100.0% +9.9%
|
|
This is used instead of the equivalent C fast path.
lowlevel-blt-bench results, compared to no fast path at all:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 12.4 0.1 117.5 2.3 100.0% +851.2%
L2 9.5 0.1 46.9 2.4 100.0% +393.8%
M 9.6 0.0 61.9 0.9 100.0% +544.0%
HT 7.9 0.0 26.6 0.5 100.0% +238.6%
VT 7.7 0.0 24.2 0.4 100.0% +212.5%
R 7.4 0.0 22.4 0.4 100.0% +204.5%
RT 4.1 0.0 8.7 0.2 100.0% +109.4%
|
|
This is a C fast path, useful for reference or for platforms that don't
have their own fast path for this operation.
lowlevel-blt-bench results on ARMv6:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 12.4 0.1 24.4 0.3 100.0% +97.8%
L2 9.5 0.1 14.1 0.2 100.0% +48.1%
M 9.6 0.0 14.7 0.0 100.0% +53.1%
HT 7.9 0.0 12.0 0.1 100.0% +52.3%
VT 7.7 0.0 11.6 0.1 100.0% +49.8%
R 7.4 0.0 10.8 0.1 100.0% +47.2%
RT 4.1 0.0 6.1 0.1 100.0% +48.2%
|
|
This is used instead of the equivalent C fast path.
lowlevel-blt-bench results, compared to no fast path at all:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 8.2 0.0 38.7 0.5 100.0% +372.7%
L2 7.9 0.1 37.6 0.5 100.0% +376.8%
M 7.3 0.0 38.5 0.1 100.0% +425.6%
HT 6.9 0.0 26.1 0.3 100.0% +279.9%
VT 6.8 0.0 24.5 0.3 100.0% +258.0%
R 6.6 0.1 23.6 0.2 100.0% +255.1%
RT 4.5 0.1 10.9 0.2 100.0% +143.1%
|
|
This is a C fast path, useful for reference or for platforms that don't
have their own fast path for this operation.
lowlevel-blt-bench results on ARMv6:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 8.2 0.0 11.3 0.1 100.0% +38.6%
L2 7.9 0.1 10.5 0.0 100.0% +33.3%
M 7.3 0.0 10.0 0.0 100.0% +36.7%
HT 6.9 0.0 9.2 0.0 100.0% +33.3%
VT 6.8 0.0 9.0 0.0 100.0% +32.1%
R 6.6 0.1 8.8 0.0 100.0% +31.8%
RT 4.5 0.1 6.3 0.1 100.0% +39.7%
|
|
This is a C fast path, useful for reference or for platforms that don't
have their own fast path for this operation.
lowlevel-blt-bench results on ARMv6:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 11.9 0.1 20.4 0.2 100.0% +71.1%
L2 10.6 0.2 16.5 0.4 100.0% +55.8%
M 9.4 0.0 13.5 0.0 100.0% +44.3%
HT 8.4 0.0 12.2 0.1 100.0% +43.9%
VT 8.3 0.0 11.9 0.1 100.0% +42.7%
R 8.1 0.0 11.5 0.1 100.0% +41.3%
RT 5.4 0.1 7.6 0.1 100.0% +40.3%
|
|
lowlevel-blt-bench results for an example operation, src_1555_0565:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 85.8 2.12 114.0 1.65 100.00% +32.9%
L2 83.7 0.96 106.0 1.01 100.00% +26.7%
M 76.4 0.66 94.8 0.98 100.00% +24.0%
HT 39.8 0.37 38.9 0.29 100.00% -2.3%
VT 37.0 0.36 34.1 0.24 100.00% -7.7%
R 33.9 0.37 30.3 0.24 100.00% -10.5%
RT 14.7 0.20 11.5 0.11 100.00% -21.7%
|
|
lowlevel-blt-bench results on Cortex-A7 for a couple of sample operations
that utilise these fetchers are below.
add_0565_8888:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 75.4 0.38 147.5 0.90 100.00% +95.7%
L2 72.3 0.36 129.3 0.57 100.00% +79.0%
M 64.4 0.05 94.6 0.90 100.00% +46.8%
HT 35.8 0.03 42.3 0.26 100.00% +18.1%
VT 29.9 0.04 34.3 0.31 100.00% +14.5%
R 26.1 0.02 28.6 0.11 100.00% +9.4%
RT 12.2 0.06 13.1 0.15 100.00% +7.9%
add_1555_8888:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 73.3 0.38 160.7 0.89 100.00% +119.2%
L2 69.8 0.08 139.1 0.74 100.00% +99.4%
M 62.2 0.03 100.4 0.76 100.00% +61.4%
HT 35.1 0.03 42.9 0.42 100.00% +22.1%
VT 29.5 0.03 34.7 0.33 100.00% +17.8%
R 25.8 0.02 28.7 0.27 100.00% +11.4%
RT 12.1 0.02 13.2 0.15 100.00% +8.5%
---
For the record, I tried writing an a8 fetcher, but benchmarking indicated that
it couldn't improve upon the ARMv6 a8 fetcher results.
I also tried adding prefetch to the above fetchers - since they are the
first iterator in a chain and won't benefit from write-allocate caches, you
might think that this would help. Benchmarking indicated otherwise.
|
|
This is tuned for Cortex-A7 (Raspberry Pi 2).
lowlevel-blt-bench results, compared to the ARMv6 fast path:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 88.6 0.2 221.3 0.5 100.0% +149.7%
L2 88.1 0.4 219.2 0.8 100.0% +148.9%
M 87.9 0.1 178.2 0.1 100.0% +102.6%
HT 59.7 0.4 72.0 0.2 100.0% +20.7%
VT 53.2 0.4 69.8 0.2 100.0% +31.3%
R 48.5 0.3 53.6 0.1 100.0% +10.6%
RT 21.2 0.1 23.0 0.1 100.0% +8.5%
|
|
lowlevel-blt-bench results, compared to using the armv6 1555 fetcher:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 57.0 1.1 70.1 0.6 100.0% +23.1%
L2 41.4 1.0 44.1 1.4 100.0% +6.3%
M 49.8 0.1 59.0 0.2 100.0% +18.5%
HT 21.4 0.3 32.3 0.3 100.0% +50.9%
VT 21.0 0.3 30.2 0.3 100.0% +43.8%
R 19.7 0.2 27.0 0.2 100.0% +37.4%
RT 7.0 0.2 10.9 0.3 100.0% +56.6%
|
|
This supports x8r8g8b8 source images.
lowlevel-blt-bench results for src_x888_8888 with PIXMAN_DISABLE=wholeops
on a Raspberry Pi 1:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 55.5 0.98 147.5 5.82 100.00% +165.8%
L2 25.2 0.84 46.7 2.83 100.00% +85.5%
M 27.8 0.15 57.5 0.06 100.00% +106.7%
HT 14.5 0.10 24.2 0.19 100.00% +66.8%
VT 14.2 0.11 23.2 0.20 100.00% +63.0%
R 13.5 0.07 22.0 0.24 100.00% +63.3%
RT 5.5 0.05 7.8 0.24 100.00% +41.8%
lowlevel-blt-bench results for src_x888_8888 with PIXMAN_DISABLE=wholeops
on a Raspberry Pi 2 (ARMv7):
Before After
Mean StdDev Mean StdDev Confidence Change
L1 135.8 2.43 236.4 6.68 100.00% +74.0%
L2 122.8 1.09 201.4 2.01 100.00% +64.1%
M 94.1 1.15 145.2 0.59 100.00% +54.3%
HT 41.1 0.53 52.4 0.38 100.00% +27.5%
VT 36.5 0.53 51.7 0.38 100.00% +41.7%
R 30.3 0.42 40.9 0.29 100.00% +34.7%
RT 13.7 0.24 17.5 0.25 100.00% +28.2%
The before case was using the fetcher iterator defined in pixman-access.c.
Note that it does not appear to be worthwhile to create an additional ARMv7
version of this fetcher. If we construct one using the src_x888_8888 macros
the results are as follows on a Raspberry Pi 2:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 236.4 6.68 259.0 3.58 100.00% +9.6%
L2 201.4 2.01 209.8 2.17 100.00% +4.2%
M 145.2 0.59 139.4 1.06 100.00% -4.0%
HT 52.4 0.38 51.4 0.56 100.00% -1.9%
VT 51.7 0.38 47.8 0.86 100.00% -7.6%
R 40.9 0.29 35.3 0.40 100.00% -13.5%
RT 17.5 0.25 16.5 0.26 100.00% -6.2%
|
|
This supports a1r5g5b5 source images.
lowlevel-blt-bench results for src_1555_8888, which does not yet have a
dedicated fast path:
Before After
Mean StdDev Mean StdDev Confidence Change
L1 24.5 0.2 57.0 1.1 100.0% +132.2%
L2 19.3 0.4 41.4 1.0 100.0% +114.3%
M 20.4 0.0 49.8 0.1 100.0% +144.7%
HT 12.8 0.1 21.4 0.3 100.0% +67.0%
VT 12.7 0.1 21.0 0.3 100.0% +65.4%
R 12.1 0.1 19.7 0.2 100.0% +63.1%
RT 5.6 0.1 7.0 0.2 100.0% +24.8%
|
|
This supports r5g6b5 source and destination images, and a8 source images.
lowlevel-blt-bench results for example operations which use these because
they lack a dedicated fast path at the time of writing:
in_reverse_8_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 30.0 0.3 37.0 0.3 100.0% +23.2%
L2 23.3 0.3 29.4 0.4 100.0% +26.1%
M 24.0 0.0 31.3 0.1 100.0% +30.5%
HT 12.8 0.1 16.1 0.1 100.0% +25.8%
VT 11.9 0.1 14.8 0.1 100.0% +24.6%
R 11.7 0.1 14.6 0.1 100.0% +24.5%
RT 5.1 0.1 6.2 0.1 100.0% +20.2%
in_0565_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 22.0 0.1 28.3 0.2 100.0% +28.4%
L2 16.6 0.2 23.6 0.3 100.0% +42.2%
M 16.5 0.0 24.7 0.1 100.0% +49.5%
HT 11.0 0.1 13.7 0.1 100.0% +24.4%
VT 10.7 0.0 13.1 0.1 100.0% +22.0%
R 10.3 0.0 12.6 0.1 100.0% +22.5%
RT 5.3 0.1 5.7 0.1 100.0% +9.0%
in_reverse_8888_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 16.6 0.1 20.9 0.1 100.0% +25.5%
L2 13.1 0.1 17.7 0.3 100.0% +35.3%
M 13.2 0.0 19.2 0.0 100.0% +45.3%
HT 9.6 0.0 11.7 0.1 100.0% +21.8%
VT 9.3 0.0 11.4 0.1 100.0% +22.4%
R 9.0 0.0 10.9 0.1 100.0% +21.1%
RT 4.7 0.1 5.2 0.1 100.0% +8.7%
|
|
In common with the ARMv6 version of this combiner, this code features a
shortcut for the case where the destination is opaque. Without that, the
NEON version performs significantly worse than the ARMv6 version (though it
must be noted that the effect of repeated application of the OVER_REVERSE
operator is to set the destination opaque, so lowlevel-blt-bench is perhaps
not best representing real-world usage in this case).
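The shortcut follows directly from the operator definition, as a scalar
sketch shows. This is plain C for illustration with hypothetical function
names, assuming premultiplied a8r8g8b8 pixels; the NEON code processes
several pixels at a time.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values. */
static uint32_t mul_un8(uint32_t a, uint32_t b)
{
    uint32_t t;

    assert(a <= 0xff && b <= 0xff);
    t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* OVER_REVERSE on one pixel: dst = dst + src * (1 - dst_alpha). */
static uint32_t over_reverse_8888(uint32_t src, uint32_t dst)
{
    uint32_t da = dst >> 24;
    uint32_t result = 0;
    int shift;

    if (da == 0xff)
        return dst;       /* opaque destination: the source adds nothing */

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t v = d + mul_un8(s, 0xff - da);

        if (v > 0xff)
            v = 0xff;
        result |= v << shift;
    }
    return result;
}
```

Compositing an opaque source once leaves the destination alpha at 0xff, so
every later pass over the same pixels takes the shortcut; this is why a
benchmark that repeats the operation mostly measures the shortcut path.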
lowlevel-blt-bench results for over_reverse_0565_8888 (compared to ARMv6
version):
Before After
Mean StdDev Mean StdDev Confidence Change
L1 73.4 0.21 77.9 0.40 100.00% +6.2%
L2 72.8 0.18 76.0 0.40 100.00% +4.4%
M 66.3 0.02 70.1 0.67 100.00% +5.8%
HT 34.0 0.19 31.0 0.38 100.00% -9.0%
VT 30.2 0.16 27.4 0.35 100.00% -9.1%
R 28.5 0.16 23.4 0.32 100.00% -17.9%
RT 12.4 0.10 10.5 0.17 100.00% -15.2%
lowlevel-blt-bench results for over_reverse_0565_8_8888 (compared to ARMv6
version):
Before After
Mean StdDev Mean StdDev Confidence Change
L1 60.0 0.20 65.4 0.29 100.00% +9.0%
L2 59.1 0.18 63.4 0.38 100.00% +7.2%
M 50.3 0.24 55.8 0.09 100.00% +10.9%
HT 24.1 0.15 22.4 0.12 100.00% -7.1%
VT 20.8 0.12 19.6 0.13 100.00% -5.6%
R 19.6 0.13 17.2 0.01 100.00% -12.4%
RT 8.2 0.06 7.5 0.05 100.00% -8.2%
It's notable that the comparative performance depends heavily upon the
rectangle size - not surprising since one of the main features of NEON is
the ability to work on larger blocks of data at once, which is mainly a
benefit to large data sets, and the larger granularity works against it for
smaller data sets. Comments welcome on whether it would be desirable to select
between ARMv6 and ARMv7 implementations at runtime based upon the rectangle
size.
|