~siamashka/pixman

Age	Commit message (Collapse)	Author	Files	Lines
2016-04-05	Revert "armv7: Retire old bilinear fast paths" [20160405-arm-neon-release1-from-bavison]	Siarhei Siamashka	4	-4/+2042
	This reverts commit 70680f7c4f07e1b2c96a28be1f03be6a447d0b60. The environment variable PIXMAN_DISABLE=wholeops still allows the separable bilinear scaling iterators to be used in benchmarks.
2016-04-05	lowlevel-blt-bench: horizontal/vertical variants of bilinear scaling	Siarhei Siamashka	1	-10/+57
	This patch adds the 'v' and 'h' modifiers to the command line parsing logic, which can be used together with the '-b' option. They enforce the vertical-only or horizontal-only special cases of interpolation when running the bilinear scaling benchmark. Optimized implementations may have special shortcuts for doing only vertical or only horizontal scaling, and this change makes it possible to benchmark those code paths. Also, instead of just a minimal nudge to the x-axis scaling coefficient, apply a more sizeable nudge to the x- or y-axis translation coefficients. With the older matrix variant, a clever hack in the optimized code could deduce that the matrix is in fact indistinguishable from the identity transformation, which would be an undesired effect and an opportunity to 'rig' benchmark scores. Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-10-15	pixman-fast-path: Add fast path for "pad" type repeats	Ben Avison	3	-0/+371
	Similar in concept to fast_composite_tiled_repeat(), this breaks up any unscaled composites, where source/mask areas outside the bitmap grid are not clipped, into a series of simpler composites (either bitmap to bitmap or solid to bitmap). These simpler composites are likely to match existing fast path implementations, and so should benefit all platforms. This produces significant speedups for some cairo-perf-trace tests. For example, timings on ARMv6 (using Siarhei's trimmed traces) are
	Before:
	[ # ]  backend                  test   min(s) median(s) stddev. count
	[ # ]    image: pixman 0.29.3
	[  0]    image  t-firefox-chalkboard   35.715    35.736   0.03%   6/6
	After:
	[ # ]  backend                  test   min(s) median(s) stddev. count
	[ # ]    image: pixman 0.29.3
	[  0]    image  t-firefox-chalkboard    9.254     9.261   0.15%   6/6
	That's a speedup of 3.86x. Also added a simple test program to check different repeat types.
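The splitting idea described above can be sketched in one dimension. This is only an illustration under stated assumptions, not the code from the patch: the names `pad_split_t` and `split_pad_span` are hypothetical, and the sketch handles a single unscaled scanline, where every pixel left of the bitmap samples column 0 and every pixel right of it samples the last column, so both pad segments reduce to solid fills.

```c
#include <assert.h>

/* Hypothetical helper type: how one unscaled scanline span divides up
 * under PAD repeat. */
typedef struct {
    int left;    /* pixels before the bitmap: all sample column 0 */
    int inside;  /* pixels that map 1:1 onto the bitmap */
    int right;   /* pixels past the bitmap: all sample the last column */
} pad_split_t;

/* Split the span [x, x + width), given in source coordinates, against a
 * bitmap that covers columns [0, src_width). */
static pad_split_t split_pad_span(int x, int width, int src_width)
{
    pad_split_t s = {0, 0, 0};
    int end = x + width;

    assert(width >= 0 && src_width > 0);

    if (x < 0)
        s.left = (end < 0 ? end : 0) - x;
    if (end > src_width)
        s.right = end - (x > src_width ? x : src_width);
    s.inside = width - s.left - s.right;
    return s;
}
```

For a span starting at x = -3 with width 10 over a 4-pixel-wide bitmap, this yields 3 left pad pixels, 4 bitmap pixels and 3 right pad pixels; each piece could then be handed to a simpler solid or bitmap composite.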
2015-10-15	Resolve implementation-defined behaviour for division rounded to -infinity	Ben Avison	1	-4/+4
	The previous implementations of DIV and MOD relied upon the built-in / and % operators performing round-to-zero. This is true for C99, but rounding is implementation-defined for C89 when divisor and/or dividend is negative, and I believe Pixman is still supposed to support C89.
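One way to obtain division rounded towards -infinity without depending on how the compiler rounds negative operands is to pass only non-negative values to `/` and `%`. The following is a minimal sketch of that idea, assuming a positive divisor; `floor_div` and `floor_mod` are hypothetical names and this is not necessarily how the patched DIV and MOD macros are written.

```c
#include <assert.h>

/* Quotient rounded towards -infinity. Assumes b > 0.
 * Only non-negative operands reach the built-in operators, so the result
 * is the same under C89 and C99 rounding rules. */
static int floor_div(int a, int b)
{
    assert(b > 0);
    if (a >= 0)
        return a / b;
    /* -(a + 1) is non-negative and cannot overflow, even for INT_MIN */
    return -1 - (-(a + 1)) / b;
}

/* Remainder matching floor_div: always in the range [0, b). */
static int floor_mod(int a, int b)
{
    assert(b > 0);
    if (a >= 0)
        return a % b;
    return b - 1 - (-(a + 1)) % b;
}
```

With these definitions, `floor_div(a, b) * b + floor_mod(a, b) == a` holds for negative `a` as well, which is the property that repeat-wrapping code typically relies on.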
2015-10-15	armv7: Retire old bilinear fast paths	Ben Avison	4	-2042/+4
	The new system of bilinear fetchers outperforms the old fast paths by a significant margin in nearly all cases. Since general_composite_rect() is capable of synthesising any fast path using fetchers, it is advantageous to simply remove all the old fast path code. This also simplifies the code base considerably. Benchmarks on Cortex-A7 (lowlevel-blt-bench -b) for the whole set of affected operations follow:

	src_8888_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    35.9   0.03     65.2   1.43    100.00%     +81.6%
	L2    35.6   0.17     62.7   0.65    100.00%     +76.3%
	M     34.7   0.02     53.3   0.29    100.00%     +53.7%
	HT    26.5   0.14     33.1   0.21    100.00%     +24.7%
	VT    24.8   0.13     26.6   0.38    100.00%     +7.6%
	R     21.9   0.12     23.7   0.25    100.00%     +8.3%
	RT    12.0   0.21     9.2    0.08    100.00%     -23.1%

	src_8888_0565
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    33.5   0.17     54.3   0.67    100.00%     +62.0%
	L2    33.3   0.18     52.8   1.04    100.00%     +58.6%
	M     32.9   0.00     44.9   0.52    100.00%     +36.4%
	HT    24.3   0.02     26.7   0.30    100.00%     +9.8%
	VT    23.1   0.28     20.9   0.24    100.00%     -9.6%
	R     19.6   0.24     18.9   0.20    100.00%     -3.6%
	RT    10.6   0.19     7.0    0.11    100.00%     -34.1%

	src_0565_x888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    25.3   0.14     50.6   0.77    100.00%     +99.7%
	L2    25.4   0.01     49.5   0.32    100.00%     +95.0%
	M     25.0   0.00     43.5   0.39    100.00%     +73.9%
	HT    21.7   0.22     29.7   0.32    100.00%     +36.8%
	VT    19.7   0.20     24.5   0.29    100.00%     +24.4%
	R     19.2   0.21     21.9   0.13    100.00%     +13.9%
	RT    11.9   0.20     9.2    0.16    100.00%     -22.9%

	src_0565_0565
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    21.7   0.11     43.8   0.41    100.00%     +102.5%
	L2    21.5   0.13     43.3   0.39    100.00%     +101.1%
	M     21.3   0.15     38.4   0.34    100.00%     +80.3%
	HT    18.6   0.16     24.7   0.06    100.00%     +32.6%
	VT    17.4   0.17     20.4   0.24    100.00%     +17.3%
	R     16.9   0.18     18.5   0.33    100.00%     +9.9%
	RT    10.8   0.17     7.2    0.16    100.00%     -33.3%

	over_8888_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    26.6   0.14     42.9   0.32    100.00%     +61.5%
	L2    26.4   0.12     41.7   0.32    100.00%     +57.9%
	M     24.7   0.20     31.8   0.27    100.00%     +28.8%
	HT    18.9   0.20     23.8   0.26    100.00%     +25.7%
	VT    16.5   0.18     17.9   0.21    100.00%     +8.7%
	R     15.0   0.17     16.5   0.19    100.00%     +9.8%
	RT    7.7    0.12     7.1    0.10    100.00%     -7.9%

	add_8888_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    28.3   0.02     62.2   0.79    100.00%     +119.6%
	L2    28.1   0.02     58.9   0.42    100.00%     +109.3%
	M     26.2   0.01     43.4   0.38    100.00%     +65.6%
	HT    20.2   0.01     30.3   0.18    100.00%     +50.2%
	VT    18.3   0.01     21.1   0.12    100.00%     +15.1%
	R     16.6   0.01     19.4   0.09    100.00%     +16.4%
	RT    8.9    0.06     7.9    0.14    100.00%     -10.6%

	src_8888_8_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    21.9   0.10     34.9   0.26    100.00%     +59.8%
	L2    21.8   0.09     34.8   0.35    100.00%     +59.8%
	M     21.0   0.18     32.2   0.27    100.00%     +53.1%
	HT    17.2   0.01     21.7   0.22    100.00%     +26.3%
	VT    15.1   0.16     17.4   0.19    100.00%     +14.6%
	R     14.1   0.16     16.6   0.18    100.00%     +17.5%
	RT    7.9    0.12     6.9    0.09    100.00%     -13.3%

	src_8888_8_0565
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    18.8   0.09     31.3   0.28    100.00%     +66.4%
	L2    18.7   0.09     31.1   0.24    100.00%     +66.2%
	M     18.3   0.14     29.1   0.15    100.00%     +58.4%
	HT    15.3   0.15     18.3   0.21    100.00%     +19.3%
	VT    13.9   0.14     14.9   0.13    100.00%     +6.9%
	R     12.8   0.13     13.9   0.06    100.00%     +8.5%
	RT    7.3    0.11     5.6    0.09    100.00%     -23.2%

	src_0565_8_x888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    19.7   0.10     30.3   0.20    100.00%     +53.9%
	L2    19.6   0.10     30.3   0.17    100.00%     +54.4%
	M     19.0   0.15     28.2   0.26    100.00%     +48.9%
	HT    16.1   0.17     20.1   0.24    100.00%     +24.5%
	VT    14.5   0.15     16.4   0.22    100.00%     +13.6%
	R     13.9   0.16     15.7   0.21    100.00%     +13.1%
	RT    8.0    0.13     6.8    0.12    100.00%     -15.3%

	src_0565_8_0565
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    17.2   0.09     27.5   0.23    100.00%     +59.8%
	L2    17.1   0.10     27.3   0.18    100.00%     +59.7%
	M     16.6   0.15     25.8   0.14    100.00%     +55.2%
	HT    14.2   0.21     17.1   0.06    100.00%     +20.4%
	VT    13.0   0.24     14.4   0.18    100.00%     +10.7%
	R     12.4   0.24     13.4   0.16    100.00%     +7.9%
	RT    7.3    0.21     5.5    0.04    100.00%     -23.8%

	over_8888_8_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    25.5   0.10     30.3   0.37    100.00%     +18.6%
	L2    25.3   0.02     29.9   0.17    100.00%     +18.0%
	M     21.5   0.00     22.3   0.21    100.00%     +4.0%
	HT    16.6   0.01     17.2   0.17    100.00%     +3.4%
	VT    14.4   0.01     13.5   0.14    100.00%     -6.6%
	R     12.5   0.01     12.5   0.13    71.35%      +0.3% (insignificant)
	RT    5.9    0.05     5.4    0.08    100.00%     -9.3%

	add_8888_8_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    21.0   0.08     36.9   0.53    100.00%     +75.6%
	L2    20.9   0.01     36.1   0.28    100.00%     +72.6%
	M     18.9   0.00     25.4   0.21    100.00%     +34.8%
	HT    14.6   0.01     19.6   0.23    100.00%     +34.9%
	VT    12.8   0.01     14.9   0.18    100.00%     +16.3%
	R     11.7   0.01     13.8   0.16    100.00%     +17.4%
	RT    6.0    0.04     5.7    0.09    100.00%     -4.4%
2015-10-15	arm: Add bilinear scaled fetchers for repeat type REFLECT	Ben Avison	1	-7/+68

2015-10-15	arm: Add bilinear scaled fetchers for repeat type NORMAL	Ben Avison	1	-7/+141

2015-10-15	arm: Add bilinear scaled fetchers for repeat type PAD	Ben Avison	1	-3/+131

2015-10-15	arm: Add bilinear scaled fetchers for repeat type NONE	Ben Avison	1	-26/+365

2015-10-15	armv7: Improved bilinear scaled fetchers for small scale factors	Ben Avison	2	-4/+364
	Adds a more sophisticated algorithm suitable for horizontal scale factors less than 1 (i.e. enlargements) where blocks of contiguous source pixels are loaded and format-converted prior to being picked. This produces a significant speed boost where the pixel conversion overhead outweighs the branch prediction penalties; in practice this appears to hold for the r5g6b5 fetcher only.
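The difference between the two orderings can be sketched in plain C. This is an illustration only: the real fetchers are NEON assembly and do bilinear interpolation, whereas the sketch below uses nearest-neighbour picking to keep the contrast visible, and all function names are hypothetical. Positions use 16.16 fixed point; for an enlargement (`ux` at most 1.0) the block that must be converted is never longer than the output, so each source pixel is converted once instead of once per output pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand one r5g6b5 pixel to x8r8g8b8, replicating the top bits of each
 * channel into the low bits. */
static uint32_t convert_0565_to_8888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

/* Strategy 1: pick each source pixel, then convert it. */
static void pick_then_convert(const uint16_t *src, uint32_t x, uint32_t ux,
                              uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = convert_0565_to_8888(src[x >> 16]);
        x += ux;
    }
}

/* Strategy 2: convert the contiguous block of source pixels that will be
 * needed, then pick from the converted block.
 * tmp is caller-provided scratch space with at least n entries. */
static void convert_then_pick(const uint16_t *src, uint32_t x, uint32_t ux,
                              uint32_t *dst, size_t n, uint32_t *tmp)
{
    assert(ux > 0 && ux <= 0x10000); /* enlargement, positive direction */
    if (n == 0)
        return;

    uint32_t first = x >> 16;
    uint32_t last = (x + (uint32_t)(n - 1) * ux) >> 16;

    for (uint32_t i = first; i <= last; i++)
        tmp[i - first] = convert_0565_to_8888(src[i]);

    for (size_t i = 0; i < n; i++) {
        dst[i] = tmp[(x >> 16) - first];
        x += ux;
    }
}
```

Both functions produce the same output; which one is faster depends on the cost of the conversion relative to the extra bookkeeping, which is the trade-off the commit message describes.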
2015-10-15	armv7: Add bilinear scaled a8 fetcher	Ben Avison	2	-0/+22

2015-10-15	armv7: Add bilinear scaled r5g6b5 fetcher	Ben Avison	2	-0/+32

2015-10-15	armv7: Add bilinear scaled x8r8g8b8 fetcher	Ben Avison	2	-0/+24

2015-10-15	armv7: Add bilinear scaled a8r8g8b8 fetcher	Ben Avison	2	-0/+6

2015-10-15	armv7: Support for bilinear scaled fetchers	Ben Avison	5	-2/+494
	Introduces the infrastructure that will be used by ARMv7 bilinear scaled fetcher routines, but doesn't actually utilise it yet. For simplicity, only includes the macros where source pixels are picked before being format-converted.
2015-10-15	armv6: Add fetcher for a8 bilinear-interpolation scaled images	Ben Avison	3	-0/+36
	This is constrained to support X increments in the positive X direction only. It also doesn't attempt to support any form of image repeat. Here are some affine-bench results for a variety of horizontal and vertical scaling factors. Before: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 6.2 6.2 6.2 6.1 6.0 0.75 6.2 6.1 6.1 6.0 5.9 1.0 6.2 6.1 5.9 5.8 1.5 6.1 6.0 5.9 5.8 5.6 2.0 6.1 6.0 5.9 5.7 5.5 After: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 22.2 21.2 19.7 21.0 20.4 0.75 19.4 18.3 16.7 18.2 17.4 1.0 24.7 22.3 22.1 20.4 1.5 14.2 13.0 11.5 12.9 12.1 2.0 12.0 10.9 9.5 10.8 10.0 Improvement: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 +256.4% +242.8% +219.6% +246.6% +241.3% 0.75 +212.9% +197.8% +173.7% +203.4% +195.1% 1.0 +300.2% +265.9% +273.4% +251.2% 1.5 +131.8% +115.6% +93.1% +123.2% +114.0% 2.0 +97.7% +82.9% +62.8% +91.0% +82.9%
2015-10-15	armv6: Add fetcher for r5g6b5 bilinear-interpolation scaled images	Ben Avison	3	-0/+48
	This is constrained to support X increments in the positive X direction only. It also doesn't attempt to support any form of image repeat. Here are some affine-bench results for a variety of horizontal and vertical scaling factors. Before: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 3.0 2.9 2.9 2.9 2.8 0.75 2.9 2.9 2.9 2.8 2.8 1.0 2.9 2.9 2.8 2.8 1.5 2.9 2.9 2.8 2.8 2.7 2.0 2.9 2.8 2.8 2.8 2.7 After: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 20.2 18.5 19.1 17.6 16.3 0.75 17.1 15.4 16.0 14.5 13.2 1.0 20.1 17.1 15.6 13.6 1.5 11.9 10.3 10.8 9.5 8.4 2.0 9.9 8.4 8.9 7.7 6.8 Improvement: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 +582.2% +530.7% +554.4% +514.7% +477.1% 0.75 +481.5% +427.7% +451.2% +410.3% +371.4% 1.0 +583.9% +486.9% +453.3% +392.7% 1.5 +308.1% +258.7% +281.0% +240.5% +208.1% 2.0 +241.4% +196.9% +217.8% +179.9% +152.4%
2015-10-15	armv6: Add fetcher for x8r8g8b8 bilinear-interpolation scaled images	Ben Avison	3	-0/+29
	This is constrained to support X increments in the positive X direction only. It also doesn't attempt to support any form of image repeat. Here are some affine-bench results for a variety of horizontal and vertical scaling factors. Before: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 3.9 3.8 3.7 3.6 3.4 0.75 3.8 3.8 3.7 3.5 3.3 1.0 3.8 3.7 3.5 3.3 1.5 3.7 3.6 3.5 3.3 3.1 2.0 3.6 3.5 3.4 3.2 3.0 After: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 20.8 19.3 18.8 19.6 18.4 0.75 17.8 16.2 15.7 16.6 15.2 1.0 21.3 18.4 18.9 16.9 1.5 12.5 11.1 10.5 11.4 10.2 2.0 10.5 9.1 8.7 9.4 8.3 Improvement: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 +436.3% +406.2% +402.8% +448.2% +437.9% 0.75 +364.2% +330.4% +324.7% +372.3% +354.5% 1.0 +461.5% +392.4% +446.9% +415.9% 1.5 +236.3% +204.8% +198.8% +242.2% +224.1% 2.0 +187.0% +156.5% +154.3% +193.3% +177.1%
2015-10-15	armv6: Add fetcher for a8r8g8b8 bilinear-interpolation scaled images	Ben Avison	4	-0/+1008
	This is constrained to support X increments in the positive X direction only. It also doesn't attempt to support any form of image repeat. Here are some affine-bench results for a variety of horizontal and vertical scaling factors. Before: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 7.1 6.9 6.8 6.6 6.3 0.75 6.4 6.2 6.1 5.8 5.5 1.0 5.9 5.7 5.2 4.9 1.5 5.0 4.8 4.6 4.3 4.0 2.0 4.4 4.2 4.0 3.7 3.4 After: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 21.0 19.6 19.2 20.2 18.9 0.75 18.0 16.6 16.1 17.1 15.9 1.0 21.8 18.9 19.9 17.7 1.5 12.8 11.3 10.9 11.8 10.7 2.0 10.7 9.3 8.9 9.8 8.8 Improvement: x increment 0.5 0.75 1.0 1.5 2.0 y increment 0.5 +196.7% +183.6% +181.8% +206.6% +198.4% 0.75 +182.2% +166.2% +164.0% +194.8% +185.8% 1.0 +271.7% +234.4% +282.7% +257.9% 1.5 +154.6% +135.3% +134.3% +173.3% +164.8% 2.0 +144.1% +124.2% +123.3% +165.6% +155.5%
2015-10-15	armv7: Add nearest-neighbour scaled a8 fetcher	Ben Avison	3	-7/+111

2015-10-15	armv7: Add nearest-neighbour scaled r5g6b5 fetcher	Ben Avison	1	-0/+2

2015-10-15	armv7: Add nearest-neighbour scaled x8r8g8b8 fetcher	Ben Avison	2	-0/+17

2015-10-15	armv7: Add nearest-neighbour scaled a8r8g8b8 fetcher	Ben Avison	2	-0/+35

2015-10-15	armv6: Add four more nearest-scaled-cover fast paths	Ben Avison	1	-0/+17
	These complete the set of fast paths where currently pixman-fast-path.c provides versions that get selected in preference to the armv6-optimised scanline fetchers/combiners/writeback routines. Because generation of these fast paths is macroised, the patch required to add them is fairly simple.

	lowlevel-blt-bench -n over_8888_8888:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    13.8   0.0      26.5   0.2     100.0%      +91.7%
	L2    9.4    0.2      22.9   0.4     100.0%      +142.6%
	M     8.6    0.0      23.8   0.0     100.0%      +176.1%
	HT    7.4    0.0      14.1   0.1     100.0%      +91.2%
	VT    7.3    0.0      13.4   0.1     100.0%      +84.1%
	R     7.0    0.0      13.0   0.1     100.0%      +85.9%
	RT    4.5    0.1      6.2    0.1     100.0%      +36.6%

	affine-bench * 0 0 1 over a8r8g8b8 a8r8g8b8:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   9.4    0.0      28.0   0.0     100.0%      +197.4%
	0.75  9.0    0.0      26.1   0.0     100.0%      +190.2%
	1.0   8.6    0.0      24.4   0.0     100.0%      +184.6%
	1.5   7.9    0.0      21.7   0.0     100.0%      +173.4%
	2.0   7.3    0.0      19.6   0.0     100.0%      +166.6%

	lowlevel-blt-bench -n src_x888_8888:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    108.6  2.0      66.3   0.9     100.0%      -39.0%
	L2    32.4   1.5      44.3   2.1     100.0%      +36.8%
	M     27.5   0.1      62.0   0.1     100.0%      +125.6%
	HT    20.3   0.1      28.7   0.2     100.0%      +41.2%
	VT    19.9   0.1      26.7   0.1     100.0%      +34.4%
	R     18.6   0.1      25.3   0.2     100.0%      +36.3%
	RT    8.7    0.1      9.8    0.2     100.0%      +12.6%

	affine-bench * 0 0 1 src x8r8g8b8 a8r8g8b8:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   45.2   0.0      97.2   0.1     100.0%      +115.1%
	0.75  35.9   0.1      76.7   0.1     100.0%      +113.9%
	1.0   29.6   0.1      61.1   0.1     100.0%      +106.4%
	1.5   21.4   0.0      52.7   0.1     100.0%      +145.9%
	2.0   16.7   0.0      43.0   0.1     100.0%      +156.9%

	lowlevel-blt-bench -n src_8888_0565:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    57.2   0.7      43.1   0.4     100.0%      -24.7%
	L2    23.0   1.0      32.8   1.0     100.0%      +42.5%
	M     24.8   0.0      42.2   0.0     100.0%      +70.0%
	HT    18.0   0.1      22.1   0.1     100.0%      +22.5%
	VT    17.1   0.1      21.0   0.1     100.0%      +22.5%
	R     16.5   0.1      20.0   0.1     100.0%      +21.4%
	RT    8.3    0.2      8.4    0.1     95.0%       +1.0% (insignificant)

	affine-bench * 0 0 1 src a8r8g8b8 r5g6b5:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   34.9   0.0      55.3   0.0     100.0%      +58.7%
	0.75  29.3   0.0      49.1   0.0     100.0%      +67.4%
	1.0   24.8   0.0      42.6   0.1     100.0%      +71.6%
	1.5   19.0   0.0      38.2   0.1     100.0%      +100.7%
	2.0   15.4   0.0      31.8   0.0     100.0%      +107.1%

	lowlevel-blt-bench -n over_8888_0565:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    9.8    0.0      15.3   0.1     100.0%      +56.6%
	L2    7.4    0.0      14.3   0.2     100.0%      +91.7%
	M     7.5    0.0      15.4   0.0     100.0%      +106.0%
	HT    6.5    0.0      10.1   0.0     100.0%      +54.5%
	VT    6.4    0.0      9.9    0.0     100.0%      +54.6%
	R     6.2    0.0      9.5    0.0     100.0%      +52.1%
	RT    4.2    0.0      4.6    0.1     100.0%      +9.8%

	affine-bench * 0 0 1 over a8r8g8b8 r5g6b5:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   8.0    0.0      17.3   0.0     100.0%      +116.1%
	0.75  7.8    0.0      16.5   0.0     100.0%      +112.9%
	1.0   7.5    0.0      15.7   0.0     100.0%      +110.5%
	1.5   7.0    0.0      14.8   0.0     100.0%      +112.8%
	2.0   6.5    0.0      13.7   0.0     100.0%      +111.4%
2015-10-15	armv6: Add nearest-scaled-cover src_0565_0565 fast path	Ben Avison	3	-9/+75
	This is adapted from the nearest scaled cover scanline fetcher, modified to pack output data in 16-bit units. lowlevel-blt-bench -n src_0565_0565: Before After Mean StdDev Mean StdDev Confidence Change L1 119.6 4.1 72.5 1.1 100.0% -39.4% L2 45.2 1.4 55.4 2.0 100.0% +22.5% M 47.1 0.1 71.3 0.1 100.0% +51.4% HT 26.4 0.2 31.8 0.3 100.0% +20.3% VT 25.0 0.2 30.0 0.3 100.0% +20.3% R 22.6 0.2 27.6 0.2 100.0% +22.0% RT 9.7 0.2 10.3 0.2 100.0% +5.6% affine-bench * 0 0 1 src r5g6b5 r5g6b5: Before After Mean StdDev Mean StdDev Confidence Change 0.5 59.6 0.1 129.6 0.1 100.0% +117.2% 0.75 52.0 0.1 106.3 0.1 100.0% +104.6% 1.0 47.2 0.1 71.7 0.0 100.0% +52.0% 1.5 39.1 0.1 68.1 0.1 100.0% +74.2% 2.0 37.7 0.1 68.7 0.1 100.0% +82.2%
2015-10-15	armv6: Add nearest-scaled-cover src_8888_8888 fast path	Ben Avison	2	-0/+104
	Without this patch, any such operations are matched against the fast path implementation in pixman-fast-path.c before general_composite_rect(), so we never get to use the armv6-optimised assembly fetcher routines. This patch adds a C wrapper to the same assembly routine used for the nearest-scaled-cover fetcher, adapted to perform a 2D plot rather than a single scanline. The C is macroised so that later patches can use the same approach to build more complex fast paths from combinations of armv6 fetcher/combiner/writeback routines in a similar manner to pixman_composite_rect().

	lowlevel-blt-bench -n src_8888_8888:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    117.2  1.6      79.2   1.1     100.0%      -32.4%
	L2    44.1   3.1      49.9   2.4     100.0%      +13.2%
	M     40.0   0.1      72.5   0.1     100.0%      +81.4%
	HT    20.1   0.1      29.5   0.3     100.0%      +46.5%
	VT    19.4   0.1      27.7   0.2     100.0%      +42.7%
	R     18.2   0.1      26.2   0.2     100.0%      +44.1%
	RT    8.7    0.2      10.0   0.2     100.0%      +15.8%

	affine-bench * 0 0 1 src a8r8g8b8 a8r8g8b8:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   46.6   0.1      110.5  0.1     100.0%      +137.2%
	0.75  39.1   0.1      88.5   0.1     100.0%      +126.1%
	1.0   36.3   0.2      71.7   0.1     100.0%      +97.7%
	1.5   26.7   0.1      55.3   0.1     100.0%      +106.8%
	2.0   19.9   0.0      43.5   0.0     100.0%      +119.2%
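The shape of such a wrapper can be sketched as follows. This is a hedged illustration, not the macro from the patch: `fetch_nearest_scanline` is a plain C stand-in for the assembly routine, `plot_nearest_2d` is a hypothetical name, strides are counted in pixels, and coordinates are 16.16 fixed point with positive increments only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the assembly scanline routine: nearest-neighbour fetch of
 * one row. x and ux are 16.16 fixed point; ux must be positive. */
static void fetch_nearest_scanline(uint32_t *dst, const uint32_t *src,
                                   int width, uint32_t x, uint32_t ux)
{
    assert(ux > 0);
    for (int i = 0; i < width; i++) {
        dst[i] = src[x >> 16];
        x += ux;
    }
}

/* 2D wrapper: step through the source rows in fixed point and reuse the
 * scanline routine for every destination row. */
static void plot_nearest_2d(uint32_t *dst, size_t dst_stride,
                            const uint32_t *src, size_t src_stride,
                            int width, int height,
                            uint32_t x, uint32_t ux,
                            uint32_t y, uint32_t uy)
{
    for (int row = 0; row < height; row++) {
        const uint32_t *src_row = src + (size_t)(y >> 16) * src_stride;

        fetch_nearest_scanline(dst + (size_t)row * dst_stride, src_row,
                               width, x, ux);
        y += uy;
    }
}
```

Calling `plot_nearest_2d` with `ux = uy = 0x8000` doubles a 2x2 source into a 4x4 destination; because the destination is written directly, no intermediate scanline buffer is needed for a plain SRC operation.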
2015-10-15	armv6: Add fetcher for a8 nearest-neighbour transformed images	Ben Avison	2	-0/+10
	This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are benchmarks for src_8_8888, which uses it. lowlevel-blt-bench -n : Before After Mean StdDev Mean StdDev Confidence Change L1 15.1 0.1 55.5 0.3 100.0% +267.2% L2 13.7 0.1 45.3 0.8 100.0% +230.0% M 14.5 0.0 53.9 0.1 100.0% +272.5% HT 8.3 0.0 21.2 0.2 100.0% +154.6% VT 8.3 0.0 20.1 0.3 100.0% +141.7% R 8.0 0.0 19.2 0.3 100.0% +140.5% RT 3.6 0.0 6.8 0.1 100.0% +88.4% affine-bench: Before After Mean StdDev Mean StdDev Confidence Change 0.5 17.2 0.0 76.5 0.1 100.0% +344.4% 0.75 16.7 0.0 67.1 0.1 100.0% +300.8% 1.0 16.4 0.0 54.3 0.1 100.0% +232.2% 1.5 15.7 0.0 52.4 0.1 100.0% +234.6% 2.0 14.8 0.0 50.8 0.1 100.0% +243.9%
2015-10-15	armv6: Add fetcher for x8r8g8b8 nearest-neighbour transformed images	Ben Avison	2	-0/+10
	This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are benchmarks for add_x888_8888, which uses it. lowlevel-blt-bench -n : Before After Mean StdDev Mean StdDev Confidence Change L1 12.0 0.0 45.0 0.5 100.0% +275.3% L2 9.2 0.1 30.4 1.2 100.0% +231.6% M 8.6 0.0 27.8 0.1 100.0% +224.0% HT 6.0 0.0 15.4 0.1 100.0% +158.5% VT 5.9 0.0 14.5 0.1 100.0% +146.2% R 5.7 0.0 14.1 0.1 100.0% +145.8% RT 2.9 0.0 5.6 0.1 100.0% +91.4% affine-bench: Before After Mean StdDev Mean StdDev Confidence Change 0.5 12.1 0.0 32.5 0.1 100.0% +169.6% 0.75 11.3 0.0 30.0 0.0 100.0% +165.1% 1.0 10.7 0.0 27.1 0.0 100.0% +153.7% 1.5 9.6 0.0 24.1 0.0 100.0% +151.6% 2.0 8.8 0.0 21.5 0.0 100.0% +145.1%
2015-10-15	armv6: Add fetcher for r5g6b5 nearest-neighbour transformed images	Ben Avison	2	-0/+22
	This is related to the a8r8g8b8 nearest-scaled-cover fetcher. Below are benchmarks for src_0565_8888, which uses it. lowlevel-blt-bench -n : Before After Mean StdDev Mean StdDev Confidence Change L1 9.0 0.0 34.4 0.3 100.0% +284.7% L2 8.1 0.1 29.0 0.6 100.0% +258.7% M 8.4 0.0 33.2 0.1 100.0% +297.6% HT 5.8 0.0 16.5 0.3 100.0% +183.6% VT 5.8 0.0 16.0 0.3 100.0% +175.6% R 5.6 0.0 15.6 0.1 100.0% +175.5% RT 3.0 0.0 6.0 0.2 100.0% +98.7% affine-bench: Before After Mean StdDev Mean StdDev Confidence Change 0.5 11.2 0.0 52.0 0.1 100.0% +363.2% 0.75 10.9 0.0 41.3 0.1 100.0% +279.3% 1.0 10.6 0.0 33.4 0.1 100.0% +216.7% 1.5 10.0 0.0 32.3 0.1 100.0% +221.8% 2.0 9.4 0.0 31.7 0.0 100.0% +236.0%
2015-10-15	armv6: Add fetcher for a8r8g8b8 nearest-neighbour transformed images	Ben Avison	4	-0/+479
	This is constrained to support X increments in the positive X direction only, so this means scaled images (except those reflected in the Y axis) plus parallelogram transformations which preserve the direction of the X axis. It also doesn't attempt to support any form of image repeat. With this optimisation, some operations constructed from fetcher and combiner calls using general_composite_rect() now outperform the versions constructed from FAST_NEAREST macros in pixman-fast-path.c, but unfortunately the FAST_NEAREST ones have higher priority in fast path lookup. Here are some benchmarks for the in_reverse_8888_8888 operation, which is not affected by this priority issue:

	lowlevel-blt-bench -n :
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    10.2   0.0      27.1   0.2     100.0%      +164.8%
	L2    8.2    0.1      23.0   0.4     100.0%      +179.2%
	M     8.3    0.0      24.8   0.0     100.0%      +200.3%
	HT    5.5    0.0      12.7   0.0     100.0%      +129.9%
	VT    5.4    0.0      12.1   0.0     100.0%      +123.2%
	R     5.4    0.0      11.9   0.1     100.0%      +122.7%
	RT    2.8    0.0      5.4    0.1     100.0%      +91.9%

	affine-bench for 5 different scaling factors:
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	0.5   11.1   0.0      28.3   0.0     100.0%      +155.1%
	0.75  10.5   0.0      26.4   0.0     100.0%      +152.2%
	1.0   9.9    0.0      24.6   0.0     100.0%      +147.5%
	1.5   9.0    0.0      21.8   0.0     100.0%      +141.4%
	2.0   8.3    0.0      19.7   0.0     100.0%      +138.4%
2015-10-15	armv7: Re-use existing fast paths in more cases	Ben Avison	1	-0/+9
	There are a group of combiner types - SRC, OVER_REVERSE, IN, OUT and ADD - where the source alpha affects only the destination alpha component. This means that any fast path with a8r8g8b8 source and destination can also be applied to an equivalent operation with x8r8g8b8 source and destination just by updating the fast path table, and likewise with a8b8g8r8 and x8b8g8r8. The following operations are affected: add_x888_8_x888 (and bilinear scaled version of same) add_x888_8888_x888 add_x888_n_x888 add_x888_x888 (and bilinear scaled version of same)
2015-10-15	armv7: Re-use existing fast paths in more cases	Ben Avison	1	-0/+12
	There are a group of combiner types - SRC, OVER, IN_REVERSE, OUT_REVERSE and ADD - where the destination alpha component is only used (if at all) to determine the destination alpha component. This means that any such fast paths with an a8r8g8b8 destination can also be applied to an x8r8g8b8 destination just by updating the fast path table, and likewise with a8b8g8r8 and x8b8g8r8. The following operations are affected: over_8888_8888_x888 add_n_8_x888 add_8888_8_x888 add_8888_8888_x888 add_8888_n_x888 add_8888_x888 out_reverse_8_x888
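The reasoning can be checked on the ADD operator with a small reference model. This is a sketch with hypothetical names, not pixman's implementation: each 8-bit channel is a saturating sum of the corresponding source and destination channels, so the destination alpha byte only ever influences the result's alpha byte. A routine written for an a8r8g8b8 destination therefore produces the right colour channels when pointed at x8r8g8b8 data, whatever the unused top byte contains.

```c
#include <assert.h>
#include <stdint.h>

/* Reference ADD of two packed 8888 pixels: per-channel saturating sum. */
static uint32_t add_8888(uint32_t s, uint32_t d)
{
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((s >> shift) & 0xff) + ((d >> shift) & 0xff);

        if (sum > 0xff)
            sum = 0xff;
        result |= sum << shift;
    }
    return result;
}

/* Scanline form, as a combiner or fast path inner loop would apply it. */
static void add_8888_scanline(uint32_t *dst, const uint32_t *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = add_8888(src[i], dst[i]);
}
```

Masking the result with `0x00ffffff` gives the same value for any destination alpha byte, which is why only the fast path table needs updating.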
2015-10-15	armv7: Add in_n_8888 fast path	Ben Avison	2	-0/+68
	This is tuned for Cortex-A7 (Raspberry Pi 2). lowlevel-blt-bench results, compared to the ARMv6 fast path: Before After Mean StdDev Mean StdDev Confidence Change L1 104.6 0.5 119.4 0.1 100.0% +14.1% L2 106.8 0.6 121.4 0.1 100.0% +13.6% M 100.3 1.3 116.4 0.0 100.0% +16.0% HT 64.5 1.0 70.8 0.1 100.0% +9.8% VT 56.0 0.8 62.2 0.1 100.0% +11.1% R 54.1 0.9 55.2 0.0 100.0% +1.9% RT 24.6 0.5 26.6 0.0 100.0% +8.3%
2015-10-15	armv6: Add in_n_8888 fast path	Ben Avison	2	-0/+81
	lowlevel-blt-bench results: Before After Mean StdDev Mean StdDev Confidence Change L1 18.8 0.1 63.9 0.9 100.0% +239.0% L2 16.0 0.4 58.5 1.3 100.0% +265.8% M 13.1 0.0 56.8 0.1 100.0% +332.6% HT 11.6 0.0 31.3 0.3 100.0% +169.6% VT 11.4 0.0 27.2 0.2 100.0% +139.2% R 11.0 0.1 28.2 0.2 100.0% +156.1% RT 6.8 0.1 12.9 0.2 100.0% +89.0%
2015-10-15	armv6: Add over_8888_n_0565 fast path	Ben Avison	2	-20/+53
	lowlevel-blt-bench results: Before After Mean StdDev Mean StdDev Confidence Change L1 5.7 0.0 20.6 0.1 100.0% +263.8% L2 4.9 0.0 17.4 0.3 100.0% +254.0% M 4.8 0.0 19.9 0.0 100.0% +312.9% HT 4.5 0.0 12.4 0.1 100.0% +175.4% VT 4.5 0.0 12.0 0.0 100.0% +168.9% R 4.3 0.0 11.4 0.1 100.0% +163.3% RT 2.9 0.0 6.0 0.1 100.0% +106.9%
2015-10-15	armv6: Add over_8888_8_0565 fast path	Ben Avison	2	-0/+186
	lowlevel-blt-bench results: Before After Mean StdDev Mean StdDev Confidence Change L1 5.2 0.0 20.0 0.2 100.0% +281.7% L2 4.5 0.0 16.2 0.2 100.0% +256.9% M 4.5 0.0 18.8 0.0 100.0% +321.1% HT 3.9 0.0 10.9 0.0 100.0% +177.6% VT 3.9 0.0 10.6 0.0 100.0% +171.5% R 3.8 0.0 10.0 0.0 100.0% +165.1% RT 2.3 0.0 4.9 0.1 100.0% +107.7%
2015-10-15	armv7: Add in_8888_8 fast path	Ben Avison	2	-0/+40
	This is tuned for the Cortex-A7 (Raspberry Pi 2). lowlevel-blt-bench results, compared to the ARMv6 fast path: Before After Mean StdDev Mean StdDev Confidence Change L1 146.0 0.7 231.4 1.2 100.0% +58.5% L2 143.1 0.9 222.1 1.7 100.0% +55.3% M 110.9 0.0 129.0 0.5 100.0% +16.3% HT 57.3 0.6 73.0 0.3 100.0% +27.4% VT 46.6 0.5 61.6 0.4 100.0% +32.3% R 42.3 0.2 51.7 0.2 100.0% +22.2% RT 19.1 0.1 21.0 0.1 100.0% +9.9%
2015-10-15	armv6: Add in_8888_8 fast path	Ben Avison	2	-0/+116
	This is used instead of the equivalent C fast path. lowlevel-blt-bench results, compared to no fast path at all: Before After Mean StdDev Mean StdDev Confidence Change L1 12.4 0.1 117.5 2.3 100.0% +851.2% L2 9.5 0.1 46.9 2.4 100.0% +393.8% M 9.6 0.0 61.9 0.9 100.0% +544.0% HT 7.9 0.0 26.6 0.5 100.0% +238.6% VT 7.7 0.0 24.2 0.4 100.0% +212.5% R 7.4 0.0 22.4 0.4 100.0% +204.5% RT 4.1 0.0 8.7 0.2 100.0% +109.4%
2015-10-15	pixman-fast-path: Add in_8888_8 fast path	Ben Avison	1	-0/+40
	This is a C fast path, useful for reference or for platforms that don't have their own fast path for this operation. lowlevel-blt-bench results on ARMv6: Before After Mean StdDev Mean StdDev Confidence Change L1 12.4 0.1 24.4 0.3 100.0% +97.8% L2 9.5 0.1 14.1 0.2 100.0% +48.1% M 9.6 0.0 14.7 0.0 100.0% +53.1% HT 7.9 0.0 12.0 0.1 100.0% +52.3% VT 7.7 0.0 11.6 0.1 100.0% +49.8% R 7.4 0.0 10.8 0.1 100.0% +47.2% RT 4.1 0.0 6.1 0.1 100.0% +48.2%
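For reference, the arithmetic behind an in_8888_8 operation can be sketched as below. This is a minimal model under stated assumptions, not the code added by the patch: with an a8 destination, the IN operator reduces to multiplying each destination byte by the source pixel's alpha, and `mul_un8` is a hypothetical helper that performs the usual rounded division by 255.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;

    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* IN with an a8r8g8b8 source and an a8 destination:
 * dest = source alpha * dest. */
static void in_8888_8_scanline(uint8_t *dst, const uint32_t *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        uint8_t sa = (uint8_t)(src[i] >> 24);

        dst[i] = mul_un8(sa, dst[i]);
    }
}
```

An opaque source leaves the destination unchanged and a fully transparent source clears it, two cases that an optimised implementation may choose to special-case.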
2015-10-15	armv6: Add over_n_0565 fast path	Ben Avison	2	-0/+118
	This is used instead of the equivalent C fast path. lowlevel-blt-bench results, compared to no fast path at all: Before After Mean StdDev Mean StdDev Confidence Change L1 8.2 0.0 38.7 0.5 100.0% +372.7% L2 7.9 0.1 37.6 0.5 100.0% +376.8% M 7.3 0.0 38.5 0.1 100.0% +425.6% HT 6.9 0.0 26.1 0.3 100.0% +279.9% VT 6.8 0.0 24.5 0.3 100.0% +258.0% R 6.6 0.1 23.6 0.2 100.0% +255.1% RT 4.5 0.1 10.9 0.2 100.0% +143.1%
2015-10-15	pixman-fast-path: Add over_n_0565 fast path	Ben Avison	1	-0/+35
	This is a C fast path, useful for reference or for platforms that don't have their own fast path for this operation. lowlevel-blt-bench results on ARMv6: Before After Mean StdDev Mean StdDev Confidence Change L1 8.2 0.0 11.3 0.1 100.0% +38.6% L2 7.9 0.1 10.5 0.0 100.0% +33.3% M 7.3 0.0 10.0 0.0 100.0% +36.7% HT 6.9 0.0 9.2 0.0 100.0% +33.3% VT 6.8 0.0 9.0 0.0 100.0% +32.1% R 6.6 0.1 8.8 0.0 100.0% +31.8% RT 4.5 0.1 6.3 0.1 100.0% +39.7%
2015-10-15	pixman-fast-path: Add over_n_8888 fast path	Ben Avison	1	-0/+35
	This is a C fast path, useful for reference or for platforms that don't have their own fast path for this operation. lowlevel-blt-bench results on ARMv6: Before After Mean StdDev Mean StdDev Confidence Change L1 11.9 0.1 20.4 0.2 100.0% +71.1% L2 10.6 0.2 16.5 0.4 100.0% +55.8% M 9.4 0.0 13.5 0.0 100.0% +44.3% HT 8.4 0.0 12.2 0.1 100.0% +43.9% VT 8.3 0.0 11.9 0.1 100.0% +42.7% R 8.1 0.0 11.5 0.1 100.0% +41.3% RT 5.4 0.1 7.6 0.1 100.0% +40.3%
2015-10-15	armv7: Add optimised scanline writeback for r5g6b5	Ben Avison	2	-0/+20
	lowlevel-blt-bench results for an example operation, src_1555_0565: Before After Mean StdDev Mean StdDev Confidence Change L1 85.8 2.12 114.0 1.65 100.00% +32.9% L2 83.7 0.96 106.0 1.01 100.00% +26.7% M 76.4 0.66 94.8 0.98 100.00% +24.0% HT 39.8 0.37 38.9 0.29 100.00% -2.3% VT 37.0 0.36 34.1 0.24 100.00% -7.7% R 33.9 0.37 30.3 0.24 100.00% -10.5% RT 14.7 0.20 11.5 0.11 100.00% -21.7%
2015-10-15	armv7: Add optimised untransformed scanline fetchers r5g6b5 & a1r5g5b5	Ben Avison	2	-0/+33
	lowlevel-blt-bench results on Cortex-A7 for a couple of sample operations that utilise these fetchers are below. add_0565_8888: Before After Mean StdDev Mean StdDev Confidence Change L1 75.4 0.38 147.5 0.90 100.00% +95.7% L2 72.3 0.36 129.3 0.57 100.00% +79.0% M 64.4 0.05 94.6 0.90 100.00% +46.8% HT 35.8 0.03 42.3 0.26 100.00% +18.1% VT 29.9 0.04 34.3 0.31 100.00% +14.5% R 26.1 0.02 28.6 0.11 100.00% +9.4% RT 12.2 0.06 13.1 0.15 100.00% +7.9% add_1555_8888: Before After Mean StdDev Mean StdDev Confidence Change L1 73.3 0.38 160.7 0.89 100.00% +119.2% L2 69.8 0.08 139.1 0.74 100.00% +99.4% M 62.2 0.03 100.4 0.76 100.00% +61.4% HT 35.1 0.03 42.9 0.42 100.00% +22.1% VT 29.5 0.03 34.7 0.33 100.00% +17.8% R 25.8 0.02 28.7 0.27 100.00% +11.4% RT 12.1 0.02 13.2 0.15 100.00% +8.5% --- For the record, I tried writing an a8 fetcher, but benchmarking indicated that it couldn't improve upon the ARMv6 a8 fetcher results. I also tried adding prefetch to the above fetchers - since they are the first iterator in a chain and won't benefit from write-allocate caches, you might think that this would help. Benchmarking indicated otherwise.
2015-10-15	armv7: Add src_1555_8888 fast path	Ben Avison	2	-0/+59
	This is tuned for Cortex-A7 (Raspberry Pi 2). lowlevel-blt-bench results, compared to the ARMv6 fast path: Before After Mean StdDev Mean StdDev Confidence Change L1 88.6 0.2 221.3 0.5 100.0% +149.7% L2 88.1 0.4 219.2 0.8 100.0% +148.9% M 87.9 0.1 178.2 0.1 100.0% +102.6% HT 59.7 0.4 72.0 0.2 100.0% +20.7% VT 53.2 0.4 69.8 0.2 100.0% +31.3% R 48.5 0.3 53.6 0.1 100.0% +10.6% RT 21.2 0.1 23.0 0.1 100.0% +8.5%
2015-10-15	armv6: Add src_1555_8888 fast path	Ben Avison	2	-0/+19
	lowlevel-blt-bench results, compared to using the armv6 1555 fetcher: Before After Mean StdDev Mean StdDev Confidence Change L1 57.0 1.1 70.1 0.6 100.0% +23.1% L2 41.4 1.0 44.1 1.4 100.0% +6.3% M 49.8 0.1 59.0 0.2 100.0% +18.5% HT 21.4 0.3 32.3 0.3 100.0% +50.9% VT 21.0 0.3 30.2 0.3 100.0% +43.8% R 19.7 0.2 27.0 0.2 100.0% +37.4% RT 7.0 0.2 10.9 0.3 100.0% +56.6%
2015-10-15	armv6: Add optimised scanline fetcher for x8r8g8b8	Ben Avison	2	-0/+13
	This supports x8r8g8b8 source images. lowlevel-blt-bench results for src_x888_8888 with PIXMAN_DISABLE=wholeops on a Raspberry Pi 1: Before After Mean StdDev Mean StdDev Confidence Change L1 55.5 0.98 147.5 5.82 100.00% +165.8% L2 25.2 0.84 46.7 2.83 100.00% +85.5% M 27.8 0.15 57.5 0.06 100.00% +106.7% HT 14.5 0.10 24.2 0.19 100.00% +66.8% VT 14.2 0.11 23.2 0.20 100.00% +63.0% R 13.5 0.07 22.0 0.24 100.00% +63.3% RT 5.5 0.05 7.8 0.24 100.00% +41.8% lowlevel-blt-bench results for src_x888_8888 with PIXMAN_DISABLE=wholeops on a Raspberry Pi 2 (ARMv7): Before After Mean StdDev Mean StdDev Confidence Change L1 135.8 2.43 236.4 6.68 100.00% +74.0% L2 122.8 1.09 201.4 2.01 100.00% +64.1% M 94.1 1.15 145.2 0.59 100.00% +54.3% HT 41.1 0.53 52.4 0.38 100.00% +27.5% VT 36.5 0.53 51.7 0.38 100.00% +41.7% R 30.3 0.42 40.9 0.29 100.00% +34.7% RT 13.7 0.24 17.5 0.25 100.00% +28.2% The before case was using the fetcher iterator defined in pixman-access.c. Note that it does not appear to be worthwhile to create an additional ARMv7 version of this fetcher. If we construct one using the src_x888_8888 macros the results are as follows on a Raspberry Pi 2: Before After Mean StdDev Mean StdDev Confidence Change L1 236.4 6.68 259.0 3.58 100.00% +9.6% L2 201.4 2.01 209.8 2.17 100.00% +4.2% M 145.2 0.59 139.4 1.06 100.00% -4.0% HT 52.4 0.38 51.4 0.56 100.00% -1.9% VT 51.7 0.38 47.8 0.86 100.00% -7.6% R 40.9 0.29 35.3 0.40 100.00% -13.5% RT 17.5 0.25 16.5 0.26 100.00% -6.2%
2015-10-15	armv6: Add optimised scanline fetcher for a1r5g5b5	Ben Avison	2	-0/+73
	This supports a1r5g5b5 source images. lowlevel-blt-bench results for src_1555_8888, which does not yet have a dedicated fast path: Before After Mean StdDev Mean StdDev Confidence Change L1 24.5 0.2 57.0 1.1 100.0% +132.2% L2 19.3 0.4 41.4 1.0 100.0% +114.3% M 20.4 0.0 49.8 0.1 100.0% +144.7% HT 12.8 0.1 21.4 0.3 100.0% +67.0% VT 12.7 0.1 21.0 0.3 100.0% +65.4% R 12.1 0.1 19.7 0.2 100.0% +63.1% RT 5.6 0.1 7.0 0.2 100.0% +24.8%
2015-10-15	armv6: Add optimised scanline fetchers and writeback for r5g6b5 and a8	Ben Avison	3	-0/+178
	This supports r5g6b5 source and destination images, and a8 source images. lowlevel-blt-bench results for example operations which use these because they lack a dedicated fast path at the time of writing:

	in_reverse_8_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    30.0   0.3      37.0   0.3     100.0%      +23.2%
	L2    23.3   0.3      29.4   0.4     100.0%      +26.1%
	M     24.0   0.0      31.3   0.1     100.0%      +30.5%
	HT    12.8   0.1      16.1   0.1     100.0%      +25.8%
	VT    11.9   0.1      14.8   0.1     100.0%      +24.6%
	R     11.7   0.1      14.6   0.1     100.0%      +24.5%
	RT    5.1    0.1      6.2    0.1     100.0%      +20.2%

	in_0565_8888
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    22.0   0.1      28.3   0.2     100.0%      +28.4%
	L2    16.6   0.2      23.6   0.3     100.0%      +42.2%
	M     16.5   0.0      24.7   0.1     100.0%      +49.5%
	HT    11.0   0.1      13.7   0.1     100.0%      +24.4%
	VT    10.7   0.0      13.1   0.1     100.0%      +22.0%
	R     10.3   0.0      12.6   0.1     100.0%      +22.5%
	RT    5.3    0.1      5.7    0.1     100.0%      +9.0%

	in_reverse_8888_0565
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    16.6   0.1      20.9   0.1     100.0%      +25.5%
	L2    13.1   0.1      17.7   0.3     100.0%      +35.3%
	M     13.2   0.0      19.2   0.0     100.0%      +45.3%
	HT    9.6    0.0      11.7   0.1     100.0%      +21.8%
	VT    9.3    0.0      11.4   0.1     100.0%      +22.4%
	R     9.0    0.0      10.9   0.1     100.0%      +21.1%
	RT    4.7    0.1      5.2    0.1     100.0%      +8.7%
2015-10-15	armv7: Add OVER_REVERSE combiner	Ben Avison	2	-0/+147
	In common with the ARMv6 version of this combiner, this code features a shortcut for the case where the destination is opaque. Without that, the NEON version performs significantly worse than the ARMv6 version (though it must be noted that the effect of repeated application of the OVER_REVERSE operator is to set the destination opaque, so lowlevel-blt-bench is perhaps not best representing real-world usage in this case).

	lowlevel-blt-bench results for over_reverse_0565_8888 (compared to ARMv6 version):
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    73.4   0.21     77.9   0.40    100.00%     +6.2%
	L2    72.8   0.18     76.0   0.40    100.00%     +4.4%
	M     66.3   0.02     70.1   0.67    100.00%     +5.8%
	HT    34.0   0.19     31.0   0.38    100.00%     -9.0%
	VT    30.2   0.16     27.4   0.35    100.00%     -9.1%
	R     28.5   0.16     23.4   0.32    100.00%     -17.9%
	RT    12.4   0.10     10.5   0.17    100.00%     -15.2%

	lowlevel-blt-bench results for over_reverse_0565_8_8888 (compared to ARMv6 version):
	      Before          After
	      Mean   StdDev   Mean   StdDev  Confidence  Change
	L1    60.0   0.20     65.4   0.29    100.00%     +9.0%
	L2    59.1   0.18     63.4   0.38    100.00%     +7.2%
	M     50.3   0.24     55.8   0.09    100.00%     +10.9%
	HT    24.1   0.15     22.4   0.12    100.00%     -7.1%
	VT    20.8   0.12     19.6   0.13    100.00%     -5.6%
	R     19.6   0.13     17.2   0.01    100.00%     -12.4%
	RT    8.2    0.06     7.5    0.05    100.00%     -8.2%

	It's notable that the comparative performance depends heavily upon the rectangle size - not surprising since one of the main features of NEON is the ability to work on larger blocks of data at once, which is mainly a benefit to large data sets, and the larger granularity works against it for smaller data sets. Comments welcome on whether it would be desirable to select between ARMv6 and ARMv7 implementations at runtime based upon the rectangle size.
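The opaque-destination shortcut can be illustrated with a scalar reference model. This is a sketch under stated assumptions rather than the NEON code: pixels are premultiplied a8r8g8b8, OVER_REVERSE is computed as dest + source * (255 - dest alpha) / 255 per channel, and the names `mul_un8`, `over_reverse_pixel` and `combine_over_reverse` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;

    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER_REVERSE on one premultiplied pixel: the destination stays on top
 * and the source only shows through where the destination is transparent. */
static uint32_t over_reverse_pixel(uint32_t s, uint32_t d)
{
    uint8_t ida = (uint8_t)(0xff - (d >> 24));
    uint32_t result = 0;

    if (ida == 0)
        return d; /* shortcut: an opaque destination is left untouched */

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((d >> shift) & 0xff) +
                     mul_un8((uint8_t)((s >> shift) & 0xff), ida);

        if (c > 0xff)
            c = 0xff;
        result |= c << shift;
    }
    return result;
}

static void combine_over_reverse(uint32_t *dst, const uint32_t *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = over_reverse_pixel(src[i], dst[i]);
}
```

In this model, a single pass with an opaque source already makes every destination pixel opaque, so a benchmark that applies the operator repeatedly to the same destination mostly exercises the shortcut, which is consistent with the caveat in the commit message.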