|
This makes the demo match normal behavior of pixman/cairo at startup.
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Detects and uses PIXMAN_FILTER_NEAREST for all 8 90-degree rotations and
reflections when the scale is 1.0 and integer translation.
GOOD uses:
scale < 1/16 : BOX.BOX at size 16
scale < 3/4 : BOX.BOX at size 1/scale
larger : BOX.BOX at size 1
If both directions have a scale >= 3/4 or a scale of 1/2 and an integer
translation, the faster PIXMAN_FILTER_BILINEAR code is used. This is
compatible at these scales with older versions of pixman where bilinear
was always used for GOOD.
BEST uses:
scale < 1/24 : BOX.BOX at size 24
scale < 1/16 : BOX.BOX at size 1/scale
scale < 1 : IMPULSE.LANCZOS2 at size 1/scale
scale < 2.333 : IMPULSE.LANCZOS2 at size 1
scale < 128 : BOX.LANCZOS2 at size 1/(scale-1) (antialiased square pixels)
larger : BOX.LANCZOS2 at size 1/127 (antialias blur gets thicker)
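The two threshold tables above can be sketched as a pair of selection functions. This is only an illustration of the thresholds listed in this message, not the code from the patch: the type and function names (`kernel_pair_t`, `filter_choice_t`, `choose_good`, `choose_best`) are made up for the example, and the NEAREST/BILINEAR shortcuts and the per-axis handling are left out.

```c
/* Hypothetical names; only the thresholds come from the commit message. */
typedef enum {
    KERNELS_BOX_BOX,
    KERNELS_IMPULSE_LANCZOS2,
    KERNELS_BOX_LANCZOS2
} kernel_pair_t;

typedef struct {
    kernel_pair_t kernels; /* reconstruct.sample pair */
    double        size;    /* filter size, the reciprocal of the scale when downscaling */
} filter_choice_t;

/* GOOD: always BOX.BOX, with the size clamped to the range [1, 16]. */
static filter_choice_t
choose_good (double scale)
{
    filter_choice_t f = { KERNELS_BOX_BOX, 1.0 };

    if (scale < 1.0 / 16.0)
        f.size = 16.0;
    else if (scale < 0.75)
        f.size = 1.0 / scale;
    return f;
}

/* BEST: BOX.BOX for extreme downscaling, LANCZOS2 sampling otherwise. */
static filter_choice_t
choose_best (double scale)
{
    filter_choice_t f;

    if (scale < 1.0 / 24.0) {
        f.kernels = KERNELS_BOX_BOX;          f.size = 24.0;
    } else if (scale < 1.0 / 16.0) {
        f.kernels = KERNELS_BOX_BOX;          f.size = 1.0 / scale;
    } else if (scale < 1.0) {
        f.kernels = KERNELS_IMPULSE_LANCZOS2; f.size = 1.0 / scale;
    } else if (scale < 2.333) {
        f.kernels = KERNELS_IMPULSE_LANCZOS2; f.size = 1.0;
    } else if (scale < 128.0) {
        f.kernels = KERNELS_BOX_LANCZOS2;     f.size = 1.0 / (scale - 1.0);
    } else {
        f.kernels = KERNELS_BOX_LANCZOS2;     f.size = 1.0 / 127.0;
    }
    return f;
}
```

For example, a 2x downscale (scale 0.5) would pick BOX.BOX at size 2 under GOOD and IMPULSE.LANCZOS2 at size 2 under BEST.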
v8: Cutoff in BEST between IMPULSE.LANCZOS2 and BOX.LANCZOS2 adjusted for
a better match between the filters.
v9: Use the new negative subsample controls to scale the subsamples. These
were chosen by finding the lowest number that did not add visible
artifacts to the zone plate image.
Scale demo altered to default to GOOD and locked-together x+y scale
Fixed divide-by-zero from all-zero matrix found by stress-test
v11: Whitespace and formatting fixes
Moved demo changes to a later patch
v12: Whitespace and formatting fixes
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
|
|
v9: Described arguments and more filter combinations, fixed some errors.
v11: Further correction, in particular replaced "scale" with "size"
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@redhat.com>
|
|
The intention here is to produce approximately the same number of samples for
each filter size (ie width*samples is the same). This means the caller can
pass a constant rather than a different value for each size. To avoid conflict
with existing code, negative numbers are used to indicate that -n samples are
needed at size==1.
For larger size the width of a BOX.BOX filter is used (the number of samples
is scaled by 2/(size+1)). This may be more than are needed for other filters
which increase in width faster.
For smaller filters it seems 1/size would be needed to keep the same number
of samples on the high-frequency portions. But it appears to be acceptable to
reduce them, I used 2/((size+1)*size) which makes about 1/2 as many samples
as size approaches zero.
These functions were arrived at experimentally by testing for visible
artifacts in the scaling of the zone plate image.
The computed value is then rounded up to the next power of 2 to get the
subsample_bits.
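A minimal sketch of the calculation described above, assuming the two scaling formulas are applied exactly as written and that a non-negative argument keeps its old meaning of a literal bit count. The function name `resolve_subsample_bits` and the cap of 30 bits are assumptions made for this example and are not taken from the patch.

```c
/*
 * Hypothetical helper: turn the caller's subsample argument into
 * subsample_bits for a filter of the given size.
 *
 *   arg >= 0 : used directly as the number of bits (existing behavior)
 *   arg <  0 : -arg samples are wanted at size == 1
 */
static int
resolve_subsample_bits (int arg, double size)
{
    double desired;
    int    bits = 0;

    if (arg >= 0)
        return arg;

    if (size >= 1.0)
        desired = -arg * 2.0 / (size + 1.0);          /* BOX.BOX width scaling */
    else
        desired = -arg * 2.0 / ((size + 1.0) * size); /* more samples for small filters */

    /* Round up to the next power of two; the exponent is subsample_bits.
     * The cap keeps the shift well defined as size approaches zero. */
    while (bits < 30 && (double) (1 << bits) < desired)
        bits++;

    return bits;
}
```

With the demo's default of -12, size 1 asks for 12 samples and rounds up to 16 (4 bits), size 3 asks for 6 and rounds up to 8 (3 bits), and size 0.5 asks for 32 (5 bits).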
The scale demo is modified to allow these negative numbers, and initially
uses -12.
v11: Put subsample calculation in its own function
Minor changes to comments
v12: More info in the commit message
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
|
|
The SIGMA term drops out on simplification.
Expanded the size slightly (from ~4.25 to 5) to make the cutoff less noticeable.
Previously the value at the cutoff was gaussian_filter(sqrt(2)*3/2) = 0.00626
which is larger than the difference between 8-bit pixels (1/255 = 0.003921).
New cutoff is gaussian_filter(2.5) = 0.001089 which is smaller.
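The numbers above can be reproduced under the assumption that the kernel is a normalized Gaussian with sigma = sqrt(2)/2, which is what makes the SIGMA term drop out. That value of sigma is inferred from the quoted results, not stated in this message.

```latex
\begin{align*}
g(x) &= \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/(2\sigma^{2})}
      \quad\text{with } \sigma = \tfrac{\sqrt{2}}{2}
      \;\Rightarrow\; 2\sigma^{2} = 1,\;\; \sigma\sqrt{2\pi} = \sqrt{\pi} \\
g(x) &= \frac{e^{-x^{2}}}{\sqrt{\pi}} \\
g\!\left(\tfrac{3\sqrt{2}}{2}\right) &= \frac{e^{-4.5}}{\sqrt{\pi}} \approx 0.006268
      \;>\; \tfrac{1}{255} \approx 0.003922 \\
g(2.5) &= \frac{e^{-6.25}}{\sqrt{\pi}} \approx 0.001089
      \;<\; \tfrac{1}{255}
\end{align*}
```

The half-widths 3*sqrt(2)/2 (about 2.12) and 2.5 correspond to the total sizes of about 4.25 and 5 mentioned above.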
v11: added some math to commit message
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
v11: Restored range checks
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
The IMPULSE special-cases did not sample the center of the region.
This caused it to sample the filters outside their range, and produce
asymmetric filters and other errors. Fixing this required changing the
arguments to integral() so the correct point could be determined.
I fixed the nice filter and the integration to directly produce normalized
values. Re-normalization is still needed for impulse.box or impulse.triangle
so I did not remove it.
Distribute fixed error over all filter samples, to remove a high-frequency
bit of noise in the center of some filters (lanczos at large scale value).
box.box, which I expect will be very common as it is the proposed "good" filter,
was made a lot faster and more accurate. This is easy as the caller already
intersected the two boxes, so the width is the integral.
v7: This is a merge of 4 patches and lots of new code cleanup and fixes
determined by examining the gnuplot output
v9: Restored the recursion splitting at zero for linear filter
v10: Small change from here moved to previous Simpsons patch so it compiles
Merged patch to get correct subsample positions when subsample_bits==0
v11: Whitespace fixes
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Simpson's rule fits a quadratic through each group of 3 samples (and is exact
for cubics). This gives the samples weights in a pattern of 1,4,2,4,2...4,1,
with the result then divided by 3.
The previous code was using weights of 1,2,6,6...6,2,1. Since it divided by
3 this produced about 2x the desired value (the normalization fixed this).
Also this is effectively a linear interpolation, not Simpson's integration.
With this fix the integration is accurate enough that the number of samples
could be reduced a lot. Multiples of 12 seem to work best.
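For reference, a minimal sketch of composite Simpson's rule with the 1,4,2,...,4,1 weighting described above. It is a generic textbook version, not the integral() function from pixman; the names `simpson`, `integrand_t` and `cube` are placeholders for this example.

```c
typedef double (*integrand_t) (double);

/*
 * Composite Simpson's rule on [a, b] with n intervals (n must be even).
 * Sample weights are 1,4,2,4,...,2,4,1 and the sum is scaled by h/3.
 */
static double
simpson (integrand_t f, double a, double b, int n)
{
    double h   = (b - a) / n;
    double sum = f (a) + f (b);   /* end points have weight 1 */
    int    i;

    for (i = 1; i < n; i++)
        sum += ((i & 1) ? 4.0 : 2.0) * f (a + i * h);

    return sum * h / 3.0;
}

/* Example integrand: Simpson's rule is exact for cubics. */
static double
cube (double x)
{
    return x * x * x;
}
```

Calling `simpson (cube, 0.0, 2.0, 12)` integrates x^3 over [0, 2] with 12 intervals, the sample count this patch settles on, and returns 4 up to floating-point rounding.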
v7: Merged with patch to reduce from 128 samples to 16
v9: Changed samples from 16 to 12
v10: Fixed rebase error that made it not compile
v11: minor whitespace change
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Rearranged so that the entire block of memory for the filter pair
is allocated first, and then filled in. Previous version allocated
and freed two temporary buffers for each filter and did an extra
memcpy.
v8: small refactor to remove the filter_width function
v10: Restored filter_width function but with arguments changed to
match later patches
v11: Removed unused arg and pointer from filter_width function
Whitespace fixes.
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Rename kernel1/2 to reconstruct/sample to match the other functions.
Rename "scale" to "size" to avoid confusion with the scale being applied
to the image, which is the reciprocal of this value.
v10: Renamed "scale" to "size"
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
If enable-gnuplot is configured, then you can pipe the output of a pixman-using program
to gnuplot and get a continuously-updated plot of the horizontal filter. This
works well with demos/scale to test the filter generation.
The plot is all the different subposition filters shuffled together. This is
misleading in a few cases:
IMPULSE.BOX - goes up and down as the subfilters have different numbers of non-zero samples
IMPULSE.TRIANGLE - somewhat crooked for the same reason
1-wide filters - looks triangular, but a 1-wide box would be more accurate
v7: First time this ability was included
v8: Use config option
Moved code to the filter generator
Modified scale demo to not call filter generator a second time.
v10: Only print if successful generation of plots
Use #ifdef, not #if
v11: small whitespace fixes
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
This makes the speed of the demo more accurate, as the filter generation
is a visible fraction of the time it takes to do a transform. This also
prevents the output of unused filters in the gnuplot option in the next
patch.
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
This allows testing of GOOD/BEST and to do comparisons between
the basic filters and PIXMAN_FILTER_SEPARABLE_CONVOLUTION settings.
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
rectangle
This is much more accurate and less blurry. In particular the filtering does
not change as the image is rotated.
Signed-off-by: Bill Spitzak <spitzak@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
<float.h> is included unconditionally by pixman-private.h, which in
turn gets included by assembler files. Unfortunately, with certain C
libraries (like the musl C library), <float.h> cannot be included in
assembler files:
CCLD libpixman-arm-simd.la
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h: Assembler messages:
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h:8: Error: bad instruction `int __flt_rounds(void)'
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h: Assembler messages:
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h:8: Error: bad instruction `int __flt_rounds(void)'
It turns out however that <float.h> is not needed by assembly files,
so we move its inclusion within the #ifndef __ASSEMBLER__ condition,
which solves the problem.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
|
|
The `check` target in test/Makefile.win32 assumed that any non-0 exit
code from the tests was an error, but the testsuite is currently using
77 as a SKIP exit code (based on the convention used in autotools).
Fixes fence-image-self-test and cover-test (now reported as SKIP).
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
The `rm` command is not usually available when running on Win32 in a
`cmd.exe` shell. Instead the shell provides the `del` builtin, which
has somewhat more limited wildcard expansion and error handling.
This makes all of the Makefile targets work on Win32 both using
`cmd.exe` and using the MSYS environment.
Signed-off-by: Simon Richter <Simon.Richter@hogyros.de>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
When the build is performed using `cmd.exe` as shell, the `mkdir`
command does not support the `-p` flag. The ability to create multiple
nested folders is not used, hence it can be easily replaced by only
creating the directory if it does not exist.
This makes the build work on the `cmd.exe` shell, except for the
`clean` targets.
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Instead of explicitly depending on "pixman" for the "all" and "check"
targets, rely on the dependency to the .lib file
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Since 3d81d89c292058522cce91338028d9b4c4a23c24 BUILT_SOURCES is not
used anymore, but it was unintentionally left in Win32 Makefiles.
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
This patch modifies the SSE2 & SSSE3 tests in configure.ac to use a
global variable to initialize vector variables. In addition, we now
return the value of the computation instead of 0.
This is done so gcc 4.9 (and lower) won't optimize away the SSE assembly
instructions (when using -O1 and higher), because then the configure test
might incorrectly pass even though the assembler doesn't support the
SSE instructions (the test will pass because the compiler does support
the intrinsics).
v2: instead of using volatile, use a global variable as input
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Older versions of clang emitted an error on the "K" constraint, but at
least since version 3.7 it is supported. Just like gcc, this
constraint is only allowed for constants, but apparently clang
requires them to be known before inlining.
Using the macro definition _mm_shuffle_pi16(A, N) ensures that the "K"
constraint is always applied to a literal constant, independently from
the compiler optimizations and allows building pixman-mmx on modern
clang.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
|
|
This reverts commit 7de61d8d14e84623b6fa46506eb74f938287f536.
Newer versions of gcc allow inclusion of xmmintrin.h without -msse, but
still won't allow usage of the intrinsics.
Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=564024
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
On MacOS X, according to the manpage of mprotect(), "When a program
violates the protections of a page, it gets a SIGBUS or SIGSEGV
signal.", but fence-image-self-test was only accepting a SIGSEGV as
notification of invalid access.
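The check described above can be sketched as follows, assuming the test runs the faulting access in a child process and inspects its wait status. The helper names `is_protection_fault` and `child_killed_by_fault` are hypothetical, and `raise()` stands in for the real out-of-bounds access.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Accept either signal that mprotect() violations may produce. */
static int
is_protection_fault (int status)
{
    if (!WIFSIGNALED (status))
        return 0;
    return WTERMSIG (status) == SIGSEGV || WTERMSIG (status) == SIGBUS;
}

/*
 * Fork a child that terminates with the given signal and report whether
 * the parent would classify it as a protection fault.  Returns -1 on
 * fork/wait failure.
 */
static int
child_killed_by_fault (int sig)
{
    int   status;
    pid_t pid = fork ();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal (sig, SIG_DFL);  /* make sure the default action applies */
        raise (sig);
        _exit (0);              /* only reached if the signal did not kill us */
    }

    if (waitpid (pid, &status, 0) != pid)
        return -1;

    return is_protection_fault (status);
}
```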
Fixes fence-image-self-test
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
|
|
We had lots of hacks to handle the inability to include xmmintrin.h
without compiling with -msse (lest SSE instructions be used in
pixman-mmx.c). Some recent version of gcc relaxed this restriction.
Change configure.ac to test that xmmintrin.h can be included and that we
can use some intrinsics from it, and remove the work-around code from
pixman-mmx.c.
Evidently allows gcc 4.9.3 to optimize better as well:
   text    data     bss     dec     hex filename
 657078   30848     680  688606   a81de libpixman-1.so.0.33.3 before
 656710   30848     680  688238   a806e libpixman-1.so.0.33.3 after
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Tested-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
|
|
Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
Gentoo ppc (32-bit userland) gave the following results:
before: over_n_8888 = L1: 147.47 L2: 205.86 M: 121.07
after:  over_n_8888 = L1: 287.27 L2: 261.09 M: 133.48
Cairo non-trimmed benchmarks on POWER8, 3.4GHz 8 Cores:
ocitysmap 659.69 -> 611.71 : 1.08x speedup
xfce4-terminal-a1 2725.22 -> 2547.47 : 1.07x speedup
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Patch "Remove the 8e extra safety margin in COVER_CLIP analysis" reduced
the required image area for setting the COVER flags in
pixman.c:analyze_extent(). Do the same reduction in affine-bench.
Leaving the old calculations in place would be very confusing for anyone
reading the code.
Also add a comment that explains how affine-bench wants to hit the COVER
paths. This explains why the intricate extent calculations are copied
from pixman.c.
[Pekka: split patch, change comments, write commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
As discussed in
http://lists.freedesktop.org/archives/pixman/2015-August/003905.html
the 8 * pixman_fixed_e (8e) adjustment which was applied to the transformed
coordinates is a legacy of rounding errors which used to occur in old
versions of Pixman, but which no longer apply. For any affine transform,
you are now guaranteed to get the same result by transforming the upper
coordinate as though you transform the lower coordinate and add (size-1)
steps of the increment in source coordinate space. No projective
transform routines use the COVER_CLIP flags, so they cannot be affected.
Proof by Siarhei Siamashka:
Let's take a look at the following affine transformation matrix (with 16.16
fixed point values) and two vectors:
        | a  b     c    |
    M = | d  e     f    |
        | 0  0  0x10000 |

        |  x_dst  |
    P = |  y_dst  |
        | 0x10000 |

            | 0x10000 |
    ONE_X = |    0    |
            |    0    |
The current matrix multiplication code does the following calculations:
            | (a * x_dst + b * y_dst + 0x8000) / 0x10000 + c |
    M * P = | (d * x_dst + e * y_dst + 0x8000) / 0x10000 + f |
            |                    0x10000                     |
These calculations are not perfectly exact and we may get rounding
because the integer coordinates are adjusted by 0.5 (or 0x8000 in the
16.16 fixed point format) before doing matrix multiplication. For
example, if the 'a' coefficient is an odd number and 'b' is zero,
then we are losing some of the least significant bits when dividing by
0x10000.
So we need to strictly prove that the following expression is always
true even though we have to deal with rounding:
                                          | a |
    M * (P + ONE_X) - M * P = M * ONE_X = | d |
                                          | 0 |
or
((a * (x_dst + 0x10000) + b * y_dst + 0x8000) / 0x10000 + c)
-
((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
=
a
It's easy to see that this is equivalent to
a + ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
- ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
=
a
Which means that stepping exactly by one pixel horizontally in the
destination image space (advancing 'x_dst' by 0x10000) is the same as
changing the transformed 'x_src' coordinate in the source image space
exactly by 'a'. The same applies to the vertical direction too.
Repeating these steps, we can reach any pixel in the source image
space and get exactly the same fixed point coordinates as doing
matrix multiplications per each pixel.
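The identity can also be checked numerically. The sketch below is an illustration written for this message, not code from pixman: it uses a 64-bit accumulator and an explicit floor division by 0x10000 in place of the shift, and the names `floor_div_65536`, `transform_x` and `stepping_matches` are made up for the example.

```c
#include <stdint.h>

/* Floor division by 0x10000, written out so negative values are portable. */
static int64_t
floor_div_65536 (int64_t v)
{
    int64_t q = v / 65536;

    if (v % 65536 < 0)
        q--;
    return q;
}

/* First row of M * P as in the proof: (a*x + b*y + 0x8000) / 0x10000 + c. */
static int64_t
transform_x (int32_t a, int32_t b, int32_t c, int64_t x_dst, int64_t y_dst)
{
    return floor_div_65536 ((int64_t) a * x_dst + (int64_t) b * y_dst + 0x8000) + c;
}

/*
 * Returns 1 if transforming the pixel n steps to the right gives exactly
 * the same source coordinate as transforming the first pixel and adding
 * n increments of 'a'.
 */
static int
stepping_matches (int32_t a, int32_t b, int32_t c,
                  int64_t x_dst, int64_t y_dst, int n)
{
    int64_t direct  = transform_x (a, b, c, x_dst + (int64_t) n * 0x10000, y_dst);
    int64_t stepped = transform_x (a, b, c, x_dst, y_dst) + (int64_t) n * a;

    return direct == stepped;
}
```

The result is 1 even for odd or negative 'a', because advancing x_dst by 0x10000 adds an exact multiple of 0x10000 to the accumulator, so the rounding of the remaining terms is unchanged.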
By the way, the older matrix multiplication implementation, which was
relying on less accurate calculations with three intermediate roundings
"((a + 0x8000) >> 16) + ((b + 0x8000) >> 16) + ((c + 0x8000) >> 16)",
also has the same properties. However reverting
http://cgit.freedesktop.org/pixman/commit/?id=ed39992564beefe6b12f81e842caba11aff98a9c
and applying this "Remove the 8e extra safety margin in COVER_CLIP
analysis" patch makes the cover test fail. The real reason why it fails
is that the old pixman code was using "pixman_transform_point_3d()"
function
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n49
for getting the transformed coordinate of the top left corner pixel
in the image scaling code, but at the same time using a different
"pixman_transform_point()" function
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n82
in the extents calculation code for setting the cover flag. And these
functions did the intermediate rounding differently. That's why the 8e
safety margin was needed.
** proof ends
However, for COVER_CLIP_NEAREST, the actual margins added were not 8e.
Because the half-way cases round down, that is, coordinate 0 hits pixel
index -1 while coordinate e hits pixel index 0, the extra safety margins
were actually 7e to the left and up, and 9e to the right and down. This
patch removes the 7e and 9e margins and restores the -e adjustment
required for NEAREST sampling in Pixman. For reference, see
pixman/rounding.txt.
For COVER_CLIP_BILINEAR, the margins were exactly 8e as there are no
additional offsets to be restored, so simply removing the 8e additions
is enough.
Proof:
All implementations must give the same numerical results as
bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear().
The former does
int x0 = pixman_fixed_to_int (x - pixman_fixed_e);
which maps directly to the new test for the nearest flag, when you consider
that x0 must fall in the interval [0,width).
The latter does
x1 = x - pixman_fixed_1 / 2;
x1 = pixman_fixed_to_int (x1);
x2 = x1 + 1;
When you write a COVER path, you take advantage of the assumption that
both x1 and x2 fall in the interval [0, width).
As samplers are allowed to fetch the pixel at x2 unconditionally, we
require
x1 >= 0
x2 < width
so
x - pixman_fixed_1 / 2 >= 0
x - pixman_fixed_1 / 2 + pixman_fixed_1 < width * pixman_fixed_1
so
pixman_fixed_to_int (x - pixman_fixed_1 / 2) >= 0
pixman_fixed_to_int (x + pixman_fixed_1 / 2) < width
which matches the source code lines for the bilinear case, once you delete
the lines that add the 8e margin.
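The two conditions derived above can be written as per-coordinate predicates. This is a sketch for a single transformed coordinate, whereas analyze_extent() applies the same tests to the corners of the transformed extents. The `fixed_16_16` type and the macros are stand-ins defined here for `pixman_fixed_t`, `pixman_fixed_e`, `pixman_fixed_1` and `pixman_fixed_to_int()`, so that the example does not need pixman.h.

```c
#include <stdint.h>

typedef int32_t fixed_16_16;                    /* stand-in for pixman_fixed_t */
#define FIXED_E         ((fixed_16_16) 1)       /* smallest step, "e" */
#define FIXED_1         ((fixed_16_16) 0x10000)
/* Relies on arithmetic right shift of negative values, as pixman does. */
#define FIXED_TO_INT(f) ((int) ((f) >> 16))

/* NEAREST: the single sample x0 must lie in [0, width). */
static int
covers_nearest (fixed_16_16 x, int width)
{
    int x0 = FIXED_TO_INT (x - FIXED_E);

    return x0 >= 0 && x0 < width;
}

/* BILINEAR: both x1 and x2 = x1 + 1 must lie in [0, width). */
static int
covers_bilinear (fixed_16_16 x, int width)
{
    return FIXED_TO_INT (x - FIXED_1 / 2) >= 0 &&
           FIXED_TO_INT (x + FIXED_1 / 2) < width;
}
```

For a 4-pixel-wide image, coordinate 0 fails the NEAREST test (it maps to index -1) while coordinate e passes, which matches the half-way rounding described earlier.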
Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: adjusted commit message, left affine-bench changes for another patch]
[Pekka: add commit message parts from Siarhei]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
Each of the aligns can only add a maximum of 15 bytes to the space
requirement. This permits some edge cases to use the stack buffer where
previously it would have deduced that a heap buffer was required.
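The 15-byte bound follows directly from rounding an address up to a 16-byte boundary. The sketch below assumes 16-byte alignment, which is what the bound implies; the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address up to the next multiple of 16. */
static uintptr_t
align16_up (uintptr_t addr)
{
    return (addr + 15) & ~(uintptr_t) 15;
}

/* Largest padding the rounding can add, checked over a few 16-byte periods. */
static size_t
worst_case_padding (void)
{
    size_t    worst = 0;
    uintptr_t a;

    for (a = 0; a < 64; a++) {
        size_t pad = (size_t) (align16_up (a) - a);

        if (pad > worst)
            worst = pad;
    }
    return worst;
}
```

So a buffer that must hold several aligned regions needs at most 15 extra bytes per region, not 16.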
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
|
|
As https://bugs.freedesktop.org/show_bug.cgi?id=92027#c6 explains,
the stack is allocated at the very top of the process address space
in some configurations (32-bit x86 systems with ASLR disabled).
And the careless computations done with the 'dest_buffer' pointer
may overflow, failing the buffer upper limit check.
The problem can be reproduced using the 'stress-test' program,
which segfaults when executed via setarch:
export CFLAGS="-O2 -m32" && ./autogen.sh
./configure --disable-libpng --disable-gtk && make
setarch i686 -R test/stress-test
This patch introduces the required corrections. The extra check
for negative 'width' may be redundant (the invalid 'width' value
is not supposed to reach here), but it's better to play safe
when dealing with the buffers allocated on stack.
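One way to write such a limit check without forming an out-of-range pointer is to compare sizes instead of addresses. This is a generic sketch of that technique, not the change made by the patch; `fits_in_buffer` and `scratch` are names invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the on-stack scanline buffer. */
static uint8_t scratch[64];

/*
 * Returns 1 if 'width' pixels of 'bytes_per_pixel' bytes fit between
 * 'cursor' and 'end'.  The comparison is done on byte counts, so no
 * pointer past 'end' is ever computed and nothing can wrap around even
 * when the buffer sits at the very top of the address space.
 */
static int
fits_in_buffer (const uint8_t *cursor, const uint8_t *end,
                int width, size_t bytes_per_pixel)
{
    size_t remaining;

    if (width < 0 || bytes_per_pixel == 0 || cursor > end)
        return 0;

    remaining = (size_t) (end - cursor);

    /* Divide instead of multiplying to avoid overflow in width * bpp. */
    return (size_t) width <= remaining / bytes_per_pixel;
}
```

The unsafe form would be `cursor + width * bytes_per_pixel <= end`, where the left-hand side can wrap to a small address and pass the test.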
Reported-by: Ludovic Courtès <ludo@gnu.org>
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: soren.sandmann@gmail.com
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Some architectures, such as Microblaze and Nios2, currently do not
implement FE_DIVBYZERO, even though they have <fenv.h> and
feenableexcept(). This commit adds a configure.ac check to verify
whether FE_DIVBYZERO is defined or not, and if not, disables the
problematic code in test/utils.c.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Now that we replaced the expensive functions with better performing
alternatives, we should remove them so they will not be used again.
Running Cairo benchmark on trimmed traces gave the following results:
POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.
Speedups
========
t-firefox-scrolling     1232.30 -> 1096.55 : 1.12x
t-gnome-terminal-vim     613.86 ->  553.10 : 1.11x
t-evolution              405.54 ->  371.02 : 1.09x
t-firefox-talos-gfx      919.31 ->  862.27 : 1.07x
t-gvim                   653.02 ->  616.85 : 1.06x
t-firefox-canvas-alpha   941.29 ->  890.42 : 1.06x
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
|
|
POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.
reference memcpy speed = 25008.9MB/s (6252.2MP/s for 32bpp fills)
         Before    After     Change
---------------------------------------------
L1       91.32     182.84    +100.22%
L2       94.94     182.83    +92.57%
M        95.55     181.51    +89.96%
HT       88.96     162.09    +82.21%
VT       87.4      168.35    +92.62%
R        83.37     146.23    +75.40%
RT       66.4      91.5      +37.80%
Kops/s   683       859       +25.77%
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
|
|
This patch optimizes vmx_composite_over_n_8888_8888_ca by removing use
of expand_alpha_1x128, unpack/pack and in_over_2x128 in favor of
splat_alpha, in_over and MUL/ADD macros from pixman_combine32.h.
Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.2 ppc64le gave the following results:
reference memcpy speed = 23475.4MB/s (5868.8MP/s for 32bpp fills)
         Before    After     Change
--------------------------------------------
L1       244.97    474.05    +93.51%
L2       243.74    473.05    +94.08%
M        243.29    467.16    +92.02%
HT       144.03    252.79    +75.51%
VT       174.24    279.03    +60.14%
R        109.86    149.98    +36.52%
RT       47.96     53.18     +10.88%
Kops/s   524       576       +9.92%
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
|
|
This patch optimizes scaled_nearest_scanline_vmx_8888_8888_OVER and all
the functions it calls (combine1, combine4 and
core_combine_over_u_pixel_vmx).
The optimization is done by removing use of expand_alpha_1x128 and
expand_alpha_2x128 in favor of splat_alpha and MUL/ADD macros from
pixman_combine32.h.
Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.2 ppc64le gave the following results:
reference memcpy speed = 24847.3MB/s (6211.8MP/s for 32bpp fills)
         Before    After     Change
--------------------------------------------
L1       182.05    210.22    +15.47%
L2       180.6     208.92    +15.68%
M        180.52    208.22    +15.34%
HT       130.17    178.97    +37.49%
VT       145.82    184.22    +26.33%
R        104.51    129.38    +23.80%
RT       48.3      61.54     +27.41%
Kops/s   430       504       +17.21%
v2: Check *pm is not NULL before dereferencing it in combine1()
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
|
|
Enable the fast path added in the previous patch by moving the lookup
table entries to their proper locations.
Lowlevel-blt-bench benchmark statistics with 30 iterations, showing the
effect of adding this one patch on top of
"armv6: Add over_n_8888 fast path (disabled)", which was applied on
fd595692941f3d9ddea8934462bd1d18aed07c65.
         Before           After
      Mean  StdDev    Mean  StdDev   Confidence   Change
L1    12.5   0.04     45.2   0.10     100.00%    +263.1%
L2    11.1   0.02     43.2   0.03     100.00%    +289.3%
M      9.4   0.00     42.4   0.02     100.00%    +351.7%
HT     8.5   0.02     25.4   0.10     100.00%    +198.8%
VT     8.4   0.02     22.3   0.07     100.00%    +167.0%
R      8.2   0.02     23.1   0.09     100.00%    +183.6%
RT     5.4   0.05     11.4   0.21     100.00%    +110.3%
At most 3 outliers rejected per test per set.
Iterating here means that lowlevel-blt-bench was executed 30 times, and
the statistics above were computed from the output.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
|
|
This new fast path is initially disabled by putting the entries in the
lookup table after the sentinel. The compiler cannot tell the new code
is not used, so it cannot eliminate the code. Also the lookup table size
will include the new fast path. When the follow-up patch then enables
the new fast path, the binary layout (alignments, size, etc.) will stay
the same compared to the disabled case.
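The trick can be illustrated with a toy lookup table. The structure below is only loosely modeled on a fast path table: the entry layout, the operator codes and the function names are simplified placeholders, not the real pixman_fast_path_t or the ARMv6 entries.

```c
#include <stddef.h>

typedef void (*composite_func_t) (void);

typedef struct {
    int              op;    /* OP_NONE marks the sentinel */
    composite_func_t func;
} fast_path_entry_t;

enum { OP_NONE = 0, OP_SRC_8888, OP_OVER_8888, OP_OVER_N_8888 };

static void composite_src_8888 (void)    {}
static void composite_over_8888 (void)   {}
static void composite_over_n_8888 (void) {}   /* the new, still disabled path */

static const fast_path_entry_t fast_paths[] = {
    { OP_SRC_8888,    composite_src_8888 },
    { OP_OVER_8888,   composite_over_8888 },
    { OP_NONE,        NULL },                   /* sentinel: lookup stops here */
    { OP_OVER_N_8888, composite_over_n_8888 },  /* in the table, but unreachable */
};

/* Scan up to the sentinel; entries placed after it are never returned. */
static composite_func_t
lookup_fast_path (int op)
{
    const fast_path_entry_t *p;

    for (p = fast_paths; p->op != OP_NONE; p++) {
        if (p->op == op)
            return p->func;
    }
    return NULL;
}
```

Enabling the path later only means moving its entry above the sentinel, so the table keeps its four entries and the same size in both cases.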
Keeping the binary layout identical is important for benchmarking on
Raspberry Pi 1. The addresses at which functions are loaded will have a
significant impact on benchmark results, causing unexpected performance
changes. Keeping all function addresses the same across the patch
enabling a new fast path improves the reliability of benchmarks.
Benchmark results are included in the patch enabling this fast path.
[Pekka: disabled the fast path, commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
|
|
This test aims to verify both numerical correctness and the honouring of
array bounds for scaled plots (both nearest-neighbour and bilinear) at or
close to the boundary conditions for applicability of "cover" type fast paths
and iter fetch routines.
It has a secondary purpose: by setting the env var EXACT (to any value) it
will only test plots that are exactly on the boundary condition. This makes
it possible to ensure that "cover" routines are being used to the maximum,
although this requires the use of a debugger or code instrumentation to
verify.
Changes in v4:
Check the fence page size and skip the test if it is too large. Since
we need to deal with pixman_fixed_t coordinates that go beyond the
real image width, make the page size limit 16 kB. A 32 kB or larger
page size would cause an a8 image width to be 32k or more, which is no
longer representable in pixman_fixed_t.
Use a shorthand variable 'filter' in test_cover().
Whitespace adjustments.
Changes in v5:
Skip if fenced memory is not supported. Do you know of any such
platform?
Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: changes in v4 and v5]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Add a new option to PIXMAN_DISABLE: "wholeops". This option disables all
whole-operation fast paths regardless of implementation level, except
the general path (general_composite_rect).
The purpose is to add a debug option that allows us to test optimized
iterator paths specifically. With this, it is possible to:
- see if fast paths mask bugs in iterators
- compare fast paths with iterator paths for performance
The effect was tested on x86_64 by running:
$ PIXMAN_DISABLE='' ./test/lowlevel-blt-bench over_8888_8888
$ PIXMAN_DISABLE='wholeops' ./test/lowlevel-blt-bench over_8888_8888
In the first case time is spent in sse2_composite_over_8888_8888(), and
in the latter in sse2_combine_over_u().
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
|
|
Add a function to get the page size used for memory fence purposes, and
use it everywhere where getpagesize() was used.
This offers a single point in code to override the page size, in case
one wants to experiment how the tests work with a higher page size than
what the developer's machine has.
This also offers a clean API, without adding #ifdefs, to tests for
checking the page size.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
Used a wrong variable name, causing:
/home/pq/git/pixman/demos/../test/utils.c: In function ‘fence_image_create_bits’:
/home/pq/git/pixman/demos/../test/utils.c:562:46: error: ‘width’ undeclared (first use in this function)
Use the correct variable.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
Tests that fence_malloc and fence_image_create_bits actually work: that
out-of-bounds and out-of-row (unused stride area) accesses trigger
SIGSEGV.
If fence_malloc is a dummy (FENCE_MALLOC_ACTIVE not defined), this test
is skipped.
Changes in v2:
- check FENCE_MALLOC_ACTIVE value, not whether it is defined
- test that reading bytes near the fence pages does not cause a
segmentation fault
Changes in v3:
- Do not print progress messages unless VERBOSE environment variable is
set. Avoid spamming the terminal output of 'make check' on some
versions of autotools.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
Useful for detecting out-of-bounds accesses in composite operations.
This will be used by follow-up patches adding new tests.
Changes in v2:
- fix style on fence_image_create_bits args
- add page to stride only if stride_fence
- add comment on the fallback definition about freeing storage
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|
|
Define a new token to simplify checking whether fence_malloc() actually
can catch out-of-bounds access.
This will be used in the future to skip tests that rely on fence_malloc
checking functionality.
Changes in v2:
- #define FENCE_MALLOC_ACTIVE always, but change its value to help catch
use of it without including utils.h
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
|