Age | Commit message | Author | Files | Lines
2016-01-31 | pixman-private: include <float.h> only in C code (HEAD, master) | Thomas Petazzoni | 1 | -2/+1

<float.h> is included unconditionally by pixman-private.h, which in turn gets included by assembler files. Unfortunately, with certain C libraries (like the musl C library), <float.h> cannot be included in assembler files:

      CCLD     libpixman-arm-simd.la
    /home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h: Assembler messages:
    /home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h:8: Error: bad instruction `int __flt_rounds(void)'
    /home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h: Assembler messages:
    /home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h:8: Error: bad instruction `int __flt_rounds(void)'

It turns out however that <float.h> is not needed by assembly files, so we move its inclusion within the #ifndef __ASSEMBLER__ condition, which solves the problem.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
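A minimal sketch of the approach described above (not the literal pixman-private.h hunk): the include is moved inside the guard that assembler translation units define, so only C code ever sees it.

    /* Sketch of the fix; __ASSEMBLER__ is the guard macro named in the
     * commit message, defined by the toolchain when preprocessing .S files. */
    #ifndef __ASSEMBLER__
    #include <float.h>
    #endif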
2015-12-30 | build: Distinguish SKIP and FAIL on Win32 | Andrea Canciani | 1 | -11/+20

The `check` target in test/Makefile.win32 assumed that any non-0 exit code from the tests was an error, but the testsuite is currently using 77 as a SKIP exit code (based on the convention used in autotools).

Fixes fence-image-self-test and cover-test (now reported as SKIP).

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
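For context, a hedged sketch of the exit-code convention the testsuite relies on (77 = SKIP, as in the autotools test harness); the test body shown is purely illustrative.

    #include <stdlib.h>

    /* Illustrative only: a test that cannot run on this platform signals
     * SKIP by exiting with status 77, which the autotools harness (and now
     * test/Makefile.win32) treats as a skip rather than a failure. */
    int
    main (void)
    {
        int supported = 0;    /* assume the required feature is missing */

        if (!supported)
            return 77;        /* SKIP */

        return 0;             /* PASS */
    }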
2015-12-23 | build: Use `del` instead of `rm` on `cmd.exe` shells | Simon Richter | 1 | -2/+6

The `rm` command is not usually available when running on Win32 in a `cmd.exe` shell. Instead the shell provides the `del` builtin, which has somewhat more limited wildcard expansion and error handling.

This makes all of the Makefile targets work on Win32 both using `cmd.exe` and using the MSYS environment.

Signed-off-by: Simon Richter <Simon.Richter@hogyros.de>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-12-23 | build: Do not use `mkdir -p` on Windows | Andrea Canciani | 1 | -2/+3

When the build is performed using `cmd.exe` as the shell, the `mkdir` command does not support the `-p` flag. The ability to create multiple nested folders is not used, hence it can easily be replaced by creating the directory only if it does not already exist.

This makes the build work in the `cmd.exe` shell, except for the `clean` targets.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-12-23 | build: Avoid phony `pixman` target in test/Makefile.win32 | Andrea Canciani | 1 | -6/+4

Instead of explicitly depending on "pixman" for the "all" and "check" targets, rely on the dependency on the .lib file.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-12-23 | build: Remove use of BUILT_SOURCES from Makefile.win32 | Andrea Canciani | 1 | -1/+1

Since 3d81d89c292058522cce91338028d9b4c4a23c24, BUILT_SOURCES is not used anymore, but it was unintentionally left in the Win32 Makefiles.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-12-23 | Post 0.34 branch creation version bump to 0.35.1 | Oded Gabbay | 1 | -2/+2

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

2015-12-22 | Post-release version bump to 0.33.7 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

2015-12-22 | Pre-release version bump to 0.33.6 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-12-22 | configure.ac: fix test for SSE2 & SSSE3 assembler support | Oded Gabbay | 1 | -4/+6

This patch modifies the SSE2 & SSSE3 tests in configure.ac to use a global variable to initialize the vector variables. In addition, we now return the value of the computation instead of 0.

This is done so gcc 4.9 (and lower) won't optimize away the SSE assembly instructions (when using -O1 and higher), because then the configure test might incorrectly pass even though the assembler doesn't support the SSE instructions (the test would pass because the compiler does support the intrinsics).

v2: instead of using volatile, use a global variable as input

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
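A hedged sketch of what such a configure-time probe can look like (an illustration of the technique, not the exact program embedded in configure.ac): the input comes from a global and the result is returned, so the compiler cannot fold the vector work away and real SSE2 instructions reach the assembler.

    #include <emmintrin.h>

    /* A global input keeps the compiler from constant-folding the vector
     * computation, so SSE2 instructions must actually be emitted. */
    __m128i global_input;

    int
    main (void)
    {
        __m128i v = _mm_add_epi32 (global_input, global_input);

        /* Return the result instead of 0 so the computation is not
         * eliminated as dead code at -O1 and higher. */
        return _mm_cvtsi128_si32 (v);
    }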
2015-11-18 | mmx: Improve detection of support for "K" constraint | Andrea Canciani | 2 | -21/+18

Older versions of clang emitted an error on the "K" constraint, but at least since version 3.7 it is supported. Just like gcc, this constraint is only allowed for constants, but apparently clang requires them to be known before inlining.

Using the macro definition _mm_shuffle_pi16(A, N) ensures that the "K" constraint is always applied to a literal constant, independently of the compiler optimizations, and allows building pixman-mmx on modern clang.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
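A hedged sketch of the kind of macro the message refers to (illustrative, not the verbatim pixman-mmx.c definition): because the shuffle immediate is substituted textually, the "K" constraint always sees a literal constant, regardless of inlining or optimization level.

    /* Requires the GCC/clang statement-expression extension.
     * "y" = MMX register operand, "K" = small constant integer operand. */
    #define _mm_shuffle_pi16(A, N)                                \
        ({                                                        \
            __m64 ret;                                            \
            asm ("pshufw %2, %1, %0"                              \
                 : "=y" (ret)                                     \
                 : "y" (A), "K" (N));                             \
            ret;                                                  \
        })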
2015-11-18 | Revert "mmx: Use MMX2 intrinsics from xmmintrin.h directly." | Matt Turner | 2 | -8/+71

This reverts commit 7de61d8d14e84623b6fa46506eb74f938287f536.

Newer versions of gcc allow inclusion of xmmintrin.h without -msse, but still won't allow usage of the intrinsics.

Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=564024
2015-10-23 | Post-release version bump to 0.33.5 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

2015-10-23 | Pre-release version bump to 0.33.4 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-10-16 | test: Fix fence-image-self-test on Mac | Andrea Canciani | 2 | -8/+10

On MacOS X, according to the manpage of mprotect(), "When a program violates the protections of a page, it gets a SIGBUS or SIGSEGV signal.", but fence-image-self-test was only accepting a SIGSEGV as notification of invalid access.

Fixes fence-image-self-test.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
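A minimal sketch of accepting either signal (an illustration of the fix's idea, not the test's actual code): install the same handler for both SIGSEGV and SIGBUS so either one counts as the expected fault.

    #include <signal.h>
    #include <unistd.h>

    static void
    on_expected_fault (int sig)
    {
        /* Either SIGSEGV or SIGBUS means the fence page did its job. */
        _exit (0);
    }

    static void
    install_fault_handlers (void)
    {
        struct sigaction sa;

        sa.sa_handler = on_expected_fault;
        sa.sa_flags = 0;
        sigemptyset (&sa.sa_mask);

        sigaction (SIGSEGV, &sa, NULL);
        sigaction (SIGBUS, &sa, NULL);   /* what MacOS X may deliver */
    }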
2015-10-13 | mmx: Use MMX2 intrinsics from xmmintrin.h directly. | Matt Turner | 2 | -71/+8

We had lots of hacks to handle the inability to include xmmintrin.h without compiling with -msse (lest SSE instructions be used in pixman-mmx.c). Some recent version of gcc relaxed this restriction.

Change configure.ac to test that xmmintrin.h can be included and that we can use some intrinsics from it, and remove the work-around code from pixman-mmx.c.

Evidently allows gcc 4.9.3 to optimize better as well:

       text    data     bss     dec     hex  filename
     657078   30848     680  688606   a81de  libpixman-1.so.0.33.3 before
     656710   30848     680  688238   a806e  libpixman-1.so.0.33.3 after

Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Tested-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
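A hedged sketch of the kind of probe configure.ac could use for this (illustrative only): include xmmintrin.h and actually call an MMX2 intrinsic, so the check fails on compilers that accept the include but still reject the intrinsics without -msse.

    #include <mmintrin.h>
    #include <xmmintrin.h>

    /* Calling an MMX2 intrinsic (not just doing the #include) is what
     * distinguishes compilers that merely tolerate the header from ones
     * that let pixman-mmx.c use the intrinsics directly. */
    int
    main (void)
    {
        __m64 a = _mm_cvtsi32_si64 (0x01020304);
        __m64 b = _mm_mulhi_pu16 (a, a);   /* MMX2 integer op from SSE */

        return _mm_cvtsi64_si32 (b);
    }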
2015-09-29 | vmx: implement fast path vmx_composite_over_n_8888 | Siarhei Siamashka | 1 | -0/+54

Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz, Gentoo ppc (32-bit userland) gave the following results:

    before: over_n_8888 = L1: 147.47  L2: 205.86  M:121.07
    after:  over_n_8888 = L1: 287.27  L2: 261.09  M:133.48

Cairo non-trimmed benchmarks on POWER8, 3.4GHz 8 Cores:

    ocitysmap            659.69 ->  611.71 : 1.08x speedup
    xfce4-terminal-a1   2725.22 -> 2547.47 : 1.07x speedup

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-09-25 | affine-bench: remove 8e margin from COVER area | Ben Avison | 1 | -6/+18

Patch "Remove the 8e extra safety margin in COVER_CLIP analysis" reduced the required image area for setting the COVER flags in pixman.c:analyze_extent(). Do the same reduction in affine-bench.

Leaving the old calculations in place would be very confusing for anyone reading the code. Also add a comment that explains how affine-bench wants to hit the COVER paths. This explains why the intricate extent calculations are copied from pixman.c.

[Pekka: split patch, change comments, write commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
2015-09-25 | Remove the 8e extra safety margin in COVER_CLIP analysis | Ben Avison | 1 | -13/+4

As discussed in
http://lists.freedesktop.org/archives/pixman/2015-August/003905.html
the 8 * pixman_fixed_e (8e) adjustment which was applied to the transformed coordinates is a legacy of rounding errors which used to occur in old versions of Pixman, but which no longer apply. For any affine transform, you are now guaranteed to get the same result by transforming the upper coordinate as though you transform the lower coordinate and add (size-1) steps of the increment in source coordinate space. No projective transform routines use the COVER_CLIP flags, so they cannot be affected.

Proof by Siarhei Siamashka:

Let's take a look at the following affine transformation matrix (with 16.16 fixed point values) and two vectors:

        | a    b    c       |
    M = | d    e    f       |
        | 0    0    0x10000 |

        | x_dst   |
    P = | y_dst   |
        | 0x10000 |

            | 0x10000 |
    ONE_X = | 0       |
            | 0       |

The current matrix multiplication code does the following calculations:

            | (a * x_dst + b * y_dst + 0x8000) / 0x10000 + c |
    M * P = | (d * x_dst + e * y_dst + 0x8000) / 0x10000 + f |
            | 0x10000                                        |

These calculations are not perfectly exact and we may get rounding because the integer coordinates are adjusted by 0.5 (or 0x8000 in the 16.16 fixed point format) before doing matrix multiplication. For example, if the 'a' coefficient is an odd number and 'b' is zero, then we are losing some of the least significant bits when dividing by 0x10000.

So we need to strictly prove that the following expression is always true even though we have to deal with rounding:

                                               | a |
    M * (P + ONE_X) - M * P  =  M * ONE_X  =   | d |
                                               | 0 |

or

    ((a * (x_dst + 0x10000) + b * y_dst + 0x8000) / 0x10000 + c) -
    ((a * x_dst             + b * y_dst + 0x8000) / 0x10000 + c)  =  a

It's easy to see that this is equivalent to

    a + ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c) -
        ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)  =  a

Which means that stepping exactly by one pixel horizontally in the destination image space (advancing 'x_dst' by 0x10000) is the same as changing the transformed 'x_src' coordinate in the source image space exactly by 'a'. The same applies to the vertical direction too. Repeating these steps, we can reach any pixel in the source image space and get exactly the same fixed point coordinates as doing matrix multiplications per each pixel.

By the way, the older matrix multiplication implementation, which was relying on less accurate calculations with three intermediate roundings "((a + 0x8000) >> 16) + ((b + 0x8000) >> 16) + ((c + 0x8000) >> 16)", also has the same properties. However reverting
http://cgit.freedesktop.org/pixman/commit/?id=ed39992564beefe6b12f81e842caba11aff98a9c
and applying this "Remove the 8e extra safety margin in COVER_CLIP analysis" patch makes the cover test fail. The real reason why it fails is that the old pixman code was using the "pixman_transform_point_3d()" function
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n49
for getting the transformed coordinate of the top left corner pixel in the image scaling code, but at the same time using a different "pixman_transform_point()" function
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n82
in the extents calculation code for setting the cover flag. And these functions did the intermediate rounding differently. That's why the 8e safety margin was needed.

** proof ends

However, for COVER_CLIP_NEAREST, the actual margins added were not 8e. Because the half-way cases round down, that is, coordinate 0 hits pixel index -1 while coordinate e hits pixel index 0, the extra safety margins were actually 7e to the left and up, and 9e to the right and down. This patch removes the 7e and 9e margins and restores the -e adjustment required for NEAREST sampling in Pixman. For reference, see pixman/rounding.txt.

For COVER_CLIP_BILINEAR, the margins were exactly 8e as there are no additional offsets to be restored, so simply removing the 8e additions is enough.

Proof:

All implementations must give the same numerical results as bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear(). The former does

    int x0 = pixman_fixed_to_int (x - pixman_fixed_e);

which maps directly to the new test for the nearest flag, when you consider that x0 must fall in the interval [0, width). The latter does

    x1 = x - pixman_fixed_1 / 2;
    x1 = pixman_fixed_to_int (x1);
    x2 = x1 + 1;

When you write a COVER path, you take advantage of the assumption that both x1 and x2 fall in the interval [0, width). As samplers are allowed to fetch the pixel at x2 unconditionally, we require

    x1 >= 0
    x2 < width

so

    x - pixman_fixed_1 / 2 >= 0
    x - pixman_fixed_1 / 2 + pixman_fixed_1 < width * pixman_fixed_1

so

    pixman_fixed_to_int (x - pixman_fixed_1 / 2) >= 0
    pixman_fixed_to_int (x + pixman_fixed_1 / 2) < width

which matches the source code lines for the bilinear case, once you delete the lines that add the 8e margin.

Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: adjusted commit message, left affine-bench changes for another patch]
[Pekka: add commit message parts from Siarhei]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
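A hedged sketch of the resulting cover checks, paraphrasing the conditions derived above rather than quoting pixman.c verbatim; x1/x2 stand for the lowest and highest transformed coordinates the operation touches.

    #include <stdint.h>

    typedef int32_t pixman_fixed_t;

    #define pixman_fixed_e          ((pixman_fixed_t) 1)
    #define pixman_fixed_1          ((pixman_fixed_t) 1 << 16)
    #define pixman_fixed_to_int(f)  ((int) ((f) >> 16))

    /* NEAREST: coordinate x maps to pixel index
     * pixman_fixed_to_int (x - pixman_fixed_e), which must stay in [0, width). */
    static int
    covers_nearest (pixman_fixed_t x1, pixman_fixed_t x2, int width)
    {
        return pixman_fixed_to_int (x1 - pixman_fixed_e) >= 0 &&
               pixman_fixed_to_int (x2 - pixman_fixed_e) < width;
    }

    /* BILINEAR: both pixels fetched around x must stay in [0, width),
     * i.e. the two inequalities derived in the proof above. */
    static int
    covers_bilinear (pixman_fixed_t x1, pixman_fixed_t x2, int width)
    {
        return pixman_fixed_to_int (x1 - pixman_fixed_1 / 2) >= 0 &&
               pixman_fixed_to_int (x2 + pixman_fixed_1 / 2) < width;
    }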
2015-09-25 | pixman-general: Tighten up calculation of temporary buffer sizes | Ben Avison | 1 | -2/+2

Each of the aligns can only add a maximum of 15 bytes to the space requirement. This permits some edge cases to use the stack buffer where previously it would have deduced that a heap buffer was required.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
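A hedged sketch of the accounting this describes (the constant names are illustrative, not the pixman-general.c identifiers): with 16-byte alignment, each of the three scanline buffers can waste at most 15 bytes, so the worst case is bounded accordingly before falling back to the heap.

    #include <stddef.h>

    #define SCANLINE_BUFFER_LENGTH 8192   /* illustrative stack buffer size */
    #define ALIGNMENT 16

    /* Each aligned buffer adds at most ALIGNMENT - 1 = 15 bytes of padding,
     * so the tightened bound is 3 * (bytes_per_buffer + 15). */
    static int
    needs_heap_buffer (size_t bytes_per_buffer)
    {
        size_t worst_case = 3 * (bytes_per_buffer + (ALIGNMENT - 1));

        return worst_case > SCANLINE_BUFFER_LENGTH;
    }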
2015-09-22 | pixman-general: Fix stack related pointer arithmetic overflow | Siarhei Siamashka | 1 | -9/+7

As https://bugs.freedesktop.org/show_bug.cgi?id=92027#c6 explains, the stack is allocated at the very top of the process address space in some configurations (32-bit x86 systems with ASLR disabled). And the careless computations done with the 'dest_buffer' pointer may overflow, failing the buffer upper limit check.

The problem can be reproduced using the 'stress-test' program, which segfaults when executed via setarch:

    export CFLAGS="-O2 -m32" && ./autogen.sh
    ./configure --disable-libpng --disable-gtk && make
    setarch i686 -R test/stress-test

This patch introduces the required corrections. The extra check for negative 'width' may be redundant (the invalid 'width' value is not supposed to reach here), but it's better to play safe when dealing with the buffers allocated on stack.

Reported-by: Ludovic Courtès <ludo@gnu.org>
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: soren.sandmann@gmail.com
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
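A hedged sketch of the general idea (not the actual pixman-general.c hunk): compare remaining room instead of forming "pointer + size" expressions that can wrap past the top of the address space.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only. Instead of checking
     *     dest_buffer + width * 4 <= stack_buffer_end     (may overflow)
     * compute how much room is left and compare sizes. */
    static int
    fits_in_stack_buffer (const uint8_t *dest_buffer,
                          const uint8_t *stack_buffer_end,
                          int            width)
    {
        size_t room = (size_t) (stack_buffer_end - dest_buffer);

        if (width < 0)          /* play safe with invalid widths */
            return 0;

        return (size_t) width <= room / 4;   /* 4 bytes per pixel here */
    }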
2015-09-20 | test: add a check for FE_DIVBYZERO | Thomas Petazzoni | 2 | -0/+7

Some architectures, such as Microblaze and Nios2, currently do not implement FE_DIVBYZERO, even though they have <fenv.h> and feenableexcept(). This commit adds a configure.ac check to verify whether FE_DIVBYZERO is defined or not, and if not, disables the problematic code in test/utils.c.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
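A hedged sketch of what such a guard typically looks like in test code; the HAVE_FEDIVBYZERO macro name is an assumption here, not necessarily the symbol configure.ac defines.

    #define _GNU_SOURCE   /* feenableexcept() is a GNU extension */
    #include <fenv.h>

    static void
    enable_divbyzero_trap (void)
    {
    #if defined (HAVE_FEDIVBYZERO) && defined (FE_DIVBYZERO)
        /* Only trap division by zero where the C library provides both
         * feenableexcept() and the FE_DIVBYZERO flag. */
        feenableexcept (FE_DIVBYZERO);
    #endif
    }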
2015-09-18 | vmx: Remove unused expensive functions | Oded Gabbay | 1 | -196/+0

Now that we replaced the expensive functions with better performing alternatives, we should remove them so they will not be used again.

Running Cairo benchmark on trimmed traces gave the following results: POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.

Speedups
========
    t-firefox-scrolling    1232.30 -> 1096.55 : 1.12x
    t-gnome-terminal-vim    613.86 ->  553.10 : 1.11x
    t-evolution             405.54 ->  371.02 : 1.09x
    t-firefox-talos-gfx     919.31 ->  862.27 : 1.07x
    t-gvim                  653.02 ->  616.85 : 1.06x
    t-firefox-canvas-alpha  941.29 ->  890.42 : 1.06x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-09-18 | vmx: implement fast path vmx_composite_over_n_8_8888 | Oded Gabbay | 1 | -0/+111

POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.

reference memcpy speed = 25008.9MB/s (6252.2MP/s for 32bpp fills)

            Before    After     Change
    ---------------------------------------------
    L1      91.32     182.84    +100.22%
    L2      94.94     182.83    +92.57%
    M       95.55     181.51    +89.96%
    HT      88.96     162.09    +82.21%
    VT      87.4      168.35    +92.62%
    R       83.37     146.23    +75.40%
    RT      66.4      91.5      +37.80%
    Kops/s  683       859       +25.77%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-09-18 | vmx: optimize vmx_composite_over_n_8888_8888_ca | Oded Gabbay | 1 | -31/+21

This patch optimizes vmx_composite_over_n_8888_8888_ca by removing use of expand_alpha_1x128, unpack/pack and in_over_2x128 in favor of splat_alpha, in_over and MUL/ADD macros from pixman_combine32.h.

Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le gave the following results:

reference memcpy speed = 23475.4MB/s (5868.8MP/s for 32bpp fills)

            Before    After     Change
    --------------------------------------------
    L1      244.97    474.05    +93.51%
    L2      243.74    473.05    +94.08%
    M       243.29    467.16    +92.02%
    HT      144.03    252.79    +75.51%
    VT      174.24    279.03    +60.14%
    R       109.86    149.98    +36.52%
    RT      47.96     53.18     +10.88%
    Kops/s  524       576       +9.92%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-09-18 | vmx: optimize scaled_nearest_scanline_vmx_8888_8888_OVER | Oded Gabbay | 1 | -62/+17

This patch optimizes scaled_nearest_scanline_vmx_8888_8888_OVER and all the functions it calls (combine1, combine4 and core_combine_over_u_pixel_vmx). The optimization is done by removing use of expand_alpha_1x128 and expand_alpha_2x128 in favor of splat_alpha and MUL/ADD macros from pixman_combine32.h.

Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le gave the following results:

reference memcpy speed = 24847.3MB/s (6211.8MP/s for 32bpp fills)

            Before    After     Change
    --------------------------------------------
    L1      182.05    210.22    +15.47%
    L2      180.6     208.92    +15.68%
    M       180.52    208.22    +15.34%
    HT      130.17    178.97    +37.49%
    VT      145.82    184.22    +26.33%
    R       104.51    129.38    +23.80%
    RT      48.3      61.54     +27.41%
    Kops/s  430       504       +17.21%

v2: Check *pm is not NULL before dereferencing it in combine1()

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-09-17 | armv6: enable over_n_8888 | Pekka Paalanen | 1 | -5/+4

Enable the fast path added in the previous patch by moving the lookup table entries to their proper locations.

Lowlevel-blt-bench benchmark statistics with 30 iterations, showing the effect of adding this one patch on top of "armv6: Add over_n_8888 fast path (disabled)", which was applied on fd595692941f3d9ddea8934462bd1d18aed07c65.

            Before          After
         Mean   StdDev   Mean   StdDev   Confidence   Change
    L1   12.5   0.04     45.2   0.10     100.00%      +263.1%
    L2   11.1   0.02     43.2   0.03     100.00%      +289.3%
    M     9.4   0.00     42.4   0.02     100.00%      +351.7%
    HT    8.5   0.02     25.4   0.10     100.00%      +198.8%
    VT    8.4   0.02     22.3   0.07     100.00%      +167.0%
    R     8.2   0.02     23.1   0.09     100.00%      +183.6%
    RT    5.4   0.05     11.4   0.21     100.00%      +110.3%

At most 3 outliers rejected per test per set.

Iterating here means that lowlevel-blt-bench was executed 30 times, and the statistics above were computed from the output.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
2015-09-17 | armv6: Add over_n_8888 fast path (disabled) | Ben Avison | 2 | -0/+48

This new fast path is initially disabled by putting the entries in the lookup table after the sentinel. The compiler cannot tell the new code is not used, so it cannot eliminate the code. Also the lookup table size will include the new fast path.

When the follow-up patch then enables the new fast path, the binary layout (alignments, size, etc.) will stay the same compared to the disabled case.

Keeping the binary layout identical is important for benchmarking on Raspberry Pi 1. The addresses at which functions are loaded will have a significant impact on benchmark results, causing unexpected performance changes. Keeping all function addresses the same across the patch enabling a new fast path improves the reliability of benchmarks.

Benchmark results are included in the patch enabling this fast path.

[Pekka: disabled the fast path, commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
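A hedged sketch of the sentinel trick described above; the struct layout and names are illustrative, not the exact pixman-arm-simd.c identifiers. Entries placed after the all-zero sentinel are compiled in and counted in the table's size, but the lookup never reaches them.

    /* Illustrative fast-path table layout. */
    typedef struct
    {
        int   op, src_format, mask_format, dest_format;
        void *func;
    } fast_path_t;

    #define PIXMAN_OP_NONE 0

    static const fast_path_t arm_simd_fast_paths[] =
    {
        /* ... enabled fast paths ... */

        { PIXMAN_OP_NONE },   /* sentinel: the lookup loop stops here */

        /* Disabled entry lives after the sentinel; the follow-up patch
         * enables it by moving it above the sentinel, leaving the binary
         * layout (size, alignment) unchanged. */
    };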
2015-09-16 | test: Add cover-test v5 | Ben Avison | 2 | -0/+450

This test aims to verify both numerical correctness and the honouring of array bounds for scaled plots (both nearest-neighbour and bilinear) at or close to the boundary conditions for applicability of "cover" type fast paths and iter fetch routines.

It has a secondary purpose: by setting the env var EXACT (to any value) it will only test plots that are exactly on the boundary condition. This makes it possible to ensure that "cover" routines are being used to the maximum, although this requires the use of a debugger or code instrumentation to verify.

Changes in v4:

- Check the fence page size and skip the test if it is too large. Since we need to deal with pixman_fixed_t coordinates that go beyond the real image width, make the page size limit 16 kB. A 32 kB or larger page size would cause an a8 image width to be 32k or more, which is no longer representable in pixman_fixed_t.
- Use a shorthand variable 'filter' in test_cover().
- Whitespace adjustments.

Changes in v5:

- Skip if fenced memory is not supported. Do you know of any such platform?

Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: changes in v4 and v5]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-09-09 | implementation: add PIXMAN_DISABLE=wholeops | Pekka Paalanen | 1 | -0/+16

Add a new option to PIXMAN_DISABLE: "wholeops". This option disables all whole-operation fast paths regardless of implementation level, except the general path (general_composite_rect).

The purpose is to add a debug option that allows us to test optimized iterator paths specifically. With this, it is possible to:
- see whether fast paths mask bugs in iterators
- compare fast paths with iterator paths for performance

The effect was tested on x86_64 by running:

    $ PIXMAN_DISABLE='' ./test/lowlevel-blt-bench over_8888_8888
    $ PIXMAN_DISABLE='wholeops' ./test/lowlevel-blt-bench over_8888_8888

In the first case time is spent in sse2_composite_over_8888_8888(), and in the latter in sse2_combine_over_u().

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-09-09 | utils.[ch]: add fence_get_page_size() | Pekka Paalanen | 2 | -3/+22

Add a function to get the page size used for memory fence purposes, and use it everywhere where getpagesize() was used.

This offers a single point in code to override the page size, in case one wants to experiment how the tests work with a higher page size than what the developer's machine has. This also offers a clean API, without adding #ifdefs, to tests for checking the page size.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
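A hedged sketch of what such a wrapper can look like; the override variable shown here is an assumption for illustration, not necessarily how test/utils.c implements the hook.

    #include <unistd.h>

    /* Single point to override the page size used for fence purposes,
     * e.g. to emulate a larger-page system on a 4 kB-page machine. */
    static long fence_page_size_override = 0;   /* 0 = use the real size */

    long
    fence_get_page_size (void)
    {
        if (fence_page_size_override > 0)
            return fence_page_size_override;

        return getpagesize ();
    }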
2015-09-09 | utils.c: fix fallback code for fence_image_create_bits() | Pekka Paalanen | 1 | -1/+1

Used a wrong variable name, causing:

    /home/pq/git/pixman/demos/../test/utils.c: In function ‘fence_image_create_bits’:
    /home/pq/git/pixman/demos/../test/utils.c:562:46: error: ‘width’ undeclared (first use in this function)

Use the correct variable.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
2015-09-03 | test: add fence-image-self-test | Pekka Paalanen | 2 | -0/+238

Tests that fence_malloc and fence_image_create_bits actually work: that out-of-bounds and out-of-row (unused stride area) accesses trigger SIGSEGV. If fence_malloc is a dummy (FENCE_MALLOC_ACTIVE not defined), this test is skipped.

Changes in v2:
- check FENCE_MALLOC_ACTIVE value, not whether it is defined
- test that reading bytes near the fence pages does not cause a segmentation fault

Changes in v3:
- Do not print progress messages unless VERBOSE environment variable is set. Avoid spamming the terminal output of 'make check' on some versions of autotools.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
2015-09-01 | utils.[ch]: add fence_image_create_bits () | Pekka Paalanen | 2 | -0/+112

Useful for detecting out-of-bounds accesses in composite operations. This will be used by follow-up patches adding new tests.

Changes in v2:
- fix style on fence_image_create_bits args
- add page to stride only if stride_fence
- add comment on the fallback definition about freeing storage

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
2015-09-01 | utils.[ch]: add FENCE_MALLOC_ACTIVE | Pekka Paalanen | 2 | -3/+14

Define a new token to simplify checking whether fence_malloc() actually can catch out-of-bounds access. This will be used in the future to skip tests that rely on fence_malloc checking functionality.

Changes in v2:
- #define FENCE_MALLOC_ACTIVE always, but change its value to help catch use of it without including utils.h

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
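A hedged sketch of the v2 idea: the token is always defined with a 0/1 value and tested in ordinary C code, so forgetting to include utils.h becomes a compile error instead of an #ifdef silently skipping. The HAVE_* guards below are assumptions about what the real utils.h checks.

    /* In utils.h (sketch): always define the token with a value. */
    #if defined (HAVE_MPROTECT) && defined (HAVE_GETPAGESIZE)
    #define FENCE_MALLOC_ACTIVE 1
    #else
    #define FENCE_MALLOC_ACTIVE 0
    #endif

    /* In a test (sketch): check the value in C code, not the preprocessor.
     * Without the include, FENCE_MALLOC_ACTIVE is an undeclared identifier
     * and the compiler reports it. */
    int
    main (void)
    {
        if (!FENCE_MALLOC_ACTIVE)
            return 77;   /* SKIP: fence_malloc() cannot catch overruns */

        return 0;
    }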
2015-08-28 | scaling-test: list more details when verbose | Ben Avison | 1 | -22/+44

Add mask details to the output.

[Pekka: redo whitespace and print src, dst, mask x and y.]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
2015-08-18 | lowlevel-blt-bench: make extra arguments an error | Pekka Paalanen | 1 | -0/+6

If a user gives multiple patterns or extra arguments, only the last one was used as the pattern while the rest were silently ignored. This is a user error silently converted into something possibly unexpected.

In the presence of extra arguments, complain and quit.

Cc: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
2015-08-01 | Post-release version bump to 0.33.3 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

2015-08-01 | Pre-release version bump to 0.33.2 | Oded Gabbay | 1 | -1/+1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2015-07-16 | vmx: implement fast path iterator vmx_fetch_a8 | Oded Gabbay | 1 | -0/+46

No changes were observed when running cairo trimmed benchmarks.

Running "lowlevel-blt-bench src_8_8888" on POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le gave the following results:

reference memcpy speed = 25197.2MB/s (6299.3MP/s for 32bpp fills)

            Before    After      Change
    --------------------------------------------
    L1      965.34    3936       +307.73%
    L2      942.99    3436.29    +264.40%
    M       902.24    2757.77    +205.66%
    HT      448.46    784.99     +75.04%
    VT      430.05    819.78     +90.62%
    R       412.9     717.04     +73.66%
    RT      168.93    220.63     +30.60%
    Kops/s  1025      1303       +27.12%

It was benchmarked against commit id e2d211a from pixman/master.

Siarhei Siamashka reported that on playstation3, it shows the following results:

    == before ==
    src_8_8888 =  L1: 194.37  L2: 198.46  M:155.90 (148.35%)
                  HT: 59.18  VT: 36.71  R: 38.93  RT: 12.79 ( 106Kops/s)

    == after ==
    src_8_8888 =  L1: 373.96  L2: 391.10  M:245.81 (233.88%)
                  HT: 80.81  VT: 44.33  R: 48.10  RT: 14.79 ( 122Kops/s)

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path iterator vmx_fetch_x8r8g8b8 | Oded Gabbay | 1 | -0/+48

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

cairo trimmed benchmarks:

Speedups
========
    t-firefox-asteroids  533.92 -> 489.94 : 1.09x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path scaled nearest vmx_8888_8888_OVER | Oded Gabbay | 1 | -0/+128

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

            Before    After     Change
    ---------------------------------------------
    L1      134.36    181.68    +35.22%
    L2      135.07    180.67    +33.76%
    M       134.6     180.51    +34.11%
    HT      121.77    128.79    +5.76%
    VT      120.49    145.07    +20.40%
    R       93.83     102.3     +9.03%
    RT      50.82     46.93     -7.65%
    Kops/s  448       422       -5.80%

cairo trimmed benchmarks:

Speedups
========
    t-firefox-asteroids  533.92 -> 497.92 : 1.07x
    t-midori-zoomed      692.98 -> 651.24 : 1.06x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path vmx_composite_src_x888_8888 | Oded Gabbay | 1 | -0/+60

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

            Before     After      Change
    ---------------------------------------------
    L1      1115.4     5006.49    +348.85%
    L2      1112.26    4338.01    +290.02%
    M       1110.54    2524.15    +127.29%
    HT      745.41     1140.03    +52.94%
    VT      749.03     1287.13    +71.84%
    R       423.91     547.6      +29.18%
    RT      205.79     194.98     -5.25%
    Kops/s  1414       1361       -3.75%

cairo trimmed benchmarks:

Speedups
========
    t-gnome-system-monitor  1402.62 -> 1212.75 : 1.16x
    t-firefox-asteroids      533.92 ->  474.50 : 1.13x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path vmx_composite_over_n_8888_8888_ca | Oded Gabbay | 1 | -0/+112

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

            Before    After     Change
    ---------------------------------------------
    L1      61.92     244.91    +295.53%
    L2      62.74     243.3     +287.79%
    M       63.03     241.94    +283.85%
    HT      59.91     144.22    +140.73%
    VT      59.4      174.39    +193.59%
    R       53.6      111.37    +107.78%
    RT      37.99     46.38     +22.08%
    Kops/s  436       506       +16.06%

cairo trimmed benchmarks:

Speedups
========
    t-xfce4-terminal-a1  1540.37 -> 1226.14 : 1.26x
    t-firefox-talos-gfx  1488.59 -> 1209.19 : 1.23x

Slowdowns
=========
    t-evolution           553.88 ->  581.63 : 1.05x
    t-poppler             364.99 ->  383.79 : 1.05x
    t-firefox-scrolling  1223.65 -> 1304.34 : 1.07x

The slowdowns can be explained in cases where the images are small and un-aligned to a 16-byte boundary. In that case, the function will first work on the un-aligned area, even in operations of 1 byte. In the case of small images, the overhead of such operations can be more than the savings we get from using the vmx instructions that are done on the aligned part of the image.

In the C fast-path implementation, there is no special treatment for the un-aligned part, as it works in 4 byte quantities on the entire image.

Because llbb is a synthetic test, I would assume it has much fewer alignment issues than a "real-world" scenario, such as the cairo benchmarks, which are basically recorded traces of real application activity.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path composite_add_8888_8888 | Oded Gabbay | 1 | -0/+27

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 16 cores, 3.4GHz, ppc64le:

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

            Before    After      Change
    ---------------------------------------------
    L1      248.76    3284.48    +1220.34%
    L2      264.09    2826.47    +970.27%
    M       261.24    2405.06    +820.63%
    HT      217.27    857.3      +294.58%
    VT      213.78    980.09     +358.46%
    R       176.61    442.95     +150.81%
    RT      107.54    150.08     +39.56%
    Kops/s  917       1125       +22.68%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path composite_add_8_8 | Oded Gabbay | 1 | -0/+55

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 16 cores, 3.4GHz, ppc64le:

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

            Before    After      Change
    ---------------------------------------------
    L1      687.63    9140.84    +1229.33%
    L2      715       7495.78    +948.36%
    M       717.39    8460.14    +1079.29%
    HT      569.56    1020.12    +79.11%
    VT      520.3     1215.56    +133.63%
    R       514.81    874.35     +69.84%
    RT      341.28    305.42     -10.51%
    Kops/s  1621      1579       -2.59%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path composite_over_8888_8888 | Oded Gabbay | 1 | -0/+30

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master.

POWER8, 16 cores, 3.4GHz, ppc64le:

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

            Before    After      Change
    ---------------------------------------------
    L1      129.47    1054.62    +714.57%
    L2      138.31    1011.02    +630.98%
    M       139.99    1008.65    +620.52%
    HT      122.11    468.45     +283.63%
    VT      121.06    532.21     +339.62%
    R       108.48    240.5      +121.70%
    RT      77.87     116.7      +49.87%
    Kops/s  758       981        +29.42%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: implement fast path vmx_fill | Oded Gabbay | 1 | -0/+153

Based on the sse2 implementation.

It was benchmarked against commit id e2d211a from pixman/master.

Tested cairo trimmed benchmarks on POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le:

speedups
========
    t-swfdec-giant-steps       1383.09 ->  718.63 : 1.92x speedup
    t-gnome-system-monitor     1403.53 ->  918.77 : 1.53x speedup
    t-evolution                 552.34 ->  415.24 : 1.33x speedup
    t-xfce4-terminal-a1        1573.97 -> 1351.46 : 1.16x speedup
    t-firefox-paintball         847.87 ->  734.50 : 1.15x speedup
    t-firefox-asteroids         565.99 ->  492.77 : 1.15x speedup
    t-firefox-canvas-swscroll  1656.87 -> 1447.48 : 1.14x speedup
    t-midori-zoomed             724.73 ->  642.16 : 1.13x speedup
    t-firefox-planet-gnome      975.78 ->  911.92 : 1.07x speedup
    t-chromium-tabs             292.12 ->  274.74 : 1.06x speedup
    t-firefox-chalkboard        690.78 ->  653.93 : 1.06x speedup
    t-firefox-talos-gfx        1375.30 -> 1303.74 : 1.05x speedup
    t-firefox-canvas-alpha     1016.79 ->  967.24 : 1.05x speedup

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
2015-07-16 | vmx: add helper functions | Oded Gabbay | 1 | -0/+476

This patch adds the following helper functions for code reuse, for hiding BE/LE differences, and for maintainability. All of the functions were defined as static force_inline.

Names were copied from pixman-sse2.c so conversion of fast paths between sse2 and vmx would be easier from now on. Therefore, I tried to keep the input/output of the functions as close as possible to the sse2 definitions.

The functions are:

- load_128_aligned : load 128-bit from a 16-byte aligned memory address into a vector
- load_128_unaligned : load 128-bit from memory into a vector, without guarantee of alignment for the source pointer
- save_128_aligned : save 128-bit vector into a 16-byte aligned memory address
- create_mask_16_128 : take a 16-bit value and fill with it a new vector
- create_mask_1x32_128 : take a 32-bit pointer and fill a new vector with the 32-bit value from that pointer
- create_mask_32_128 : take a 32-bit value and fill with it a new vector
- unpack_32_1x128 : unpack 32-bit value into a vector
- unpacklo_128_16x8 : unpack the eight low 8-bit values of a vector
- unpackhi_128_16x8 : unpack the eight high 8-bit values of a vector
- unpacklo_128_8x16 : unpack the four low 16-bit values of a vector
- unpackhi_128_8x16 : unpack the four high 16-bit values of a vector
- unpack_128_2x128 : unpack the eight low 8-bit values of a vector into one vector and the eight high 8-bit values into another vector
- unpack_128_2x128_16 : unpack the four low 16-bit values of a vector into one vector and the four high 16-bit values into another vector
- unpack_565_to_8888 : unpack an RGB_565 vector to an 8888 vector
- pack_1x128_32 : pack a vector and return the LSB 32-bit of it
- pack_2x128_128 : pack two vectors into one and return it
- negate_2x128 : xor two vectors with mask_00ff (separately)
- is_opaque : returns whether all the pixels contained in the vector are opaque
- is_zero : returns whether the vector equals 0
- is_transparent : returns whether all the pixels contained in the vector are transparent
- expand_pixel_8_1x128 : expand an 8-bit pixel into the lower 8 bytes of a vector
- expand_alpha_1x128 : expand alpha from a vector and return the new vector
- expand_alpha_2x128 : expand alpha from one vector and another alpha from a second vector
- expand_alpha_rev_2x128 : expand a reversed alpha from one vector and another reversed alpha from a second vector
- pix_multiply_2x128 : do pix_multiply for two vectors (separately)
- over_2x128 : perform the over op. on two vectors
- in_over_2x128 : perform the in-over op. on two vectors

v2: removed expand_pixel_32_1x128 as it was not used by any function and its implementation was erroneous

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
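A hedged sketch of what two of these helpers can look like with AltiVec intrinsics (illustrative only, not the pixman-vmx.c definitions; plain `static inline` stands in for pixman's force_inline macro).

    #include <altivec.h>

    /* load_128_aligned: load 128 bits from a 16-byte aligned address. */
    static inline vector unsigned int
    load_128_aligned (const unsigned int *src)
    {
        return vec_ld (0, src);
    }

    /* save_128_aligned: store a 128-bit vector to a 16-byte aligned
     * address. */
    static inline void
    save_128_aligned (unsigned int *dst, vector unsigned int data)
    {
        vec_st (data, 0, dst);
    }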
2015-07-16 | vmx: add LOAD_VECTOR macro | Oded Gabbay | 1 | -26/+24

This patch adds a macro for loading a single vector. It also makes the other LOAD_VECTORx macros use this macro as a base so code would be re-used.

In addition, I fixed minor coding style issues.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>