Age | Commit message (Collapse) | Author | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
MSVC complains about uint32_t being used as an expression:
composite.c(902) : error C2275: 'uint32_t' : illegal use of this type
as an expression
|
|
OS/2 doesn't have a working mmap().
|
|
|
|
|
|
|
|
|
|
pixman-arm-neon-asm-bilinear.S
Enable fast paths which is supported by scanline functions in
pixman-arm-neon-asm-bilinear.S
|
|
General fetch->combine->store based bilinear scanline functions.
Need further optimizations and eventually will be replaced with optimal
functions one by one.
General functions should be located in pixman-arm-neon-asm-bilinear.S and
optimal functions in pixman-arm-neon-asm.S
Following general bilinear scanline functions are implemented
over_8888_8888
add_8888_8888
src_8888_8_8888
src_8888_8_0565
src_0565_8_x888
src_0565_8_0565
over_8888_8_8888
add_8888_8_8888
|
|
Defining PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_A8_DST macro for declaration of
scaled bilinear scanline functions in common header.
|
|
Previously, this function would do coordinate calculations in such a
way that (x_dst, y_dst) would only affect the alignment of the source
image, but not of the traps, which would always be considered to be in
absolute destination coordinates. This is unlike the
pixman_image_composite() function which also registers the mask to the
destination.
This patch makes it so that traps are also offset by (x_dst, y_dst).
Also add a comment explaining how this function is supposed to
operate, and update tri-test.c and composite-trap-test.c to deal with
the new semantics.
|
|
This improves the performance of the firefox-talos-gfx benchmark with
the image16 backend. Benchmark on an 800 MHz ARM Cortex A8:
Before:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image16 firefox-talos-gfx 121.773 122.218 0.15% 6/6
After:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image16 firefox-talos-gfx 85.247 85.563 0.22% 6/6
V2: Slightly better instruction scheduling based on comments from Taekyun Kim.
V3: Eliminate all stalls from the inner loop. Also based on comments from Taekyun Kim.
|
|
PIXMAN_LINK_WITH_ENV did not fail unless -Wall -Werror is used.
So even when the compiler did not support OpenMP, USE_OPENMP was defined.
Fix that by running the second OpenMP test only when first AC_OPENMP find supported
configure tested in the cases :
gcc without libgomp support, no openmp option, --enable-openmp and --disable-openmp
gcc with libgomp support, no openmp option, --enable-openmp and --disable-openmp
Not tested with autoconf version not knowing openmp (<2.62)
Warn when --enable-openmp is requested but no support is found
Signed-off-by: Gilles Espinasse <g.esp@free.fr>
|
|
Use the correct variable name
Signed-off-by: Gilles Espinasse <g.esp@free.fr>
|
|
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=33.59 MPix/s
after: op=1, src=20028888, dst=10020565, speed=46.25 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=63.86 MPix/s
after: op=1, src=20028888, dst=10020565, speed=84.22 MPix/s
|
|
Performance of the inner loop when working with the data in L1 cache:
ARM Cortex-A8: 41 cycles per 4 pixels (no stalls and partial dual issue)
ARM Cortex-A9: 48 cycles per 4 pixels (no stalls)
It might be still possible to improve performance even more on ARM Cortex-A8
with a better use of dual issue.
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=40.38 MPix/s
after: op=1, src=20028888, dst=20028888, speed=48.47 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=79.68 MPix/s
after: op=1, src=20028888, dst=20028888, speed=93.11 MPix/s
|
|
Now an extra 'flag' parameter is supported in bilinear scaline scaling
function generation macro. It can be used to enable 4 or 8 pixels per
loop iteration unrolling and provide save/restore code for d8-d15
registers.
|
|
This reduces code size and also puts less pressure on the
instruction decoder.
|
|
Now it's possible to override the main loop of bilinear scaling code
with optimized pipelined implementation.
|
|
|
|
Moving horizontal interpolation weights update instructions from the
beginning of loop to its end allows to hide some pipeline stalls and
improve performance.
|
|
Instead of two
mvn d24, d24
mvn d25, d25
use just one
mvn q12, q12
Also move another vmvn instruction into the created pipeline bubble,
as pointed out by Siarhei.
|
|
Up until now, all pixman release, both snapshots and releases were
uploaded to the "releases" directory on www.cairographics.org, but
it's better to development snapshots in the "snapshots" directory.
This patch changes Makefile.am to do that.
|
|
When run in PIXMAN_RANDOMIZE_TESTS mode, this test would go into an
infinite loop because the loop started at 'seed' but the stop
condition was still N_TESTS.
|
|
|
|
This format is particularly useful on big-endian architectures, where RGBA in
memory/file order corresponds to r8g8b8a8 as an uint32_t. This is important
because RGBA is in some cases the only available choice (for example as a pixel
format in OpenGL ES 2.0).
|
|
This patch makes so that composite and stress-test will start from a
random seed if the PIXMAN_RANDOMIZE_TESTS environment variable is
set. Running the test suite in this mode is useful to get more test
coverage.
Also, in stress-test.c make it so that setting the initial seed causes
threads to be turned off. This makes it much easier to see when
something fails.
|
|
All of the information previously passed to the iterator initializers
is now available in the iterator itself, so there is no need to pass
it as arguments anymore.
|
|
This makes _pixman_implementation_{src,dest}_iter_init() responsible
for filling parts of the information in the iterators. Specifically,
the information passed as arguments is stored in the iterator.
Also add a height field to pixman_iter_t().
|
|
There is no reason to go through
_pixman_implementation_{src,dest}_iter_init(), especially since
_pixman_implementation_src_iter_init() is doing various other checks
that only need to be done once.
Also call delegate->src_iter_init() directly in pixman-sse2.c
|
|
Instructions scheduling improved in the code responsible for fetching r5g6b5
pixels and converting them to the intermediate x8r8g8b8 color format used in
the interpolation part of code. Still a lot of NEON stalls are remaining,
which can be resolved later by the use of pipelining.
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
after: op=1, src=10020565, dst=10020565, speed=41.35 MPix/s
op=1, src=10020565, dst=20020888, speed=49.16 MPix/s
|
|
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=3.30 MPix/s
after: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
|
|
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=20020888, speed=3.39 MPix/s
after: op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
|
|
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=6.56 MPix/s
after: op=1, src=20028888, dst=10020565, speed=61.65 MPix/s
|
|
This is a cleanup for old and now duplicated code. The performance improvement
is mostly coming from the enabled use of software prefetch, but instructions
scheduling is also slightly better.
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=53.24 MPix/s
after: op=1, src=20028888, dst=20028888, speed=74.36 MPix/s
|
|
This allows to generate bilinear scanline scaling functions targeting
various source and destination color formats. Right now a8r8g8b8/x8r8g8b8
and r5g6b5 color formats are supported. More formats can be added if needed.
|
|
It can be reused in different ARM NEON bilinear scaling fast path functions.
|