Fixes non-deterministic failures in
dEQP-EGL.functional.sharing.gles2.multithread.simple_egl_sync.images.texture_source.teximage2d_render
and others in dEQP-EGL.functional.sharing.gles2.multithread.*
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
The current value was introduced in commit a27180d0d8666, which claims
that it represents ~1.11 years. However, it is interpreted in nanoseconds,
so it actually only represents ~9.8 hours. That seems a bit short.
Use the largest value consistent with both int32 and int64. It
corresponds to ~292 years in nanoseconds.
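For reference, assuming the new value is 2^63 - 1 (INT64_MAX):
   (2^63 - 1) ns = 9223372036854775807 ns
                 ≈ 9223372036854775807 / (1e9 * 60 * 60 * 24 * 365.25)
                 ≈ 292.3 years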
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
TODO: It looks like si_fence_server_sync still isn't strict enough
for unflushed fences.
|
|
st_flush should flush state-tracker-internal state and the pipe, but
not mesa/main state. Of the four callers:
- glFlush/glFinish already call FLUSH_{VERTICES,STATE}.
- st_vdpau doesn't need to call them.
- st_manager will now call them explicitly.
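A rough sketch of the resulting division of labor (names taken from the
list above, signatures simplified and not necessarily exact):
   /* caller side (e.g. what st_manager now does itself): */
   FLUSH_VERTICES(ctx, 0);     /* flush mesa/main vertex state */
   FLUSH_STATE(ctx);           /* flush mesa/main derived state */
   /* st_flush itself only flushes st-internal state and the pipe: */
   st_flush(st, fence, flags);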
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
There may be pending operations (e.g. vertices) that need to be flushed
by the state tracker.
Found by inspection.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Transfer commands can have associated GPU operations.
Enabled by passing GALLIUM_DDEBUG=transfers.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
This patch has multiple goals:
1. Off-load the writing of records in 'always' mode to another thread
for performance.
2. Allow using ddebug with threaded contexts. This really forces us to
move some of the "after_draw" handling into another thread.
3. Simplify the different modes of ddebug, both in the code and in
the user interface, i.e. GALLIUM_DDEBUG. In particular, there's
no 'pipelined' anymore, since we're always pipelined; and 'noflush'
is replaced by 'flush', since we no longer flush by default.
4. Fix the fences in pipelining mode. They previously relied on writes
via pipe_context::clear_buffer. However, on radeonsi, those could
(quite reasonably) end up in the SDMA buffer. So we use the newly
added PIPE_FLUSH_{TOP,BOTTOM}_OF_PIPE fences instead.
5. Improve pipelined mode overall, using the finer grained information
provided by the new fences.
Overall, the result is that pipelined mode should be more useful, and
using ddebug in default mode is much less invasive, in the sense that
it changes the overall driver behavior less (which is kind of crucial
for a driver debugging tool).
An example of the new hang debug output:
Gallium debugger active.
Hang detection timeout is 1000ms.
GPU hang detected, collecting information...
 Draw #  driver  prev BOP  TOP  BOP  dump file
 -------------------------------------------------------------
      2     YES       YES  YES   NO  /home/nha/ddebug_dumps/shader_runner_19919_00000000
      3     YES        NO  YES   NO  /home/nha/ddebug_dumps/shader_runner_19919_00000001
      4     YES        NO  YES   NO  /home/nha/ddebug_dumps/shader_runner_19919_00000002
      5     YES        NO  YES   NO  /home/nha/ddebug_dumps/shader_runner_19919_00000003
Done.
We can see that there were almost certainly 4 draws in flight when
the hang happened: the top-of-pipe fence was signaled for all 4 draws,
the bottom-of-pipe fence for none of them. In virtually all cases,
we'd expect the first draw in the list to be at fault, but due to the
GPU parallelism, it's possible (though highly unlikely) that one of
the later draws causes a component to get stuck in a way that prevents
the earlier draws from making progress as well.
(In the above example, there were actually only 3 draws truly in flight:
the last draw is a blit that waits for the earlier draws; however, its
top-of-pipe fence is emitted before the cache flush and wait, and so
the fact that the draw hasn't truly started yet can only be seen from a
closer inspection of GPU state.)
Acked-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Change format to %p while we're at it.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
v2: use uncached system memory for the fence, and use the CPU to
clear it so we never read garbage when checking the fence
|
|
v2: remove the change to si_fence_server_sync; we'll handle that more
robustly
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
For running post-draw operations inside the driver thread. ddebug will
use it.
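The hook might look roughly like this (a sketch; the actual interface may
differ):
   /* in pipe_context: run fn(data) from inside the driver thread;
      if asap is set, run it before any other queued work */
   void (*callback)(struct pipe_context *ctx,
                    void (*fn)(void *), void *data, bool asap);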
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Queries should still get marked as flushed when flushes are executed
asynchronously in the driver thread.
To this end, the management of the unflushed_queries list is moved into
the driver thread.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
This requires out-of-band creation of fences, and is indicated to the
pipe_context::flush implementation by a special TC_FLUSH_ASYNC flag.
v2:
- remove an incorrect assertion
- handle fence_server_sync for unsubmitted fences by
relying on the improved cs_add_fence_dependency
- only implement asynchronous flushes on amdgpu
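In outline, a driver flush implementation might handle the flag like this
(a sketch, not the actual amdgpu winsys code):
   static void drv_flush(struct pipe_context *ctx,
                         struct pipe_fence_handle **fence, unsigned flags)
   {
      if (flags & TC_FLUSH_ASYNC) {
         /* u_threaded_context already created *fence out of band;
            submit without synchronizing with the caller and signal the
            fence once the submission is queued up */
      } else {
         /* synchronous path: create and return the fence here */
      }
   }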
|
|
The driver uses (and must use) the flushed flag of queries as a hint that
it does not have to check for synchronization with currently queued up
commands. Deferred flushes do not actually flush queued up commands, so
we must not set the flushed flag for them.
Found by inspection.
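The rule, in code terms, is roughly this (a sketch with a hypothetical
helper name):
   /* only mark queries as flushed for real (non-deferred) flushes */
   if (!(flags & PIPE_FLUSH_DEFERRED))
      tc_mark_queries_flushed(tc);   /* hypothetical helper */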
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
The idea is to fix the following interleaving of operations
that can arise from deferred fences:
   Thread 1 / Context 1                  Thread 2 / Context 2
   --------------------                  --------------------
   f = deferred flush
        <------- application-side synchronization ------->
                                         fence_server_sync(f)
                                         ...
                                         flush()
   flush()
We will now stall in fence_server_sync until the flush of context 1
has completed.
This scenario was unlikely to occur previously, because applications
seem to be doing
   Thread 1 / Context 1                  Thread 2 / Context 2
   --------------------                  --------------------
   f = glFenceSync()
   glFlush()
        <------- application-side synchronization ------->
                                         glWaitSync(f)
... and indeed they probably *have* to use this ordering to avoid
deadlocks in the GLX model, where all GL operations conceptually
go through a single connection to the X server. However, it's less
clear whether applications have to do this with other WSI (i.e. EGL).
Besides, even this sequence of GL commands can be translated into
the Gallium-level sequence outlined above when Gallium threading
and asynchronous flushes are used. So it makes sense to be more
robust.
As a side effect, we no longer busy-wait on submission_in_progress.
We won't enable asynchronous flushes on radeon, but add a
cs_add_fence_dependency stub anyway to document the potential
issue.
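In pseudocode, the driver-side wait now behaves roughly like this (names
are illustrative, not the actual radeonsi/amdgpu code):
   static void drv_fence_server_sync(struct pipe_context *ctx,
                                     struct pipe_fence_handle *fence)
   {
      /* stall until the context that created the deferred fence has
         actually submitted it ... */
      wait_for_submission(fence);            /* illustrative helper */
      /* ... then record a genuine GPU-side dependency */
      ws->cs_add_fence_dependency(cs, fence);
   }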
|
|
These bits are intended to be used by the ddebug hang detection and are
named in analogy to the Vulkan stage bits (and the corresponding Radeon
pipeline event).
Hang detection needs fences on the granularity of individual commands,
which nothing else really covers. The closest alternative would have
been PIPE_QUERY_GPU_FINISHED, but (a) queries are a per-context object
and we really want a per-screen object, (b) queries don't offer a
wait with timeout, and (c) in any case, PIPE_QUERY_GPU_FINISHED is
meant to imply that GPU caches are flushed, which the new bits
explicitly aren't.
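A sketch of how the new bits might be used together with deferred flushes,
following the hang-detection use case described above:
   struct pipe_fence_handle *top = NULL, *bottom = NULL;
   pipe->flush(pipe, &top, PIPE_FLUSH_DEFERRED | PIPE_FLUSH_TOP_OF_PIPE);
   pipe->draw_vbo(pipe, &info);
   pipe->flush(pipe, &bottom, PIPE_FLUSH_DEFERRED | PIPE_FLUSH_BOTTOM_OF_PIPE);
   /* after a hang, poll with a timeout of 0: */
   bool started  = screen->fence_finish(screen, pipe, top, 0);
   bool finished = screen->fence_finish(screen, pipe, bottom, 0);
   /* started && !finished means the command reached the GPU but never
      completed */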
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Also document some subtleties of pipe_context::flush.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
v2:
- style fixes
- fix missing timeout handling in futex path
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
C11 threads were changed to use struct timespec instead of xtime, and
thrd_sleep got a second argument.
See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1554.htm and
http://en.cppreference.com/w/c/thread/{thrd_sleep,cnd_timedwait,mtx_timedlock}
Note that cnd_timedwait is spec'd to be relative to TIME_UTC / CLOCK_REALTIME.
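For example, under the new rules (standard C11 usage, not mesa-specific):
   struct timespec ts;
   timespec_get(&ts, TIME_UTC);   /* cnd_timedwait expects TIME_UTC-based time */
   ts.tv_sec += 1;
   cnd_timedwait(&cond, &mutex, &ts);

   struct timespec duration = { .tv_sec = 0, .tv_nsec = 1000000 };
   thrd_sleep(&duration, NULL);   /* new second argument: remaining-time out-param */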
v2: Fix Windows build errors. Tested with a default Appveyor config
that uses Visual Studio 2013. Judging from Brian's email and
random internet sources, Visual Studio 2015 does have timespec
and timespec_get, hence the _MSC_VER-based guard which I have
not tested.
Cc: Jose Fonseca <jfonseca@vmware.com>
Cc: Brian Paul <brianp@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
Cc: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
With Gallium threaded contexts, creating shader/compute states is
effectively a screen operation, so we should not use context state.
In particular, this allows us to avoid using the context's LLVM
TargetMachine.
This isn't an issue yet because u_threaded_context filters out non-async
debug callbacks, and we disable threaded contexts for debug contexts.
However, we may want to change that in the future.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Found by inspection.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Schedule one job for every thread, and wait on a barrier inside the job
execution function.
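Roughly (a sketch, assuming the util_barrier helpers):
   static void barrier_job(void *data, int thread_index)
   {
      util_barrier_wait((util_barrier *)data);  /* every worker parks here */
   }
   /* queue one barrier_job per worker thread; once all workers have
      reached the barrier, every previously queued job has completed */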
v2: avoid alloca (fixes Windows build error)
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
The #if guard is probably not 100% equivalent to the previous PIPE_OS
check, but if anything it should be an over-approximation (are there
pthread implementations without barriers?), so people will get either
a good implementation or compile errors that are easy to fix.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
v2: use util_vasprintf for Windows portability
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
Some locking is unfortunately required, because well-formed GL programs
can have multiple threads racing to access the same texture, e.g.: two
threads/contexts rendering from the same texture, or one thread destroying
a context while the other is rendering from or modifying a texture.
Since even the simple mutex caused noticeable slowdowns in the piglit
drawoverhead micro-benchmark, this patch uses a slightly more involved
approach to keep locks out of the fast path:
- the initial lookup of sampler views happens without taking a lock
- a per-texture lock is only taken when we have to modify the sampler
view(s)
- since each thread mostly operates only on the entry corresponding to
  its context, the main remaining issue is re-allocation of the sampler
  view array when it needs to be grown; this is handled by not freeing
  the old copy
Old copies of the sampler views array are kept around in a linked list
until the entire texture object is deleted. The total memory wasted
in this way is roughly equal to the size of the current sampler views
array.
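The fast path then looks roughly like this (a sketch with hypothetical
names):
   /* lock-free lookup first */
   struct st_sampler_view *sv = find_view(stObj->sampler_views, st);
   if (!sv) {
      /* per-texture lock only when creating or growing */
      mtx_lock(&stObj->validate_mutex);
      /* grow: allocate a larger array, publish it, and chain the old
         array onto a list that is freed with the texture object */
      mtx_unlock(&stObj->validate_mutex);
   }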
Fixes non-deterministic memory corruption in some
dEQP-EGL.functional.sharing.gles2.multithread.* tests, e.g.
dEQP-EGL.functional.sharing.gles2.multithread.simple.images.texture_source.create_texture_render
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Move the early-out for surface-based textures earlier. This narrows the
scope of the locking added in a follow-up commit.
Fix one remaining case of initializing a surface-based texture
without properly finalizing it.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
r600 expects the context that created the sampler view to still be alive
(there is a per-context list of sampler views).
svga currently bails when the context of destruction is not the same as
creation.
The GL state tracker, which is the only one that runs into the
multi-context subtleties (due to share groups), already guarantees that
sampler views are destroyed before their context of creation is destroyed.
Most drivers are context-agnostic, so the warning message in
pipe_sampler_view_release doesn't really make sense.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
We only need the lock to guard changes in the variant linked list. The
actual compilation can happen outside the lock, since we use the ready
fence as a guard.
v2: fix double-unlock
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
There's a race condition between si_shader_select_with_key and
si_bind_XX_shader:
   Thread 1                              Thread 2
   --------                              --------
   si_shader_select_with_key
     begin compiling the first
     variant (guarded by sel->mutex)
                                         si_bind_XX_shader
                                           select first_variant by default
                                           as state->current
                                         si_shader_select_with_key
                                           match state->current and early-out
Since thread 2 never takes sel->mutex, it may go on rendering without a
PM4 for that shader, for example.
The solution taken by this patch is to broaden the scope of
shader->optimized_ready to a fence shader->ready that applies to
all shaders. This does not hurt the fast path (if anything it makes
it faster, because we don't explicitly check is_optimized).
It will also allow reducing the scope of sel->mutex locks, but this is
deferred to a later commit for better bisectability.
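With the broadened fence, every consumer simply waits on it before using
the variant, e.g. (sketch):
   /* shader->ready now covers all variants, not just optimized ones */
   util_queue_fence_wait(&shader->ready);
   /* in the already-signaled case this is just a load and a compare */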
Fixes dEQP-EGL.functional.sharing.gles2.multithread.simple.buffers.bufferdata_render
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Fences are now 4 bytes instead of 96 bytes (on my 64-bit system).
Signaling a fence is a single atomic operation in the fast case plus a
syscall in the slow case.
Testing if a fence is signaled is the same as before (a simple comparison),
but waiting on a fence is now no more expensive than just testing it in
the fast (already signaled) case.
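In outline (a sketch; the real code is careful about memory barriers):
   /* fence state: 0 = signaled, 1 = unsignaled, 2 = unsignaled with waiters */
   static inline void fence_signal(uint32_t *f)
   {
      if (p_atomic_xchg(f, 0) == 2)
         futex_wake(f, INT_MAX);   /* slow path: wake all waiters */
   }

   static inline bool fence_is_signaled(uint32_t *f)
   {
      return *f == 0;              /* same simple comparison as before */
   }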
v2:
- style fixes
- use p_atomic_xxx macros with the right barriers
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
|
|
The closest to it among the old-style gcc builtins is
__sync_lock_test_and_set; however, that is only guaranteed to work with
values 0 and 1, and it only
provides an acquire barrier. I also don't know about other OSes, so we
provide a simple & stupid emulation via p_atomic_cmpxchg.
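The emulation could look roughly like this (a cmpxchg-loop sketch):
   static inline uint32_t p_atomic_xchg_32(uint32_t *v, uint32_t i)
   {
      uint32_t expected, actual = p_atomic_read(v);
      do {
         expected = actual;
         actual = p_atomic_cmpxchg(v, expected, i);
      } while (expected != actual);
      return actual;   /* full exchange semantics, built from cmpxchg */
   }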
|
|
v2: style fixes
Reviewed-by: Marek Olšák <marek.olsak@amd.com> (v1)
|
|
|
|
While modern pthread mutexes are very fast, they still incur a call into an
external DSO and the overhead of the generality and features of pthread
mutexes. Most mutexes in mesa only need lock/unlock, and the idea here is
that we can inline the atomic operation and make the fast case just two
instructions. Mutexes are subtle and finicky to implement, so we carefully
copy the implementation from Ulrich Drepper's well-written and well-reviewed
paper:
"Futexes Are Tricky"
http://www.akkadia.org/drepper/futex.pdf
We implement "mutex3", which gives us a mutex that has no syscalls on
uncontended lock or unlock. Further, the uncontended case boils down to a
cmpxchg and an untaken branch and the uncontended unlock is just a locked decr
and an untaken branch. We use __builtin_expect() to indicate that contention
is unlikely so that gcc will put the contention code out of the main code
flow.
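In outline, "mutex3" works like this (a condensed sketch of the paper's
code, spelled with gcc builtins; 0 = unlocked, 1 = locked, 2 = locked with
waiters):
   static inline void fast_mtx_lock(uint32_t *m)
   {
      uint32_t c = __sync_val_compare_and_swap(m, 0, 1);
      if (__builtin_expect(c != 0, 0)) {
         if (c != 2)
            c = __sync_lock_test_and_set(m, 2);
         while (c != 0) {                        /* contended slow path */
            futex_wait(m, 2);
            c = __sync_lock_test_and_set(m, 2);
         }
      }
   }

   static inline void fast_mtx_unlock(uint32_t *m)
   {
      if (__sync_fetch_and_sub(m, 1) != 1) {     /* there were waiters */
         *m = 0;
         futex_wake(m, 1);
      }
   }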
A fast mutex only supports lock/unlock; it can't be recursive or used with
condition variables. We keep the pthread mutex implementation around as
full_mtx_t for the few places where we use condition variables or recursive
locking. For platforms or compilers where futexes and atomics aren't
available, mtx_t falls back to the pthread mutex.
The pthread mutex lock/unlock overhead shows up on benchmarks for CPU-bound
applications. Most CPU-bound cases are helped, and some of our internal
bind_buffer_object-heavy benchmarks gain up to 10%.
Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Signed-off-by: Timothy Arceri <tarceri@itsqueeze.com>
|