~gongzg/beignet

Age	Commit message	Author	Files	Lines
2014-12-14	GBE/CL: use 2D image to implement large image1D_buffer.	Zhigang Gong	5	-15/+68
	Per the OpenCL spec, the minimum CL_DEVICE_IMAGE_MAX_BUFFER_SIZE is 65536, which is too large for a 1D surface on Gen platforms, so a 2D surface has to be used to implement it. As the OpenCL spec only allows the image1d_t to be accessed via the default sampler, this is doable: it will never use float coordinates and never use linear (non-nearest) filters. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
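	A minimal sketch of how a 1D buffer index could be folded onto a 2D surface. The commit message does not describe the actual mapping, so the row-major layout, the function names, and the row width of 8192 used in the usage note are all assumptions made for illustration.

```python
def buffer_index_to_2d(index, row_width):
    """Map a 1D texel index to (x, y) on a 2D surface.

    Assumes a row-major layout; the real driver layout may differ.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    return index % row_width, index // row_width


def surface_shape(texel_count, row_width):
    """Return (width, height) of the smallest 2D surface holding texel_count texels."""
    height = (texel_count + row_width - 1) // row_width  # round up to whole rows
    return min(texel_count, row_width), height
```

	With a hypothetical row width of 8192, the last texel of a 65536-texel buffer lands at `buffer_index_to_2d(65535, 8192)`, and `surface_shape(65536, 8192)` gives the surface size. Integer coordinates and nearest filtering are what make this folding safe, since no filter ever samples across a row boundary.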
2014-12-13	GBE: remove some image1d_buffer related builtin functions.	Zhigang Gong	2	-9/+9
	Per the OpenCL spec, an image1d buffer only supports sampler-less access. Remove the unsupported functions. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-12	works fine now.	Zhigang Gong	5	-13/+17
	Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-12	minor fix, still broken.	Zhigang Gong	2	-1/+4
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12	draft to fix sampler.	Zhigang Gong	6	-12/+164
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12	GBE: switch to use CLANG native image types.	Zhigang Gong	9	-424/+175
	No need to keep this hacky implementation now. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12	Refactor all image builtin functions.	Zhigang Gong	4	-416/+618
	Refactor almost all the image builtin related functions to simplify the code and get rid of most of the awful macros. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-05	Update optimization tips.	Zhigang Gong	1	-14/+92
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-05	CL: Don't find mesa source code.	Zhigang Gong	1	-6/+6
	As building with mesa has been broken for a long time, disable it to avoid potential build problems. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-04	GBE: Add some missing constant expression cases.	Zhigang Gong	4	-11/+135
	Mainly covers two types of constant expression cases: 1. The destination is a vector. 2. Some missing operators. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	GBE: Add constant pointer in the memcpy intrinsic.	Zhigang Gong	3	-1/+187
	Blender may generate this type of intrinsic; handle it now. Also fix an earlier typo that failed to assert when it should have. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	refine bswap utest to cover nsetc fail cases.	Luo Xionghu	2	-0/+8
	Two bswap calls in one block would trigger nsetc failures. The failure was already fixed in the backend; this just updates the utest. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	GBE: Fix the printf issue caused by new bti implementation	Ruiling Song	1	-4/+16
	The new bti implementation does not handle the printf internal buffer specially, which causes printf to print nothing. It seems better to declare the internal buffer for printf in the global memory space instead of the private space, so that the bti implementation doesn't have to handle it specially. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	GBE: Fix a disassembly bug.	Ruiling Song	1	-2/+2
	It looks like a typo, which wrongly interprets the bti/msg_type field. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	utests: Add const private array initialization test.	Ruiling Song	3	-0/+37
	Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04	GBE: support const private array initialization.	Ruiling Song	2	-45/+54
	Developers are allowed to declare an initialized private array as below: void func() { const int arr[] = {1, 2, 3, 4}; } The implementation simply puts them into the __constant memory space. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-03	Change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16.	Chuanbo Weng	1	-1/+1
	Because accessing global memory by uchar16/char16 will fully utilize memory bandwidth, change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16. Three OpenCV cases speed up with this patch: OCL_ThreshFixture_Threshold (25% improvement), OCL_MaxFixture_Max (105% improvement), OCL_MinFixture_Min (105% improvement). Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-03	GBE: Re-implement BTI logic in backend	Ruiling Song	2	-109/+159
	Previously, we searched from the use-points of pointers, such as load/store, and tried to find all the possible pointer sources. But sometimes we may meet a ptrtoint/add/inttoptr pattern, and what's worse, for the operands of the add instruction, it is hard to determine which one comes from a pointer and which one may be an offset. So what this patch does is start the search from the def-points we care about (such as GlobalVariable, kernel function pointer arguments, and AllocaInst) and traverse all their uses. During the traversal, we record the escape points (i.e. store/load/atomic instructions). Later, when we generate these kinds of instructions, we can query their possible sources and get the corresponding BTI. v2: refine the error message when an illegal pointer is found. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
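	The def-point traversal described above can be sketched as a worklist walk over a use graph. This is only an illustration under stated assumptions: the dictionary-based use graph, the opcode strings, and the function name are hypothetical and do not mirror the LLVM classes the backend actually works with.

```python
from collections import defaultdict, deque

# Opcodes treated as escape points, following the commit description.
ESCAPE_OPS = {"load", "store", "atomic"}


def collect_pointer_sources(defs, uses):
    """Walk forward from each def-point and record which defs reach each escape point.

    defs: iterable of defining values (globals, pointer arguments, allocas).
    uses: dict mapping a value to a list of (opcode, result) pairs that use it.
    Returns a dict mapping each escape instruction to the set of possible sources.
    """
    sources = defaultdict(set)
    for d in defs:
        seen = {d}
        work = deque([d])
        while work:
            value = work.popleft()
            for op, result in uses.get(value, []):
                if op in ESCAPE_OPS:
                    sources[result].add(d)   # record the escape point, stop here
                elif result not in seen:
                    seen.add(result)         # ptrtoint/add/inttoptr etc.: keep following
                    work.append(result)
    return dict(sources)
```

	Because the walk starts at the definition, a ptrtoint/add/inttoptr chain needs no guess about which add operand is the pointer: whatever the pointer flows into is simply followed.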
2014-12-02	utests: make utests maths ULP values consistent with specification	Meng Mengmeng	3	-8/+96
	Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-02	add utest of CL_MEM_ALLOC_HOST_PTR	Guo Yejun	3	-0/+32
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02	enable CL_MEM_ALLOC_HOST_PTR with user_ptr to avoid copy between GPU/CPU	Guo Yejun	3	-16/+33
	When user ptr is enabled, allocate page-aligned system memory for CL_MEM_ALLOC_HOST_PTR inside the driver and wrap it as GPU memory to avoid the copy between GPU and CPU. Also refine the related user_ptr code. Tests verified: beignet/utest, conformance/basic, buffers, mem_host_flags Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
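	As a small illustration of the page alignment mentioned above, the following sketch rounds a requested size up to a whole number of pages. The helper name is hypothetical, and the driver's real allocation path is not shown in the commit message; only the rounding arithmetic is illustrated.

```python
import mmap


def align_up(size, alignment=mmap.PAGESIZE):
    """Round size up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (size + alignment - 1) & ~(alignment - 1)
```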
2014-12-02	refine utest of cl_mem_use_host_ptr	Guo Yejun	2	-12/+1
	From application perspective, userptr is transparent. App does not need to know if userptr is enabled or not, just invokes standard OpenCL APIs. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02	add test of cl_mem_use_host_ptr into benchmark	Guo Yejun	5	-24/+66
	and also refine the code to move time_subtract into utest_helper.hpp/cpp Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02	clean code, the logic is already at the beginning of function	Guo Yejun	1	-16/+0
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02	fix bswap implementation issue.	Luo Xionghu	1	-36/+28
	The IR registers are SSA-defined, so each register should be assigned only once. This could fix the "dnetc -test rc5-72 0" bswap issue. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-02	fix dnetc overflow issue.	Luo Xionghu	1	-1/+2
	The overflow type should be unsigned for uadd_with_overflow. This patch fixes the 15 failures out of 32 in "dnetc -test rc5-72 0" when bswap is disabled. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
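	For reference, the unsigned semantics of an add-with-overflow operation can be modelled as below. This is a sketch of the arithmetic only, with a hypothetical Python function name; it is not the backend code touched by the patch.

```python
def uadd_with_overflow(a, b, bits=32):
    """Unsigned add: return (wrapped_sum, overflowed) for the given bit width."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)   # operands are treated as unsigned
    return total & mask, total > mask
```

	Treating the operands as signed instead would report overflow for the wrong inputs, for example when adding two values that both have the top bit set.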
2014-12-02	GBE: optimize GEP constant offset calculation.	Zhigang Gong	1	-3/+5
	If the type is an array or vector, we do not need to iterate over each element; we can compute the offset directly. v2: Use more generic SequentialType and StructType to identify whether we can compute the offset directly. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
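	The idea of computing the offset directly can be sketched as follows. The tuple-based type descriptors and the function name are hypothetical, and struct padding and alignment are deliberately ignored, so this only illustrates the shape of the computation rather than the backend's implementation.

```python
def constant_offset(type_desc, index):
    """Byte offset of a constant GEP index into a type.

    type_desc is either ("seq", element_size) for an array/vector-like type,
    or ("struct", [field_sizes]) for a struct (padding ignored in this sketch).
    """
    kind, payload = type_desc
    if kind == "seq":
        # All elements have the same size: one multiplication, no loop.
        return payload * index
    if kind == "struct":
        # Fields differ in size: sum the sizes of the preceding fields.
        return sum(payload[:index])
    raise TypeError(f"unsupported type kind: {kind}")
```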
2014-12-02	GBE: fix a regression caused by the negative index handling patch.	Zhigang Gong	1	-1/+1
	The typeIndex is correct and should not be multiplied by the step. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-12-02	Fix based on piglit OpenCL failed case (cl-api-compile-program).	Yan Wang	1	-4/+2
	1. Return the expected error code. 2. Don't destroy the cl_program object after a compile error because it may still be used in the future. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-01	utests: Add one case to test negative index array access.	Zhigang Gong	3	-0/+55
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2014-12-01	GBE: Fix bug with negative constant GEP index.	Zhigang Gong	3	-11/+13
	A GEP index may be a negative constant value, as below: %arrayidx = getelementptr inbounds <4 x i32> addrspace(1)* %src4, i32 %add.ptr.sum, i32 -4 The previous implementation assumed it was an unsigned value, which is incorrect and may cause an infinite loop. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
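	The difference between reading the index as unsigned and as signed can be shown with a short sketch. The helper name and the 4-byte element size in the usage note are assumptions for illustration; the actual fix lives in the backend's GEP lowering.

```python
def as_signed(value, bits=32):
    """Reinterpret the low `bits` bits of value as a two's complement signed integer."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):          # sign bit set
        return value - (1 << bits)
    return value
```

	The constant `i32 -4` has the bit pattern 0xFFFFFFFC. Read as unsigned it is a huge positive count, which is what made an element-by-element loop effectively never terminate; `as_signed(0xFFFFFFFC)` recovers -4, so with a hypothetical 4-byte element the byte offset is `as_signed(0xFFFFFFFC) * 4`.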
2014-12-01	GBE: Output CFG of Gen IR to dot file.	Ruiling Song	3	-0/+26
	Add an environment variable 'OCL_OUTPUT_CFG_GEN_IR' to control it. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
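	A rough sketch of what dumping a control-flow graph in dot format involves. Only the variable name OCL_OUTPUT_CFG_GEN_IR comes from the commit message; treating the value "1" as "enabled", the function names, and the edge-list input are assumptions, and the text is returned rather than written to a file to keep the sketch side-effect free.

```python
import os


def cfg_to_dot(name, edges):
    """Render a list of (source_block, target_block) edges as Graphviz dot text."""
    lines = [f'digraph "{name}" {{']
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


def maybe_dump_cfg(name, edges, env=os.environ):
    """Return dot text only when the environment variable enables it (assumed: "1")."""
    if env.get("OCL_OUTPUT_CFG_GEN_IR", "0") != "1":
        return None
    return cfg_to_dot(name, edges)
```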
2014-11-28	fix issue to pass utest of runtime_climage_from_boname for BDW	Guo Yejun	1	-2/+2
	To create cl image from bo name with offset, the offset needs to be added into surface_base_addr_lo/hi. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Tested-by: "Zhu, BingbingX" <bingbingx.zhu@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-27	utests: fix indent in CMakeLists.txt	Zhigang Gong	1	-10/+10
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-27	add test for clCreateImageFromLibvaIntel	Guo Yejun	3	-1/+226
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-27	fix issue to create cl image from libva with non-zero offset	Guo Yejun	4	-7/+5
	Beignet accepts a buffer object name to share data with libva, and it supports creating a cl image from the bo name with a non-zero offset, but this does not work on some platforms. The driver calls intel_bo_gem_create_from_name to retrieve the dri_bo, and the offset of the dri_bo is changed by the non-zero offset. On some platforms, this change of offset has a side effect when the kernel is executed again and intel_bo_gem_create_from_name is therefore called a second time. So do not change the offset of the dri_bo; instead keep the non-zero offset in cl_image until the surface state is written into the batch buffer. V2: correct the offset parameter passed to dri_bo_emit_reloc Signed-off-by: Guo Yejun <yejun.guo@intel.com>
2014-11-27	utests: reduce work group size to 256 to satisfy BYT platform.	Zhigang Gong	1	-1/+1
	The maximum work group size on BYT is 256. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-26	GBE: Place loop exits after loop blocks when sorting basic blocks.	Ruiling Song	1	-10/+84
	This again is to solve a register liveness issue. See the inline comment for details. This could fix an opencv failure under strict conformance mode: ./opencv_test_core --gtest_filter=OCL_Arithm/PolarToCart.angleInRadians/0 v2: Add a FIXME tag for irreducible graphs. v3: assert if the child number is larger than 2. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-25	GBE: don't split instruction for some special case.	Zhigang Gong	1	-1/+11
	If the src and dst are the same byte vector or the src is scalar, we don't need to split the instruction. Thus the following instructions: ( 269) (-f1) sel(8) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1Q }; ( 271) (-f1) sel(8) g95.16<2>:B g100.16<16,8,2>:B 0W { align1 WE_normal 2Q }; could be optimized to one simd16 instruction: ( 263) (-f1) sel(16) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1H }; Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2014-11-25	utests: fix a typo in test cases.	Zhigang Gong	1	-1/+1
	due to a stray . at utests/builtin_pow.cpp:79:112. Reported by "Rebecca N. Palmer" <rebecca_palmer@zoho.com>. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-11-25	utests: fix work group size issue in compiler_fill_image_2d_array.	Zhigang Gong	1	-2/+2
	Reduce work group size from 1024 to 256 to fit all platforms. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-25	GBE: disable spill register under simd16 mode.	Zhigang Gong	1	-3/+2
	Register spilling always costs much more than falling back to simd8, which could avoid register spilling or at least reduce the number of spilled registers. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-11-24	Change the IVB/HSW's max_work_group_size to 512, and BYT to 256.	Yang Rong	1	-15/+15
	To decide a kernel's work group size, an application should first get CL_DEVICE_MAX_WORK_GROUP_SIZE and then get CL_KERNEL_WORK_GROUP_SIZE after clBuildProgram. But some applications only check CL_DEVICE_MAX_WORK_GROUP_SIZE, and if the kernel runs in simd8 mode, or for some other cause, the chosen size may exceed CL_KERNEL_WORK_GROUP_SIZE. So change CL_DEVICE_MAX_WORK_GROUP_SIZE to the minimum CL_KERNEL_WORK_GROUP_SIZE. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
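	The recommended application-side logic can be sketched without any OpenCL calls by taking the two queried limits as plain integers. The function name is hypothetical, and the two limits stand in for the values an application would obtain from the CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_KERNEL_WORK_GROUP_SIZE queries; this is a one-dimensional sketch only.

```python
def pick_local_size(global_size, device_max, kernel_max):
    """Pick the largest local size that respects both limits and divides global_size.

    device_max: value queried for CL_DEVICE_MAX_WORK_GROUP_SIZE.
    kernel_max: value queried for CL_KERNEL_WORK_GROUP_SIZE after the build.
    """
    if global_size <= 0:
        raise ValueError("global_size must be positive")
    limit = min(device_max, kernel_max)      # the kernel limit may be the smaller one
    for size in range(min(limit, global_size), 0, -1):
        if global_size % size == 0:          # local size must divide the global size
            return size
    return 1
```

	An application that only consults the device limit would skip the `min` step, which is exactly the case the patch works around by lowering the device limit.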
2014-11-21	Fix the opencv_test_core/OCL_Arithm random segmentation fault.	Yang Rong	1	-37/+36
	If cl_event_delete is called before the callback, the event will be deleted if the application releases the event in the callback. So cl_event_delete must be moved to the end. V2: V1 did not delete the event if it was not a user event; it also needs to be deleted. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
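	The ordering problem can be modelled with a toy reference-counted object. Everything here is hypothetical (the class, the two initial references, the function name); it only illustrates why the runtime should drop its own reference after the callback has run, not how cl_event is implemented.

```python
class Event:
    """Toy event holding one application reference and one runtime reference."""

    def __init__(self):
        self.refs = 2
        self.freed = False

    def release(self):
        if self.freed:
            raise RuntimeError("release on an already freed event")
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def finish_event(event, callback):
    """Run the callback first, then drop the runtime's reference last."""
    callback(event)                      # the application may release its reference here
    alive_after_callback = not event.freed
    event.release()                      # runtime reference goes away only at the end
    return alive_after_callback
```

	If the runtime released its reference before invoking the callback, an application release inside the callback would free the event while the runtime was still using it.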
2014-11-21	BDW: Change the default tiling mode to TILING_Y on BDW.	Yang Rong	1	-3/+7
	TILING_Y's performance is better than TILING_X's on BDW, but almost the same on IVB/HSW. Use TILING_Y as the default tiling mode temporarily; the root cause of the different behavior between BDW and IVB/HSW still needs to be found. V2: still use a static and only initialize once. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19	add the reduced self loop node detection.	Luo Xionghu	1	-11/+26
	If the self loop node is reduced, the llvm loop info cannot detect such self loops; handle this by checking whether the compacted node has a successor pointing to itself. v2: differentiate the compacted node from the basic node to make the logic clearer, and comment out the while node as it is not enabled now. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19	Fix NO_TILING alignment bug.	Yang Rong	1	-1/+1
	The height also needs to be aligned when using CL_NO_TILING. This patch can fix some tiling_y errors. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19	re-enable userptr with fix: CPU access after GPU finishes the rendering	Guo Yejun	3	-15/+41
	1. the wait logic is integrated into function cl_mem_map/unmap_auto 2. use cl_mem_map/unmap_auto for userptr inside clEnqueueRead/WriteBuffer 3. do not use cl_buffer_subdata for userptr, use cl_mem_map/memcpy instead Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-18	reuse the loop info from llvm.	Luo Xionghu	2	-36/+21
	The original loop detection algorithm caused a 10x regression in luxmark build performance; this patch reuses the loop info from llvm to handle SelfLoopNode. The trimmed path cannot recognize nested while structures (if nodes inside a while caused the performance regression). Also, the simple while loop node is still not handled yet. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-18	Change the IVB/HSW L3 SQC credit setting.	Yang Rong	1	-2/+2
	Set the L3SQ General Priority Credit to max and the L3SQ High Priority Credit to zero; this slightly improves performance, by about 2% on luxmark. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>