Age  Commit message  Author  Files  Lines
2014-12-14GBE/CL: use 2D image to implement large image1D_buffer.image_refineZhigang Gong5-15/+68
Per the OpenCL spec, the minimum CL_DEVICE_IMAGE_MAX_BUFFER_SIZE is 65536, which is too large for a 1D surface on Gen platforms, so a 2D surface has to be used to implement it. Because the OpenCL spec only allows image1d_t to be accessed via the default sampler, this is doable: such accesses never use float coordinates and never use linear (non-nearest) filters. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
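The idea of backing a long 1D buffer image with a 2D surface can be illustrated with a small index mapping. This is only a sketch: the helper names and the row width are assumptions made for illustration, not values or functions taken from the Beignet sources.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: a texel position inside the backing 2D surface.
struct Coord2D {
  std::size_t x;
  std::size_t y;
};

// Map a linear 1D texel index to 2D coordinates.
// rowWidth is an assumed surface width in texels, chosen by the caller.
inline Coord2D mapLinearTo2D(std::size_t index, std::size_t rowWidth) {
  assert(rowWidth > 0);
  return Coord2D{index % rowWidth, index / rowWidth};
}

// Number of rows the 2D surface needs to hold `count` texels.
inline std::size_t rowsNeeded(std::size_t count, std::size_t rowWidth) {
  assert(rowWidth > 0);
  return (count + rowWidth - 1) / rowWidth;
}
```

Because the access always uses integer coordinates and nearest filtering, an exact integer mapping like this is sufficient; no interpolation across row boundaries can occur.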
2014-12-13GBE: remove some image1d_buffer related builtin functions.Zhigang Gong2-9/+9
Per the OpenCL spec, image1d buffers only support sampler-less access. Remove the unsupported functions. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-12works fine now.Zhigang Gong5-13/+17
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-12minor fix, still broken.Zhigang Gong2-1/+4
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12draft to fix sampler.Zhigang Gong6-12/+164
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12GBE: switch to use CLANG native image types.Zhigang Gong9-424/+175
No need to keep this hacky implementation now. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-12Refactor all image builtin functions.Zhigang Gong4-416/+618
Refactor almost all the image builtin related functions to simplify the code and get rid of most of the awful macros. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-05Update optimization tips.Zhigang Gong1-14/+92
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-05CL: Don't find mesa source code.Zhigang Gong1-6/+6
As building with mesa has been broken for a long time, disable it to avoid potential build problems. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-04GBE: Add some missing constant expression cases.Zhigang Gong4-11/+135
Mainly covers two types of constant expression cases: 1. The destination is a vector. 2. Some missing operators. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04GBE: Add constant pointer in the memcpy intrinsic.Zhigang Gong3-1/+187
Blender may generate this type of intrinsic; handle it now. Also fix a previous typo that failed to assert when it should. Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04refine bswap utest to cover nsetc fail cases.Luo Xionghu2-0/+8
Two bswap calls in one block would trigger nsetc failures. The failure was already fixed in the backend; this just updates the utest. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04GBE: Fix the printf issue caused by new bti implementationRuiling Song1-4/+16
The new bti implementation does not handle the printf internal buffer specially, which causes printf to print nothing. I think it is better to declare the internal buffer for printf in the global memory space instead of the private space, so that the bti implementation doesn't have to handle it specially. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04GBE: Fix a disassembly bug.Ruiling Song1-2/+2
It looks like a typo that misinterprets the bti/msg_type field. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04utests: Add const private array initialization test.Ruiling Song3-0/+37
Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-04GBE: support const private array initialization.Ruiling Song2-45/+54
Developers are allowed to declare an initialized private array like below: void func() { const int arr[] = {1, 2, 3, 4}; } The implementation simply puts such arrays into the __constant memory space. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-03Change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16.Chuanbo Weng1-1/+1
Accessing global memory with uchar16/char16 fully utilizes the memory bandwidth, so change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16. Three OpenCV cases speed up with this patch: OCL_ThreshFixture_Threshold, 25% improvement; OCL_MaxFixture_Max, 105% improvement; OCL_MinFixture_Min, 105% improvement. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-03GBE: Re-implement BTI logic in backendRuiling Song2-109/+159
Previously, we searched from the use-point of pointers, such as load/store, and tried to find all the possible pointer sources. But sometimes we meet a ptrtoint/add/inttoptr pattern, and what's worse, for the operands of the add instruction it is hard to determine which one comes from a pointer and which one may be an offset. So what this patch does is start the search from the def-point (the GlobalVariable, kernel function pointer argument, and AllocaInst that we care about) and traverse all their uses. During the traversal, we record the escape points (i.e. store/load/atomic instructions). Later, when we generate these kinds of instructions, we can query their possible sources and get the corresponding BTI. v2: refine the error message when an illegal pointer is found. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
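The def-to-use search described above can be sketched on a toy value graph. Everything here is an assumption made for illustration: `Value`, `collectPointerSources`, and the integer ids stand in for LLVM values and are not the types or functions used in the actual backend. The sketch also stops at each memory access and does not follow pointers loaded from memory.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Toy stand-in for an IR value; not an LLVM type.
struct Value {
  bool isMemoryAccess = false;  // models a load/store/atomic (an escape point)
  std::vector<int> users;       // ids of the values that use this one
};

// For every pointer definition in `roots`, walk all transitive uses and
// record which roots can reach each memory access.
inline std::map<int, std::set<int>> collectPointerSources(
    const std::vector<Value>& values, const std::vector<int>& roots) {
  std::map<int, std::set<int>> sources;
  for (int root : roots) {
    std::vector<int> work{root};
    std::set<int> visited;
    while (!work.empty()) {
      int v = work.back();
      work.pop_back();
      assert(v >= 0 && static_cast<std::size_t>(v) < values.size());
      if (!visited.insert(v).second) continue;  // already handled for this root
      if (values[v].isMemoryAccess) {
        sources[v].insert(root);                // escape point: remember the source
        continue;
      }
      for (int user : values[v].users) work.push_back(user);
    }
  }
  return sources;
}
```

Starting from definitions sidesteps the ambiguity of an add with two integer operands: whichever operand the pointer flows through, the add is simply one more use on the path from the definition to the memory access.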
2014-12-02utests: make utests maths ULP values consistent with specificationMeng Mengmeng3-8/+96
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-02add utest of CL_MEM_ALLOC_HOST_PTRGuo Yejun3-0/+32
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02enable CL_MEM_ALLOC_HOST_PTR with user_ptr to avoid copy between GPU/CPUGuo Yejun3-16/+33
When user ptr is enabled, allocate page-aligned system memory for CL_MEM_ALLOC_HOST_PTR inside the driver and wrap it as GPU memory to avoid the copy between GPU and CPU. Also refine some of the related user_ptr code. Tests verified: beignet/utest, conformance/basic, buffers, mem_host_flags. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
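As a rough illustration of the page-aligned allocation step only, the following sketch uses the POSIX calls `sysconf` and `posix_memalign`. The function name `allocPageAligned` is hypothetical, and the sketch does not show how the driver wraps the memory as a GPU buffer object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper: allocate `size` bytes, rounded up to whole pages,
// starting on a page boundary. Returns nullptr on failure.
// The caller releases the memory with free().
inline void* allocPageAligned(std::size_t size) {
  long page = sysconf(_SC_PAGESIZE);
  assert(page > 0);
  std::size_t pageSize = static_cast<std::size_t>(page);
  std::size_t rounded = (size + pageSize - 1) / pageSize * pageSize;
  void* ptr = nullptr;
  if (posix_memalign(&ptr, pageSize, rounded) != 0) return nullptr;
  return ptr;
}
```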
2014-12-02refine utest of cl_mem_use_host_ptrGuo Yejun2-12/+1
From application perspective, userptr is transparent. App does not need to know if userptr is enabled or not, just invokes standard OpenCL APIs. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02add test of cl_mem_use_host_ptr into benchmarkGuo Yejun5-24/+66
and also refine the code to move time_subtract into utest_helper.hpp/cpp Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02clean code, the logic is already at the beginning of functionGuo Yejun1-16/+0
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2014-12-02fix bswap implementation issue.Luo Xionghu1-36/+28
The IR registers are SSA defined, so each register should be assigned only once. This fixes the "dnetc -test rc5-72 0" bswap issue. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
2014-12-02fix dnetc overflow issue.Luo Xionghu1-1/+2
The overflow type should be unsigned for uadd_with_overflow. This patch fixes the 15 failures out of 32 in "dnetc -test rc5-72 0" when bswap is disabled. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
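The semantics of an unsigned add-with-overflow can be sketched in plain C++. This models the behaviour only; `uaddWithOverflow` is a hypothetical name and is not how the backend lowers the intrinsic.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns {sum, overflowed}. Unsigned arithmetic wraps modulo 2^32, so the
// carry out is set exactly when the wrapped sum is smaller than an operand.
inline std::pair<std::uint32_t, bool> uaddWithOverflow(std::uint32_t a,
                                                       std::uint32_t b) {
  std::uint32_t sum = a + b;
  bool overflowed = sum < a;
  // Cross-check against a 64-bit addition in debug builds.
  assert(overflowed == ((static_cast<std::uint64_t>(a) + b) >> 32 != 0));
  return {sum, overflowed};
}
```

Doing the same comparison on signed values gives wrong answers for operands with the top bit set, which is why the operand type matters here.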
2014-12-02GBE: optimize GEP constant offset calculation.Zhigang Gong1-3/+5
If the type is an array or a vector, we do not need to iterate over each element; we can compute the offset directly. v2: Use the more generic SequentialType and StructType to identify whether we can compute the offset directly. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
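The difference between iterating over elements and computing the offset directly can be shown with two small functions. Both names are hypothetical and the element size is passed in by the caller; this is a sketch of the arithmetic, not the backend code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Iterative form: add the element size once per element skipped.
inline std::uint64_t offsetByIteration(std::uint64_t index,
                                       std::uint64_t elemSize) {
  std::uint64_t offset = 0;
  for (std::uint64_t i = 0; i < index; ++i) offset += elemSize;
  return offset;
}

// Direct form: every element of an array or vector has the same size,
// so the offset is a single multiplication.
inline std::uint64_t offsetDirect(std::uint64_t index, std::uint64_t elemSize) {
  assert(index == 0 ||
         elemSize <= std::numeric_limits<std::uint64_t>::max() / index);
  return index * elemSize;
}
```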
2014-12-02GBE: fix a regression caused by the negative index handling patch.Zhigang Gong1-1/+1
The typeIndex is correct and should not be multiplied by the step. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-12-02Fix based on piglit OpenCL failed case (cl-api-compile-program).Yan Wang1-4/+2
1. Return the expected error code. 2. Don't destroy the cl_program object after a compile error because it may still be used in the future. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-01utests: Add one case to test negative index array access.Zhigang Gong3-0/+55
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2014-12-01GBE: Fix bug with negative constant GEP index.Zhigang Gong3-11/+13
A GEP index may be a negative constant value, as below: %arrayidx = getelementptr inbounds <4 x i32> addrspace(1)* %src4, i32 %add.ptr.sum, i32 -4 The previous implementation assumed it was an unsigned value, which is incorrect and may cause an infinite loop. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
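A small sketch shows why the signedness matters. The function names are hypothetical, and the 4-byte element size matches the i32 elements in the example above.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset for a constant index, treating the index as signed.
inline std::int64_t gepByteOffset(std::int32_t index, std::uint32_t elemSize) {
  assert(elemSize > 0);
  return static_cast<std::int64_t>(index) * static_cast<std::int64_t>(elemSize);
}

// What the same bits look like if the index is misread as unsigned.
inline std::uint32_t asUnsignedIndex(std::int32_t index) {
  return static_cast<std::uint32_t>(index);
}
```

With an index of -4, the signed reading gives an offset of -16 bytes, while the unsigned reading gives 4294967292 elements. A loop that steps once per element would then run for billions of iterations, which in practice looks like a hang.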
2014-12-01GBE: Output CFG of Gen IR to dot file.Ruiling Song3-0/+26
Add an environment variable 'OCL_OUTPUT_CFG_GEN_IR' to control it. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-28fix issue to pass utest of runtime_climage_from_boname for BDWGuo Yejun1-2/+2
To create cl image from bo name with offset, the offset needs to be added into surface_base_addr_lo/hi. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Tested-by: "Zhu, BingbingX" <bingbingx.zhu@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-27utests: fix indent in CMakeLists.txtZhigang Gong1-10/+10
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-27add test for clCreateImageFromLibvaIntelGuo Yejun3-1/+226
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-27fix issue to create cl image from libva with non-zero offsetGuo Yejun4-7/+5
Beignet accepts a buffer object name to share data with libva, and it supports creating a cl image from the bo name with a non-zero offset, but this does not work on some platforms. The driver calls intel_bo_gem_create_from_name to retrieve the dri_bo, and the offset of the dri_bo is changed by the non-zero offset. On some platforms, changing the offset has a side effect when the kernel is executed again and intel_bo_gem_create_from_name is therefore called a second time. So do not change the offset of the dri_bo; instead keep the non-zero offset in cl_image until we write the surface state into the batch buffer. V2: correct the offset parameter passed to dri_bo_emit_reloc. Signed-off-by: Guo Yejun <yejun.guo@intel.com>
2014-11-27utests: reduce work group size to 256 to satisfy BYT platform.Zhigang Gong1-1/+1
The maximum work group size on BYT is 256. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-26GBE: Place loop exits after loop blocks when sorting basic blocks.Ruiling Song1-10/+84
This is again to solve a register liveness issue. See the inline comments for details. This fixes an opencv failure under strict conformance mode: ./opencv_test_core --gtest_filter=OCL_Arithm/PolarToCart.angleInRadians/0 v2: Add a FIXME tag for irreducible graphs. v3: assert if the child number is larger than 2. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-25GBE: don't split instruction for some special case.Zhigang Gong1-1/+11
If the src and dst are the same byte vector, or the src is scalar, we don't need to split the instruction. Thus the following instructions: ( 269) (-f1) sel(8) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1Q }; ( 271) (-f1) sel(8) g95.16<2>:B g100.16<16,8,2>:B 0W { align1 WE_normal 2Q }; can be optimized into one simd16 instruction: ( 263) (-f1) sel(16) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1H }; Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2014-11-25utests: fix a typo in test cases.Zhigang Gong1-1/+1
Caused by a stray "." at utests/builtin_pow.cpp:79:112. Reported by "Rebecca N. Palmer" <rebecca_palmer@zoho.com>. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-11-25utests: fix work group size issue in compiler_fill_image_2d_array.Zhigang Gong1-2/+2
Reduce work group size from 1024 to 256 to fit all platforms. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2014-11-25GBE: disable spill register under simd16 mode.Zhigang Gong1-3/+2
Register spilling always costs much more than falling back to simd8, which can avoid register spilling or at least reduce the number of spilled registers. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-11-24Change the IVB/HSW's max_work_group_size to 512, and BYT to 256.Yang Rong1-15/+15
To decide the kernel's work group size, an application should query CL_DEVICE_MAX_WORK_GROUP_SIZE first, and then query CL_KERNEL_WORK_GROUP_SIZE after clBuildProgram. But some applications only check CL_DEVICE_MAX_WORK_GROUP_SIZE, and if the kernel runs in simd8 mode, or for other reasons, they may exceed CL_KERNEL_WORK_GROUP_SIZE. So change CL_DEVICE_MAX_WORK_GROUP_SIZE to the minimum CL_KERNEL_WORK_GROUP_SIZE. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-21Fix the opencv_test_core/OCL_Arithm random segmentation fault.Yang Rong1-37/+36
If cl_event_delete is called before the callback, the event will already be deleted if the application releases the event in the callback. So cl_event_delete must be moved to the end. V2: V1 did not delete the event if it was not a user event; it also needs to be deleted. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-21BDW: Change the default tiling mode to TILING_Y on BDW.Yang Rong1-3/+7
TILING_Y's performance is better than TILING_X's on BDW, but almost the same on IVB/HSW. Use TILING_Y as the default tiling mode temporarily; we still need to find the root cause of the different behavior between BDW and IVB/HSW. V2: still use a static and only initialize it once. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19add the reduced self loop node detection.Luo Xionghu1-11/+26
If the self loop node is reduced, the llvm loop info cannot detect this kind of self loop; handle it by checking whether the compacted node has a successor pointing to itself. v2: differentiate the compacted node from the basic node to make the logic clearer, and comment out the while node as it is not enabled now. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19Fix NO_TILING alignment bug.Yang Rong1-1/+1
The height also needs to be aligned when CL_NO_TILING is used. This patch fixes some tiling_y errors. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-19re-enable userptr with fix: CPU access after GPU finishes the renderingGuo Yejun3-15/+41
1. The wait logic is integrated into the functions cl_mem_map/unmap_auto. 2. Use cl_mem_map/unmap_auto for userptr inside clEnqueueRead/WriteBuffer. 3. Do not use cl_buffer_subdata for userptr; use cl_mem_map/memcpy instead. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-18reuse the loop info from llvm.Luo Xionghu2-36/+21
The original loop detection algorithm caused a 10x regression in luxmark build performance; this patch reuses the loop info from llvm to handle SelfLoopNode. The trimmed path cannot recognize nested while structures (if nodes inside a while caused the performance regression). The simple while loop node is also still not handled. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-11-18Change the IVB/HSW L3 SQC credit setting.Yang Rong1-2/+2
Set the L3SQ General Priority Credit to max and the L3SQ High Priority Credit to zero. This slightly improves performance, by about 2% in luxmark. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>