Age | Commit message | Author | Files | Lines |
|
Per the OpenCL spec, the minimum CL_DEVICE_IMAGE_MAX_BUFFER_SIZE is 65536,
which is too large for a 1D surface on Gen platforms.
We have to use a 2D surface to implement it. As the OpenCL spec only allows
image1d_t to be accessed via the default sampler, this is doable: it
will never use float coordinates and never use linear (non-nearest)
filters.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
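The index mapping implied by the message above can be sketched as follows. This is only an illustration, not Beignet's code: the row width SURFACE_ROW_WIDTH and the names coord2d, image1d_buffer_to_2d and rows_needed are hypothetical. Because coordinates are integers and filtering is nearest-only, a plain divide/modulo is enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row width of the backing 2D surface; the value the driver
 * actually uses is not stated in the commit message. */
#define SURFACE_ROW_WIDTH 8192u

typedef struct {
    unsigned x;
    unsigned y;
} coord2d;

/* Map an integer 1D texel index onto (x, y) in the 2D surface. */
static coord2d image1d_buffer_to_2d(size_t index)
{
    coord2d c;
    c.x = (unsigned)(index % SURFACE_ROW_WIDTH);
    c.y = (unsigned)(index / SURFACE_ROW_WIDTH);
    return c;
}

/* Number of rows needed to hold `texels` elements (rounded up). */
static unsigned rows_needed(size_t texels)
{
    assert(texels > 0);
    return (unsigned)((texels + SURFACE_ROW_WIDTH - 1) / SURFACE_ROW_WIDTH);
}
```

With this assumed width, a 65536-texel buffer occupies 8 rows.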
|
|
Per the OpenCL spec, an image1d buffer only supports sampler-less access.
Remove the unsupported functions.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
No need to keep this hacky implementation now.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Refactor almost all the image builtin related functions to simplify the code
and get rid of most of the awful macros.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
As the build with mesa has been broken for a long time, we
disable it to avoid potential build problems.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Mainly for two kinds of constant expression cases:
1. The destination is a vector.
2. Some missing operators.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Blender may generate this type of intrinsic. Fix it now.
Also fix a previous typo that failed to assert when it
should have asserted.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Two bswap calls in one block would trigger nsetc failures.
The failure was already fixed in the backend; just update the utest.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
The new bti implementation does not handle the printf internal buffer specially,
which causes printf to print nothing. I think it is better to declare the
internal buffer for printf in the global memory space instead of the private space.
Then the bti implementation doesn't have to handle it specially.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
It looks like a typo, which wrongly interprets the bti/msg_type field.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Developers are allowed to declare an initialized private array like below:
void func() {
    const int arr[] = {1, 2, 3, 4};
}
The implementation simply puts such arrays into the __constant memory space.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Because accessing global memory with uchar16/char16 fully utilizes
memory bandwidth, change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from
8 to 16. Three OpenCV cases speed up with this patch:
OCL_ThreshFixture_Threshold, 25% improvement
OCL_MaxFixture_Max, 105% improvement
OCL_MinFixture_Min, 105% improvement.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Previously, we searched from the use-points of pointers, such as load/store,
and tried to find all the possible pointer sources. But sometimes we meet a
ptrtoint/add/inttoptr pattern and, what's worse, for the operands of the
add instruction it is hard to determine which one comes from a pointer and
which one may be an offset.
So what this patch does is start the search from the def-points
(GlobalVariable, kernel function pointer arguments, AllocaInst, i.e. the ones
we care about) and traverse all their uses. During the traversal,
we record the escape points (i.e. store/load/atomic instructions).
Later, when we generate these kinds of instructions, we can query their
possible sources and get the corresponding BTI.
v2:
Refine the error message when an illegal pointer is found.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
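The def-to-use search described above can be sketched on a toy value graph. This is a simplified illustration under stated assumptions, not the LLVM-based pass itself: the node struct, the three node kinds and the bitmask of sources are all hypothetical stand-ins.

```c
#include <assert.h>

#define MAX_NODES 16
#define MAX_USERS 4

/* DEF_POINT: global variable, pointer argument or alloca.
 * ARITH: anything in between, e.g. a ptrtoint/add/inttoptr chain.
 * ESCAPE_POINT: load/store/atomic that needs to know its sources. */
typedef enum { DEF_POINT, ARITH, ESCAPE_POINT } node_kind;

typedef struct {
    node_kind kind;
    int users[MAX_USERS];  /* indices of the values that use this one */
    int num_users;
    unsigned sources;      /* bitmask of def-points that reach this node */
} node;

/* Walk every use reachable from def-point `def` and tag each visited node
 * with bit `def`. A node already tagged is not revisited, so cycles end. */
static void propagate(node *g, int cur, int def)
{
    if (g[cur].sources & (1u << def))
        return;
    g[cur].sources |= 1u << def;
    for (int i = 0; i < g[cur].num_users; i++)
        propagate(g, g[cur].users[i], def);
}

/* Start one traversal per def-point; afterwards each escape point can be
 * queried for its possible sources. */
static void find_pointer_sources(node *g, int n)
{
    assert(n <= MAX_NODES);
    for (int i = 0; i < n; i++)
        g[i].sources = 0;
    for (int i = 0; i < n; i++)
        if (g[i].kind == DEF_POINT)
            propagate(g, i, i);
}
```

Starting from the definitions means the add operands never need to be classified as "pointer" or "offset": whatever a def-point reaches is tagged.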
|
|
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
When user ptr is enabled, allocate page-aligned system memory for
CL_MEM_ALLOC_HOST_PTR inside the driver and wrap it as GPU memory
to avoid the copy between GPU and CPU.
Also refine the related user_ptr code.
Tests verified: beignet/utest, conformance/basic, buffers, mem_host_flags
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
From the application's perspective, userptr is transparent. The app does not
need to know whether userptr is enabled; it just invokes the standard
OpenCL APIs.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
and also refine the code to move time_subtract into utest_helper.hpp/cpp
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
The IR registers are SSA-defined, so each register should be assigned
once. This fixes the "dnetc -test rc5-72 0" bswap issue.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
The overflow type should be unsigned for uadd_with_overflow.
This patch fixes the 15 failures out of 32 in "dnetc -test rc5-72 0"
when bswap is disabled.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@intel.com>
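Why the signedness matters can be shown with a small sketch. It is an illustration of the general carry test, not the backend's implementation; the function names are hypothetical, and the signed variant is included only to show the kind of wrong answer a signed comparison gives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned add with overflow: the sum wraps modulo 2^32, and a carry
 * happened exactly when the wrapped sum is smaller than an operand. */
static bool uadd_overflow_u32(uint32_t a, uint32_t b, uint32_t *sum)
{
    assert(sum != NULL);
    *sum = a + b;
    return *sum < a;  /* must be an unsigned comparison */
}

/* The same test with a signed comparison, for contrast only. */
static bool carry_with_signed_compare(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return (int32_t)sum < (int32_t)a;
}
```

For a = 0x7fffffff and b = 1 there is no unsigned overflow, yet the signed comparison reports one.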
|
|
If the type is an array or vector, we do not need to iterate over each element.
We can compute the offset directly.
v2:
Use more generic SequentialType and StructType to identify whether
we can compute the offset directly.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
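The difference between the two cases can be sketched as below. This is a simplified model, not the LLVM API: sequential_offset and struct_offset are hypothetical helpers, and struct padding is ignored to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential type (array or vector): every element has the same size, so
 * the byte offset is a single multiplication. */
static int64_t sequential_offset(int64_t index, size_t elem_size)
{
    return index * (int64_t)elem_size;
}

/* Struct type: fields may differ in size, so the sizes of the preceding
 * fields are summed one by one. Padding is ignored in this sketch. */
static size_t struct_offset(const size_t *field_sizes, size_t num_fields,
                            size_t field_index)
{
    assert(field_index < num_fields);
    size_t offset = 0;
    for (size_t i = 0; i < field_index; i++)
        offset += field_sizes[i];
    return offset;
}
```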
|
|
The typeIndex is correct and should not multiply the step.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
|
|
1. Return the expected error code.
2. Don't destroy the cl_program object after a compile error because it
may still be used in the future.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
A GEP index may be a negative constant value, as below:
%arrayidx = getelementptr inbounds <4 x i32> addrspace(1)* %src4, i32 %add.ptr.sum, i32 -4
The previous implementation assumed it was an unsigned value, which is incorrect
and may cause an infinite loop.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
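The effect of the two interpretations can be sketched as below, using the i32 -4 index from the example with a 4-byte element. The helper names are hypothetical and this is not the patched code; it only shows how far apart the two readings of the same bits are.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit constant index as signed (sign-extended). */
static int64_t index_as_signed(uint32_t raw_bits)
{
    return (int64_t)(int32_t)raw_bits;
}

/* Read the same bits as unsigned (zero-extended). */
static int64_t index_as_unsigned(uint32_t raw_bits)
{
    return (int64_t)raw_bits;
}

/* Byte offset contributed by one index into a sequential type. */
static int64_t byte_offset(int64_t index, int64_t elem_size)
{
    assert(elem_size > 0);
    return index * elem_size;
}
```

Read as signed, the index steps 16 bytes backwards; read as unsigned, it becomes an offset of roughly 16 GiB forwards.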
|
|
Add an environment variable 'OCL_OUTPUT_CFG_GEN_IR' to control it.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
To create a cl image from a bo name with an offset, the offset needs to
be added into surface_base_addr_lo/hi.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Tested-by: "Zhu, BingbingX" <bingbingx.zhu@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
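A minimal sketch of folding the offset into a split 64-bit address follows. The surface_addr struct and make_surface_addr are hypothetical stand-ins for the real surface state, which has many more fields; only the two field names come from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two address fields of a surface state. */
typedef struct {
    uint32_t surface_base_addr_lo;
    uint32_t surface_base_addr_hi;
} surface_addr;

/* Add the image offset to the full 64-bit address first, then split it,
 * so that a carry out of the low 32 bits reaches the high half. */
static surface_addr make_surface_addr(uint64_t bo_gpu_addr, uint32_t offset)
{
    uint64_t addr = bo_gpu_addr + offset;
    assert(addr >= bo_gpu_addr);  /* no 64-bit wrap-around */
    surface_addr s;
    s.surface_base_addr_lo = (uint32_t)(addr & 0xffffffffu);
    s.surface_base_addr_hi = (uint32_t)(addr >> 32);
    return s;
}
```

Adding the offset to the low half alone would lose the carry whenever the sum crosses a 4 GiB boundary.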
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Beignet accepts a buffer object name to share data with libva.
It supports creating a cl image from the bo name with a non-zero
offset, but this does not work on some platforms.
The driver calls intel_bo_gem_create_from_name to retrieve the
dri_bo, and the offset of the dri_bo is changed by the non-zero offset.
On some platforms, changing the offset has a side effect when
the kernel is executed again and intel_bo_gem_create_from_name
is therefore called a second time.
So, do not change the offset of the dri_bo; instead keep the non-zero
offset in cl_image until we write
the surface state into the batch buffer.
V2: correct the offset parameter passed to dri_bo_emit_reloc
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
|
|
The maximum work group size on BYT is 256.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
This again solves a register liveness issue; see the inline comments for details.
This could fix an OpenCV failure under strict conformance mode:
./opencv_test_core --gtest_filter=OCL_Arithm/PolarToCart.angleInRadians/0
v2:
Add a FIXME tag for irreducible graphs.
v3:
Assert if the number of children is larger than 2.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
If the src and dst are the same byte vector or the src
is scalar, we don't need to split the instruction.
Thus the following instructions:
( 269) (-f1) sel(8) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1Q };
( 271) (-f1) sel(8) g95.16<2>:B g100.16<16,8,2>:B 0W { align1 WE_normal 2Q };
could be optimized to one simd16 instruction:
( 263) (-f1) sel(16) g95<2>:B g100<16,8,2>:B 0W { align1 WE_normal 1H };
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
due to a stray . at utests/builtin_pow.cpp:79:112.
Reported by "Rebecca N. Palmer" <rebecca_palmer@zoho.com>.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
|
|
Reduce work group size from 1024 to 256 to fit all platforms.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Register spilling always costs much more than falling back to simd8,
which could avoid register spilling or at least reduce the number of spilled
registers.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
|
|
To decide a kernel's work group size, an application should get
CL_DEVICE_MAX_WORK_GROUP_SIZE first, and then get CL_KERNEL_WORK_GROUP_SIZE
after clBuildProgram.
But some applications only check CL_DEVICE_MAX_WORK_GROUP_SIZE, and if the kernel runs
in simd8 mode, or for some other reason, they may exceed CL_KERNEL_WORK_GROUP_SIZE.
So change CL_DEVICE_MAX_WORK_GROUP_SIZE to the minimum CL_KERNEL_WORK_GROUP_SIZE.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
If cl_event_delete is called before the callback, the event will be deleted if
the application releases the event in the callback. So cl_event_delete must be moved to the end.
V2: V1 did not delete the event if it was not a user event; it also needs to be deleted.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
TILING_Y performs better than TILING_X on BDW, but almost the same
on IVB/HSW. Use TILING_Y as the default tiling mode temporarily; we still need
to find the root cause of the different behavior between BDW and IVB/HSW.
V2: still use a static and only initialize it once.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
If the self-loop node is reduced, the LLVM loop info cannot detect
this kind of self loop. Handle it by checking whether the compacted
node has a successor pointing to itself.
v2: differentiate the compacted node from the basic node to make the logic
clearer; comment out the while node as it is not enabled now.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Also need to align the height when CL_NO_TILING. This patch can fix some tiling_y errors.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
1. the wait logic is integrated into function cl_mem_map/unmap_auto
2. use cl_mem_map/unmap_auto for userptr inside clEnqueueRead/WriteBuffer
3. do not use cl_buffer_subdata for userptr, use cl_mem_map/memcpy instead
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
The original loop detection algorithm caused a 10x regression in luxmark build
performance. This patch reuses the loop info from LLVM to
handle SelfLoopNode.
The trimmed path cannot recognize nested while structures (if nodes inside a
while caused the performance regression). Also, the simple while loop node
is still not handled yet.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Set the L3SQ General Priority Credit to max and the L3SQ High Priority Credit
to zero. This slightly improves performance, by about 2% on luxmark.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|