The CL_ENQUEUE_FILL_BUFFER_ALIGN8_* internal programs are all the same
program, so the program's ref is added only once. But when the context is
deleted, the internal program count is calculated by adding them individually.
This mismatch causes the context to be freed by mistake.
Create a different CL_ENQUEUE_FILL_BUFFER_ALIGN8_* program for each, for clarity.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
When merging STOREs, we should use the very last instruction
as the insertion point.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
This is for users who do not have root permission.
V2: change the option name to OCL_ICD_INSTALL_PREFIX.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
ABS(UD) = UD on Gen, so delete it;
otherwise compilation fails on some platforms.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
GCC6 refined the C headers, so the header of each needed function must be
included explicitly, like math.h for abs.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
This fixes the basic test in the conformance tests, which failed for vec8 of
char because of overflow. It also fixes many OpenCV test items that failed
because of an offset error.
(1) Change the size of searchInsnArray to 32, the max size for char,
and add an overflow check for the case of too many insns.
(2) Make sure the start insn is the first insn of the searched array,
because if it is not the first, the offset may be invalid, and
it is complex to modify the offset without error.
V2: refine the search index, using J, not I
V3: remove (2); now add the offset to the pointer of start.
Passes OpenCV, conformance basic and compiler tests, and utests.
V4: check the pointer type: if 64bit, modify it by 64, otherwise by 32
V5: refine findSafeInstruction() and variable naming in
findConsecutiveAccess().
Signed-off-by: rander.wang <rander.wang@intel.com>
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
cl_intel_device_side_avc_motion_estimation.
fix build warnings.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Xionghu Luo <xionghu.luo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
cl_intel_device_side_avc_motion_estimation.
fix build warnings.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Xionghu Luo <xionghu.luo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
cl_intel_device_side_avc_motion_estimation.
fix build warnings.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Xionghu Luo <xionghu.luo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
This patch mainly contains:
1. built-in function __gen_ocl_ime implementation.
2. Lots of built-in functions of cl_intel_device_side_avc_motion_estimation
are implemented.
3. This extension is required to run in simd16 mode.
v2: move the utests to separate patches one by one;
as all the utests have an extension function check, no need to put them
in a standalone utest;
uncomment the self test;
fix extension check logic issue, should be && instead of ||.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Signed-off-by: Xionghu Luo <xionghu.luo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
built_in_prgs and built_in_kernels seem useless, so remove them.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Before releasing internal resources, they must be set to null; otherwise,
deleting these resources will call release context again.
ctx->built_in_prgs should be released by the application.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Remove the negation check for adding zero;
this optimization can also be applied in that case.
V2: refine the function name for zeroAdd
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
In this case 1.0f/src and 2.0f/src can be converted,
but 3.0f/src and i/src can't.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
When the src0 of fdiv is an immediate value and it is
exactly a power of 2, like 2.0f, 4.0f, 1.0/8.0f,
fdiv %0, imm, %1 can be converted to
rcp %0, %1
mul %0, %0, imm.
fdiv costs 8 cycles and rcp 4 cycles, so this saves at least
3 cycles.
Passes the conformance test and utests.
V2: refine negation flag
V3: modify negation by negate
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
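A minimal sketch of the condition described in the entry above, written in Python rather than in the compiler's own code. The function names are hypothetical, and Python floats are doubles rather than the 32-bit floats the hardware uses. It checks whether an immediate is an exact power of two, then shows that dividing by way of a reciprocal and a multiply gives the same result for such immediates (outside the overflow and subnormal ranges), because scaling by a power of two does not change the rounded significand.

```python
import math

def is_exact_power_of_two(imm: float) -> bool:
    """True for values such as 2.0, 4.0 or 1.0/8.0 (hypothetical helper)."""
    if imm == 0.0 or math.isinf(imm) or math.isnan(imm):
        return False
    # frexp returns a mantissa in [0.5, 1); a power of two has exactly 0.5.
    mantissa, _exponent = math.frexp(abs(imm))
    return mantissa == 0.5

def div_via_rcp(imm: float, x: float) -> float:
    """Model of 'rcp %0, %1; mul %0, %0, imm' (illustrative only)."""
    rcp = 1.0 / x
    return rcp * imm
```

For an immediate such as 3.0 the two forms are not guaranteed to agree bit for bit, which is consistent with the rewrite being limited to powers of two.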
|
|
Remove some unnecessary code, for a 20% improvement
in the worst case. If X is a NaN, there was some if-return
code to return NaN. Now change it to add(x - x), which
yields the same NaN.
Passes the conformance tests and utests.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
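The NaN handling mentioned in the entry above can be illustrated with a small Python sketch. Both function names are hypothetical and this is not the built-in's actual code; it only shows why a subtraction can stand in for an explicit NaN test.

```python
import math

def nan_term_with_branch(x: float) -> float:
    # Explicit test and early return, in the style the entry describes removing.
    if math.isnan(x):
        return x
    return 0.0

def nan_term_branchless(x: float) -> float:
    # x - x is 0.0 for any finite x and NaN when x is NaN.
    # Note: it is also NaN for +/-infinity, which the branchy form does not do.
    return x - x
```

Adding the branchless term to a result leaves finite inputs unchanged and propagates NaN without a branch.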
|
|
Now takes 40% less time than before.
(1) group the many branches that deal with corner cases into one branch.
(2) use HW exp2 and log2 to replace some instructions.
Passes conformance tests and utests.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Change the max group size to 256, which is a reasonable
size for Gen9. According to performance tests, 256 gives a
good improvement in OpenCV with no regression.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Currently merging works for the sequence load(0), load(1), load(2),
but not for load(2), load(0), load(1), because
it compares the new load only with the last merged load, not with all
the loads.
For the sequence load(0), load(1), load(2): load(0) is the
start, and load(1) is found to be its successor with no gap, so it is
put into a merge FIFO. Then the start moves to the top
of the FIFO, load(1), and is compared with load(2), so load(2) can also be merged.
For load(2), load(0), load(1): load(2) can't be merged with
load(0) because there is a gap between them, so load(0) is skipped and the
search moves to load(1). This load(1) can be merged, but the search never
goes back to merge load(0).
Now change the algorithm:
(1) find all loads around the start that may be merged, by their distance to
the start. The distance depends on the data type; for 32bit data the
distance is 4. Put them in a list.
(2) sort the list by the distance from the start.
(3) search for the continuous sequence that includes the start, and merge it.
V2: (1) refine the sort and compare algorithm. First find all the IO
at a small offset from the start, then call std::sort
(2) check the number of candidate IOs for the sake of performance,
since in most cases there is no chance to merge IO
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
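The three steps in the entry above can be sketched roughly as follows. This is an illustrative Python model, not the pass itself: loads are reduced to byte offsets, and the name find_mergeable and the max_candidates window are assumptions made for the example.

```python
def find_mergeable(start, offsets, elem_size=4, max_candidates=4):
    """Return the contiguous run of offsets that contains `start`.

    start          -- byte offset of the load the search begins from
    offsets        -- byte offsets of all loads seen, in program order
    elem_size      -- distance between neighbouring loads (4 for 32-bit data)
    max_candidates -- hypothetical search window, in elements
    """
    window = elem_size * max_candidates
    # Step 1: collect every load close enough to the start to be a candidate.
    candidates = {o for o in offsets if abs(o - start) < window}
    candidates.add(start)
    # Step 2: sort the candidates by their (signed) distance from the start.
    ordered = sorted(candidates)
    # Step 3: grow a gap-free run in both directions around the start.
    lo = hi = ordered.index(start)
    while lo > 0 and ordered[lo] - ordered[lo - 1] == elem_size:
        lo -= 1
    while hi + 1 < len(ordered) and ordered[hi + 1] - ordered[hi] == elem_size:
        hi += 1
    return ordered[lo:hi + 1]
```

With the problematic order load(2), load(0), load(1), the offsets are 8, 0, 4 and the search starts at 8. Because the candidates are sorted before the run is built, load(0) is picked up even though it is not adjacent to the start.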
|
|
There are some global immediates in the global var list of LLVM.
These imms can be folded into instructions. For the compiler_global_immediate_optimized test
in utest, there are two global immediates:
L0:
MOV(1) %42<0>:UD : 0x0:UD
MOV(1) %43<0>:UD : 0x30:UD
used by:
ADD(16) %49<1>:D : %42<0,1,0>:D %48<8,8,1>:D
ADD(16) %54<1>:D : %43<0,1,0>:D %53<8,8,1>:D
which can become
ADD(16) %49<1>:D : %48<8,8,1>:D 0x0:UD
ADD(16) %54<1>:D : %53<8,8,1>:D 0x30:UD
Then the MOVs can be removed. After this optimization, ADD 0 can be changed
to MOV, and then local copy propagation can be done.
V2: (1) add an environment variable to enable/disable the optimization
(2) refine the architecture of the imm optimization: inherit from the global
optimizer, not the local block optimizer
V3: merge with latest master driver
V4: (1) fix some type errors
(2) remove the unneeded UD/D check
(3) refine imm calculation for UD/D
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
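A toy model of the rewrite described in the entry above, in Python. The instruction encoding (op, dst, srcs) and the function name are invented for illustration, and only two-operand ADDs with a single immediate operand are handled.

```python
def fold_global_immediates(insns):
    """Fold 'MOV reg, imm' into the ADDs that use reg (illustrative sketch).

    Each instruction is (op, dst, srcs); a source is either a register
    name such as "%42" or a Python int standing for an immediate.
    """
    # Registers that are defined by a MOV of an immediate.
    imm_of = {dst: srcs[0] for op, dst, srcs in insns
              if op == "MOV" and isinstance(srcs[0], int)}
    rewritten = []
    for op, dst, srcs in insns:
        if op == "ADD":
            imms = [imm_of[s] for s in srcs if s in imm_of]
            regs = [s for s in srcs if s not in imm_of]
            if len(imms) == 1 and len(regs) == 1:
                if imms[0] == 0:
                    op, srcs = "MOV", regs       # ADD x, 0 becomes MOV x
                else:
                    srcs = regs + imms           # immediate goes last
        rewritten.append((op, dst, list(srcs)))
    # Drop the immediate MOVs that no instruction reads any more.
    used = {s for _, _, srcs in rewritten for s in srcs if isinstance(s, str)}
    return [(op, dst, srcs) for op, dst, srcs in rewritten
            if not (op == "MOV" and dst in imm_of and dst not in used)]
```

Applied to the four instructions quoted in the entry, the sketch removes both immediate MOVs, turns the ADD of 0x0 into a MOV, and leaves the other ADD with 0x30 as its immediate operand.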
|
|
There are some changes:
1. Clone the module before calling LLVMLinkModules2, and remove the other
clones made for it.
2. Don't delete the module in function llvmToGen.
3. Add a function programNewFromLLVMFile so genProgramNewFromLLVM
and buildFromLLVMModule only handle an llvm module. Actually,
programNewFromLLVMFile is only used by clCreateProgramWithLLVMIntel,
and I think it is useless; maybe we could delete it altogether.
V2: define errDiag beside #if/#endif.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
It seems we missed some newly added device IDs for SKL.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Account for internal program ctx references in cl_context_delete
Signed-off-by: Patrick Beaulieu <patrick.beaulieu@avigilon.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Source modifiers are not supported by some instructions,
so return false when one exists. This fixes the piglit %
scalar-arithmetic-int failure.
V2: (1) add hadd rhadd
(2) confirmed math functions support modifiers except IDIV/Mod
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Check the subgroup sizes supported by the device, and use
intel_reqd_sub_group_size to build kernels at these sizes. Then check whether
there is spill for each kernel.
V2: Fix memory leak
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Add CL_DEVICE_SUB_GROUP_SIZES_INTEL for clGetDeviceInfo, add
CL_KERNEL_SPILL_MEM_SIZE_INTEL for clGetKernelWorkGroupInfo and add
CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL for clGetKernelSubGroupInfo.
We only have this extension for LLVM 40+ for frontend support.
V2: Add opencl-c define
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
If we get the intel_reqd_sub_group_size attribute from the frontend, then pass
it to the backend.
V2: Refine the codeGenNum with a runtime calculation and fail the build if
the size from the frontend is illegal.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
For the following GEN IR, %41 is a kernel argument (struct).
The first LOAD will be a mov, and the second LOAD will be an indirect mov
(see lowerFunctionArguments). This hurts performance,
and even affects the correctness of register liveness for the indirect mov.
LOADI.uint64 %1114 72
ADD.int64 %78 %41 %1114
LOAD.int64.private.aligned {%79} %78 bti:255
LOADI.int64 %1115 8
ADD.int64 %1116 %78 %1115
LOAD.int64.private.aligned {%80} %1116 bti:255
This function folds the constants 72 and 8 together,
so the second LOAD will also be a direct mov.
The GEN IR then looks like:
LOADI.int64 %1115 80
ADD.int64 %1116 %41 %1115
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
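The folding in the entry above amounts to walking a chain of "base plus constant" address computations back to the kernel argument and summing the constants. A rough Python sketch follows; the dictionary encoding and the name fold_offset are hypothetical.

```python
def fold_offset(defs, reg):
    """Resolve `reg` to (base register, total constant offset).

    defs maps a register to (source register, constant) for every
    'ADD dst, src, constant' in the chain (illustrative encoding).
    """
    total = 0
    while reg in defs:
        reg, const = defs[reg][0], defs[reg][1]
        total += const
    return reg, total

# The address chain from the entry: %78 = %41 + 72, %1116 = %78 + 8.
chain = {"%78": ("%41", 72), "%1116": ("%78", 8)}
```

Resolving %1116 against this chain gives the argument %41 with offset 80, which matches the folded LOADI/ADD pair shown in the entry.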
|
|
It is similar to the 2D image case: it avoids truncation of the extended image width.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
It tests the aligned4 and aligned16 kernels for 3D images.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
1. Only copy the data by origin and region defined.
2. Add clFinish to guarantee the kernel copying is finished when blocking writing.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
1. Support writing data in mapping/unmapping mode.
2. Add mapping record logic.
3. Add clFinish to guarantee the kernel copying is finished.
4. Fix the error in calling clEnqueueMapImageByKernel:
blocking_map and map_flags need to be switched.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
large image.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
image by kernel copying.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
USE_HOST_PTR mode.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
It is used to reproduce the bug of clCopyImage/clFillImage of conformance test.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
It is used to reproduce the bug of allocations of conformance test.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
The negative Add looks like:
exp -a
LLVM transforms it to:
add x -a, 0
exp x
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
LLVM transforms Mad(a, -b, c) to
Add b, -b, 0
Mad val, a, b, c
and transforms pow(a,-b) and other builtin math functions to the same kind of instruction sequence.
Because Gen supports the negation modifier, mad(a, -b, c) is natively supported.
Treat the Add as: mov b, -b, so it is a Mov operation like in LocalCopyPropagation.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
There are some patterns like:
sqrt r1, r2;
load r4, 1.0; ===> rqrt r3, r2
div r3, r4, r1;
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
Found a missing macro that needs to change to support LLVM40+.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Remove some corner-case checks, because those paths cannot be
reached, and convert branch code to select. These improvements
give a 20% performance gain, and the performance of OCL_ExpFixture_Exp
in OpenCV can now match that of the other Gen driver.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
The OCL_Magnitude test of OpenCV is slow on beignet because
of hypot. Refine hypot: change the algorithm and remove
unnecessary code for a 30% speedup.
Signed-off-by: rander.wang <rander.wang@intel.com>
Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
|
|
"imagedim_non_pow_2" cases of basic modudle of confrmance shows
regression after the previous patch started using TILE_Y mode for large images.
This bug comes from the non-align16 kernel of clEnqueueCopyBufferToImage
and clEnqueueCopyImageToBuffer.
It forces the CL_RGBA/CL_UNORM_INT8/8191x8192 image of the conformance test
to a CL_R/CL_UNSIGNED_INT8/32764x8192 image for copying.
That makes the width 8191 x 4 = 32764, which exceeds the maximum
width (16 x 1024 = 16384) of the GEN surface state structure, which only has 14 bits.
So use the align4 copy kernel to avoid this bug.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
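The width arithmetic in the entry above can be checked with a few lines of Python. The helper name is hypothetical, and the assumption that the align4 kernel copies 4-byte elements (so the width stays at 8191) is an interpretation of the entry, not something it states.

```python
MAX_SURFACE_WIDTH = 1 << 14  # 14-bit width field: 16 x 1024 = 16384

def copy_width(image_width: int, bytes_per_pixel: int, bytes_per_element: int) -> int:
    """Width the copy kernel sees when the image is reinterpreted (sketch)."""
    return image_width * bytes_per_pixel // bytes_per_element

# CL_RGBA/CL_UNORM_INT8 is 4 bytes per pixel.
byte_wise = copy_width(8191, 4, 1)   # reinterpreted as CL_R/CL_UNSIGNED_INT8
four_byte = copy_width(8191, 4, 4)   # assumed behaviour of the align4 kernel
```

Here byte_wise exceeds MAX_SURFACE_WIDTH while four_byte stays within it.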
|
|
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
|