summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-07-27backend: Fix a bug in load-store optimization.Song, Ruiling1-25/+46
when we are merging STOREs, we should use the very last instruction as the insertion point. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-27Runtime: fix a cl_gpgpu_bind_image_for_vme NULL SIGSEGV.Yang, Rong R1-1/+2
Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-27cmake: add option OCL_ICD_INSTALL_PREFIX to set icd file install path.Yang, Rong R1-12/+15
It is for the user who don't has root permission. V2: change to option name to OCL_ICD_INSTALL_PREFIX. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-20backend: refine global immediate optimizationrander1-4/+0
for ABS(UD) = UD on Gen, so delete it, or it make compilation failed on some platform Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-20Fix GCC6 build bugPan Xiuli1-0/+1
GCC6 refine the c headers and need to add the needed function header, like the abs in math.h. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-18backend: refine load store optimizationSong, Ruiling1-37/+88
this fix basic test in conformance tests failed for vec8 of char because of overflow. And it fix many test items failed in opencv because of offset error (1)modify the size of searchInsnArray to 32, it is the max size for char And add check for overflow if too many insn (2)Make sure the start insn is the first insn of searched array because if it is not the first, the offset maybe invalid. And it is complex to modify offset without error V2: refine search index, using J not I V3: remove (2), now add offset to the pointer of start pass OpenCV, conformance basic and compiler tests, utests V4: check pointer type, if 64bit, modify it by 64, or 32 V5: refine findSafeInstruction() and variable naming in findConsecutiveAccess(). Signed-off-by: rander.wang <rander.wang@intel.com> Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-12add utest compiler_block_motion_estimate_intel for extension ↵Luo Xionghu3-0/+224
cl_intel_device_side_avc_motion_estimation. fix build warnings. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Signed-off-by: Xionghu Luo <xionghu.luo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-12add utest compiler_skip_check for extension ↵Luo Xionghu3-0/+244
cl_intel_device_side_avc_motion_estimation. fix build warnings. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Signed-off-by: Xionghu Luo <xionghu.luo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-12add utest compiler_intra_prediction for extenstion ↵Luo Xionghu3-1/+211
cl_intel_device_side_avc_motion_estimation. fix build warnings. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Signed-off-by: Xionghu Luo <xionghu.luo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-12Implement extension cl_intel_device_side_avc_motion_estimation.Chuanbo Weng32-36/+2282
This patch mainly contains: 1. built-in function __gen_ocl_ime implementation. 2. Lots of built-in functions of cl_intel_device_side_avc_motion_estimation are implemented. 3. This extension is required to run in simd16 mode. v2: move the utests to seprate patches one by one; as all the utests has extension function check, no need to put them in stand alone utest; uncomment the self test; fix extension check logic issue, should be && instead of ||. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Signed-off-by: Xionghu Luo <xionghu.luo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-10Runtime: remove ctx's useless fileds.Yang, Rong R3-43/+5
built_in_prgs and built_in_kernels seems useless, remove them. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-10Utest: fix a build-in program leak.Yang, Rong R1-0/+1
Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-10Runtime: fix a recurrent release context error.Yang, Rong R1-10/+8
Before release internal resources, must set them to null, otherwize, when delete these resources, will call release context again. The ctx->built_in_prgs should be release by application. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-06backend: improve add zero patternrander1-5/+5
remove the negation check for adding zero. it also can be applied this optimization V2: refine the function name for zeroAdd Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-06utests: add utest for fdiv to rcprander3-1/+71
for this case 1.0f/src, 2.0f/src can be converted, but 3.0f/src and i/src cant Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-06backend: refine fdiv to rcp at some casesrander1-0/+28
when the src0 of fdiv is a immedia value and it is exactly pow of 2, like 2.0f, 4.0f, 1.0/8.0f, fdiv %0, imm, %1 can be convert to rcp %0, %1 mul %0, %0, imm. for fdiv cost 8cycle, rcp 4cycle. it will save at least 3cycle. pass the conformance test and utests V2: refine negation flag V3: modify negation by negate Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-04backend: refine math log functionrander1-40/+10
remove a few unnecessary codes , and get 20% improvement at worse case. If X is a NAN, there are some if-return codes to return NAN. Now change it to add(x - x) which get the same NAN pass the conformance tests and utests Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-04backend: refine pow functionrander1-146/+148
Now save 40% time than before (1) group many branches which deal with corner case to one branch. (2) using HW exp2 and log2 to replace some instructions pass conformance tests and utest Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-04Runtime: refine max group size for SKL & KBLrander1-9/+9
Now change max group size to 256. it is a reasonable size for Gen9. According to performance test, 256 make good progress in openCV and no regression. So change it Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-23backend: refine load/store merging algorithmrander1-9/+78
Now it works for sequence: load(0), load(1), load(2) but it cant work for load(2), load(0), load(1). because it compared the last merged load and the new one not all the loads for sequence: load(0), load(1), load(2). the load(0) is the start, can find that load(1) is successor without space, so put it to a merge fifo. then the start is moving to the top of fifo load(1), and compared with load(2). Also load(2) can be merged for load(2), load(0), load(1). load(2) cant be merged with load(0) for a space between them. So skip load(0) and mov to next load(1).And this load(1) can be merged. But it never go back merge load(0) Now change the algorithm. (1) find all loads maybe merged arround the start by the distance to the start. the distance is depended on data type, for 32bit data, the distance is 4. Put them in a list (2) sort the list by the distance from the start. (3) search the continuous sequence including the start to merge V2: (1)refine the sort and compare algoritm. First find all the IO in small offset compared to start. Then call std:sort (2)check the number of candidate IO to be favorable to performance for most cases there is no chance to merge IO Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-06-23backend: add global immediate optimizationrander1-25/+342
there are some global immediates in global var list of LLVM. these imm can be integrated in instructions. for compiler_global_immediate_optimized test in utest, there are two global immediates: L0: MOV(1) %42<0>:UD : 0x0:UD MOV(1) %43<0>:UD : 0x30:UD used by: ADD(16) %49<1>:D : %42<0,1,0>:D %48<8,8,1>:D ADD(16) %54<1>:D : %43<0,1,0>:D %53<8,8,1>:D it can be ADD(16) %49<1>:D : %48<8,8,1>:D 0x0:UD ADD(16) %54<1>:D : %53<8,8,1>:D 0x30:UD Then the MOV can be removed. And after this optimization, ADD 0 can be change to MOV, then local copy propagation can be done. V2: (1) add environment variable to enable/disable the optimization (2) refine the architecture of imm optimization, inherit from global optimizer not local block optimizer V3: merge with latest master driver V4: (1)refine some type errors (2)remove UD/D check for no need (3)refine imm calculate for UD/D Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-06-23GBE: clean llvm module's clone and release.Yang, Rong R10-58/+76
There are some changes: 1. Clone the module before call LLVMLinkModules2, remove other clones for it. 2. Don't delete module in function llvmToGen. 3. Add a function programNewFromLLVMFile so genProgramNewFromLLVM and buildFromLLVMModule only handle llvm module. Actually, programNewFromLLVMFile is only used by clCreateProgramWithLLVMIntel, and I think it is useless, maybe we could delete it at all. V2: define errDiag beside #if/#endif. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-22Add missed kernel names into built-in kernel list.Yan Wang1-1/+16
Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-22Runtime: Add missing SKL deivce IDPan Xiuli2-1/+9
It seems we missed some newly added device ID for SKL. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16Fix context leak with internal kernelsPatrick Beaulieu1-1/+21
Account for internal program ctx references in cl_context_delete Signed-off-by: Patrick Beaulieu <patrick.beaulieu@avigilon.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16backend: refine the local copy propagation.rander.wang1-0/+34
src modifier is not supported by some instructions. so return false when it exists. This fix piglit % scalar-arithmetic-int failed V2: (1)add hadd rhadd (2)confirmed math functions support midifer except IDIV/Mod Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16Utset: Add test case for cl_intel_required_subgroup_size extensionPan Xiuli5-0/+75
Check the device supported subgroup sizes, and use intel_reqd_sub_group_size to build kernels in these size. Then check if there is spill for each kernel. V2: Fix memory leak Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16Runtime: Add new API enums for cl_intel_required_subgroup_size extensionPan Xiuli7-0/+50
Add CL_DEVICE_SUB_GROUP_SIZES_INTEL for clGetDeviceInfo, add CL_KERNEL_SPILL_MEM_SIZE_INTEL for clGetKernelWorkGroupInfo and add CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL for clGetKernelSubGroupInfo. We only have this extension for LLVM 40+ for frontend support. V2: Add opencl-c define Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16Backend: Add intel_reqd_sub_group_size supportPan Xiuli3-13/+45
If we get intel_reqd_sub_group_size attribute from frontend then set it to backend. V2: Refine the codeGenNum with runtime caclculate and fail the build if the size from frontend is illegal. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16do constant folding for kernel struct argsGuo Yejun6-0/+213
for the following GEN IR, %41 is kernel argument (struct) the first LOAD will be mov, and the second LOAD will be indirect move (see lowerFunctionArguments). It hurts performance, and even impacts the correctness of reg liveness of indriect mov LOADI.uint64 %1114 72 ADD.int64 %78 %41 %1114 LOAD.int64.private.aligned {%79} %78 bti:255 LOADI.int64 %1115 8 ADD.int64 %1116 %78 %1115 LOAD.int64.private.aligned {%80} %1116 bti:255 this function folds the constants of 72 and 8 together, and so it will be direct mov. the GEN IR looks like: LOADI.int64 %1115 80 ADD.int64 %1116 %41 %1115 Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-14Use aligned16 and aligne4 kernel to copy for large 3D image with TILE_Y.Yan Wang7-37/+149
It is similar with 2D image for avoiding extended image width truncated. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-14Add test case for large 3D image with TILE_Y.Yan Wang1-0/+98
It will test aligned4 and aligned16 kernel for 3D image. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Optimize clEnqueueWriteImageByKernel and clEnqueuReadImageByKernel.Yan Wang1-7/+18
1. Only copy the data by origin and region defined. 2. Add clFinish to guarantee the kernel copying is finished when blocking writing. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Fix bug of clEnqueueUnmapMemObjectForKernel and clEnqueueMapImageByKernel.Yan Wang1-34/+113
1. Support wrrting data by mapping/unmapping mode. 2. Add mapping record logic. 3. Add clFinish to guarantee the kernel copying is finished. 4. Fix the error of calling clEnqueueMapImageByKernel. blocking_map and map_flags need be switched. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Add clFinish for guarantee the kernel copying is finished when create TILE_Y ↵Yan Wang1-0/+7
large image. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Add cl_mem_record_map_mem_for_kernel() for record map adress for TILE_Y ↵Yan Wang2-26/+88
image by kernel copying. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Add utest to test writing data into large image (TILE_Y) by map/unmap and ↵Yan Wang1-0/+115
USE_HOST_PTR mode. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Add utest to test writing data into large image (TILE_Y) by map/unmap mode.Yan Wang1-0/+198
It is used to reproduce the bug of clCopyImage/clFillImage of conformance test. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-13Add utest case for filling image by small region.Yan Wang1-0/+50
It is used to reproduce the bug of allocations of conformance test. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-09utests: added for optimization negativeAddrander3-1/+46
the negative Add is like: exp -a llvm transfer it to: add x -a, 0 exp x Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09Backend: Add optimization for negative modifierrander1-4/+28
LLVM transform Mad(a, -b, c) to Add b, -b, 0 Mad val, a, b, c pow(a,-b) and other buildin math function to the same instruction sequence like above for Gen support negtive modifier, mad(a, -b, c) is native suppoted. Do it just like a: mov b, -b, so it is a Mov operation like LocalCopyPropagation Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09utests: add utest for sqrt-div optimizationrander3-1/+71
Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-09backend: add sqrt-div pattern to instruction selectrander1-0/+69
there some patterns like: sqrt r1, r2; load r4, 1.0; ===> rqrt r3, r2 div r3, r4, r1; Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09Runtime: Fix a mssing llvm version marco for LLVM40+Pan Xiuli1-1/+1
Found a missing macro that need change to support LLVM40+. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-09keep GEN IR as SSA styleGuo Yejun1-3/+5
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-09backend: refine exp function with float inputrander1-2/+56
remove some corner cases check for these path can not be reached.And refine branch code to select. These improvements get 20% performance. and the performance of OCL_ExpFixture_Exp in opencv can match up to other Gen driver Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09backend: refine hypot functionrander1-14/+60
the test OCL_Magnitude of opencv is slow on beignet because of hypot. refine the hypot, change algorithm and remove unnecessary code to get 30% up Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-05-25Fix bug of clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer.Yan Wang5-28/+89
"imagedim_non_pow_2" cases of basic modudle of confrmance shows regression after use TILE_Y mode for large image by previous patch. This bug comes from the non-align16 kernel of clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer. It will force CL_RGBA/CL_UNORM_INT8/8191x8192 image of conformance test to CL_R/CL_UNSIGNED_INT8/32764x8192 image for copying. So it makes width as 8191 x 4 = 32764 and its width will exceed the maximum width (16 x 1024 = 16384) of GEN surface state structure which only has 14 bits. So use align4 copy kernel to avoid this bug. Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
2017-05-25Add utest to reproduce the bug of imagedim_non_pow_2 cases of conformance test.Yan Wang1-0/+46
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
2017-05-25build: fix cmake code generation dependencies.Ismo Puustinen1-2/+2
There is a race condition between building .bc and header files and generating code from .cl targets. Fix the race by adding the dependency to generated files. Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>