beignet - Beignet OpenCL Library for Intel Ivy Bridge and newer GPUs (mirrored from https://gitlab.freedesktop.org/beignet/beignet)

Age	Commit message	Author	Files	Lines
2017-09-21	backend: use simd-1 for scalar dst in indirectMov.	Song, Ruiling	1	-14/+24
	This fixes a failure introduced by the load-store optimization on IVB. The test case is builtin_kernel_block_motion_estimate_intel. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-09-21	GBE: remove static context to fix Segmentation fault.	Yang Rong	4	-33/+39
	If an application holds a static clProgram, the static context is deleted at exit before the static clProgram is, which causes a segmentation fault. Since the global static context is only used for linking, use the individual context of each LLVM module instead; when linking an LLVM module, generate the new LLVM module from src. V2: fix LLVM 3.8 build error and CleanLlvmResource delete bug. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-09-21	GBE: enable llvm5.0 support.	Yang Rong	6	-33/+87
	1. getOrInsertFunction without nullptr. 2. handle f16 rounding. 3. remove llvm value dump. 4. handle AddrSpaceCastInst when parsing block info. V2: use stripPointerCasts instead of BitCast and AddrSpaceCast. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-09-21	libocl: enable llvm5.0 support.	Yang Rong	3	-32/+59
	There are 2 changes: 1. enable cl_khr_3d_image_writes, llvm5.0 required. 2. change enqueue_ndrange functions and ndrange_t type for llvm5.0. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-09-21	libocl: Consider only bottom ilogb(2m-1)+1 bits	Jan Vesely	1	-30/+30
	Signed-off-by: Jan Vesely <jano.vesely@gmail.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-09-21	libocl: Add shuffle and shuffle2 builtins for half type	Jan Vesely	2	-0/+4
	Signed-off-by: Jan Vesely <jano.vesely@gmail.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-27	GBE: fix a errMsg uninitialized build warning.	Yang, Rong R	1	-3/+3
	Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-07-27	backend: Fix a bug in load-store optimization.	Song, Ruiling	1	-25/+46
	When merging STOREs, we should use the very last instruction as the insertion point. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-20	backend: refine global immediate optimization	rander	1	-4/+0
	ABS(UD) = UD on Gen, so delete it; otherwise compilation fails on some platforms. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-20	Fix GCC6 build bug	Pan Xiuli	1	-0/+1
	GCC 6 refined the C headers, so the header of each needed function must be added explicitly, e.g. math.h for abs. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-18	backend: refine load store optimization	Song, Ruiling	1	-37/+88
	This fixes the basic test in the conformance tests, which failed for vec8 of char because of overflow. It also fixes many test items that failed in OpenCV because of an offset error. (1) Change the size of searchInsnArray to 32, the maximum size for char, and add an overflow check for the case of too many insns. (2) Make sure the start insn is the first insn of the searched array, because if it is not the first, the offset may be invalid, and it is complex to modify the offset without error. V2: refine search index, using J not I. V3: remove (2); now add the offset to the pointer of the start. Passes OpenCV, conformance basic and compiler tests, and utests. V4: check the pointer type; if 64bit, modify it by 64, otherwise 32. V5: refine findSafeInstruction() and variable naming in findConsecutiveAccess(). Signed-off-by: rander.wang <rander.wang@intel.com> Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-12	Implement extension cl_intel_device_side_avc_motion_estimation.	Chuanbo Weng	24	-33/+2113
	This patch mainly contains: 1. built-in function __gen_ocl_ime implementation. 2. Lots of built-in functions of cl_intel_device_side_avc_motion_estimation are implemented. 3. This extension is required to run in simd16 mode. v2: move the utests to separate patches one by one; as all the utests have an extension function check, there is no need to put them in a standalone utest; uncomment the self test; fix extension check logic issue, should be && instead of ||. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Signed-off-by: Xionghu Luo <xionghu.luo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-06	backend: improve add zero pattern	rander	1	-5/+5
	Remove the negation check for adding zero; this optimization can also be applied in that case. V2: refine the function name for zeroAdd Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-07-06	backend: refine fdiv to rcp at some cases	rander	1	-0/+28
	When src0 of fdiv is an immediate value that is exactly a power of 2, such as 2.0f, 4.0f or 1.0/8.0f, fdiv %0, imm, %1 can be converted to rcp %0, %1; mul %0, %0, imm. fdiv costs 8 cycles and rcp 4 cycles, so this saves at least 3 cycles. Passes the conformance tests and utests. V2: refine negation flag V3: modify negation by negate Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
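The rewrite described in this commit message can be illustrated on the host with a minimal C++ sketch. It is not the backend code: the function names are hypothetical, and an ordinary division 1.0f / x stands in for the hardware rcp instruction. The sketch shows how an immediate can be tested for being an exact power of two and how the rcp-then-mul form relates to the direct division.

```cpp
#include <cassert>
#include <cmath>

// Returns true when |imm| is an exact power of two (2.0f, 4.0f, 0.125f, ...).
// std::frexp splits imm into mantissa * 2^exp with |mantissa| in [0.5, 1);
// a power of two has a mantissa of exactly +/-0.5.
bool isExactPowerOfTwo(float imm) {
  if (imm == 0.0f || !std::isfinite(imm)) return false;
  int exp = 0;
  const float mantissa = std::frexp(imm, &exp);
  return std::fabs(mantissa) == 0.5f;
}

// Reference form: fdiv %0, imm, %1
float divideDirect(float imm, float x) { return imm / x; }

// Rewritten form: rcp %0, %1 followed by mul %0, %0, imm.
// Assumption: 1.0f / x models the hardware reciprocal.
float divideViaRcp(float imm, float x) {
  assert(isExactPowerOfTwo(imm));  // the rewrite is only applied in this case
  const float r = 1.0f / x;        // rcp
  return r * imm;                  // mul by the power-of-two immediate
}
```

Multiplying by a power of two only changes the exponent, so (barring overflow or underflow) the mul step adds no rounding error of its own, which is presumably why the pattern is restricted to such immediates.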
2017-07-04	backend: refine math log function	rander	1	-40/+10
	Remove some unnecessary code, giving a 20% improvement in the worst case. If X is a NaN, there was some if-return code to return NaN. Now change it to add (x - x), which yields the same NaN. Passes the conformance tests and utests. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
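The branch-free NaN handling mentioned here relies on the IEEE 754 rule that x - x is 0 for finite x and NaN when x is NaN. The following C++ sketch is an illustration only, with hypothetical function names; it assumes x is finite or NaN (infinities would need separate handling, since inf - inf is also NaN) and that the code is not compiled with fast-math style flags.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Branchy form: an explicit NaN check with an early return.
float withBranch(float result, float x) {
  if (std::isnan(x)) return std::numeric_limits<float>::quiet_NaN();
  return result;
}

// Branch-free form: adding (x - x) leaves result unchanged for finite x
// and turns it into NaN when x is NaN.
float withoutBranch(float result, float x) {
  assert(!std::isinf(x));  // assumption of this sketch: x is finite or NaN
  return result + (x - x);
}
```

One caveat of the sketch: a result of -0.0f becomes +0.0f after adding +0.0f, so a real implementation has to make sure that case cannot matter.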
2017-07-04	backend: refine pow function	rander	1	-146/+148
	This now takes 40% less time than before. (1) Group the many branches that deal with corner cases into one branch. (2) Use HW exp2 and log2 to replace some instructions. Passes conformance tests and utests. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-23	backend: refine load/store merging algorithm	rander	1	-9/+78
	Previously merging worked for the sequence load(0), load(1), load(2) but not for load(2), load(0), load(1), because it compared only the last merged load with the new one, not all the loads. For the sequence load(0), load(1), load(2): load(0) is the start, and load(1) is found to be its successor with no gap, so it is put in a merge FIFO; the start then moves to the top of the FIFO, load(1), and is compared with load(2), which can also be merged. For load(2), load(0), load(1): load(2) cannot be merged with load(0) because there is a gap between them, so load(0) is skipped and the search moves to the next load, load(1), which can be merged. But it never goes back to merge load(0). Now the algorithm is changed: (1) find all loads around the start that may be merged, by their distance to the start; the distance depends on the data type (for 32bit data the distance is 4); put them in a list (2) sort the list by the distance from the start (3) search for the continuous sequence including the start to merge. V2: (1) refine the sort and compare algorithm: first find all the IO at a small offset from the start, then call std::sort (2) check the number of candidate IOs to favor performance, since in most cases there is no chance to merge IO Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
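The three steps of the new algorithm (collect nearby candidates, sort, search the continuous run containing the start) can be sketched in C++ as follows. This is a simplified illustration, not the pass itself: Access, findMergeableRun, and the maxElems window parameter are hypothetical, and a load is reduced to its byte offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a load/store: only its byte offset matters here.
struct Access {
  std::int64_t offset;
};

// Returns the sorted offsets of the contiguous run of accesses that contains
// accesses[startIndex], independent of the order in which they appear.
std::vector<std::int64_t> findMergeableRun(const std::vector<Access>& accesses,
                                           std::size_t startIndex,
                                           std::int64_t elemSize,
                                           std::int64_t maxElems) {
  assert(startIndex < accesses.size() && elemSize > 0 && maxElems > 0);
  const std::int64_t start = accesses[startIndex].offset;
  const std::int64_t window = elemSize * maxElems;

  // (1) Collect every access close enough to the start to be mergeable.
  std::vector<std::int64_t> candidates;
  for (const Access& a : accesses) {
    const std::int64_t distance = a.offset - start;
    if (distance > -window && distance < window && distance % elemSize == 0)
      candidates.push_back(a.offset);
  }

  // (2) Sort by offset, i.e. by signed distance from the start.
  std::sort(candidates.begin(), candidates.end());
  candidates.erase(std::unique(candidates.begin(), candidates.end()),
                   candidates.end());

  // (3) Grow the gap-free run around the start in both directions.
  auto lo = std::find(candidates.begin(), candidates.end(), start);
  auto hi = lo;
  while (lo != candidates.begin() && *(lo - 1) == *lo - elemSize) --lo;
  while (hi + 1 != candidates.end() && *(hi + 1) == *hi + elemSize) ++hi;
  return std::vector<std::int64_t>(lo, hi + 1);
}
```

With 32-bit elements, the order load(2), load(0), load(1) corresponds to offsets 8, 0, 4; starting from the first access, the sketch returns the run 0, 4, 8, which is the case the older FIFO-based approach missed.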
2017-06-23	backend: add global immediate optimization	rander	1	-25/+342
	There are some global immediates in the global variable list of LLVM. These immediates can be integrated into instructions. For the compiler_global_immediate_optimized test in utest, there are two global immediates: L0: MOV(1) %42<0>:UD : 0x0:UD MOV(1) %43<0>:UD : 0x30:UD used by: ADD(16) %49<1>:D : %42<0,1,0>:D %48<8,8,1>:D ADD(16) %54<1>:D : %43<0,1,0>:D %53<8,8,1>:D These can become: ADD(16) %49<1>:D : %48<8,8,1>:D 0x0:UD ADD(16) %54<1>:D : %53<8,8,1>:D 0x30:UD Then the MOVs can be removed. After this optimization, ADD 0 can be changed to MOV, and then local copy propagation can be done. V2: (1) add an environment variable to enable/disable the optimization (2) refine the architecture of the imm optimization to inherit from the global optimizer, not the local block optimizer V3: merge with latest master driver V4: (1) refine some type errors (2) remove the unneeded UD/D check (3) refine imm calculation for UD/D Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2017-06-23	GBE: clean llvm module's clone and release.	Yang, Rong R	7	-57/+69
	There are some changes: 1. Clone the module before calling LLVMLinkModules2, and remove the other clones for it. 2. Don't delete the module in function llvmToGen. 3. Add a function programNewFromLLVMFile so that genProgramNewFromLLVM and buildFromLLVMModule only handle LLVM modules. Actually, programNewFromLLVMFile is only used by clCreateProgramWithLLVMIntel, and I think it is useless; maybe we could delete it altogether. V2: define errDiag beside #if/#endif. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-16	backend: refine the local copy propagation.	rander.wang	1	-0/+34
	A src modifier is not supported by some instructions, so return false when one exists. This fixes the piglit % scalar-arithmetic-int failure. V2: (1) add hadd and rhadd (2) confirmed that math functions support the modifier, except IDIV/Mod Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16	Runtime: Add new API enums for cl_intel_required_subgroup_size extension	Pan Xiuli	1	-0/+4
	Add CL_DEVICE_SUB_GROUP_SIZES_INTEL for clGetDeviceInfo, add CL_KERNEL_SPILL_MEM_SIZE_INTEL for clGetKernelWorkGroupInfo and add CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL for clGetKernelSubGroupInfo. We only have this extension for LLVM 40+ for frontend support. V2: Add opencl-c define Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16	Backend: Add intel_reqd_sub_group_size support	Pan Xiuli	3	-13/+45
	If we get the intel_reqd_sub_group_size attribute from the frontend, then set it in the backend. V2: Refine the codeGenNum with runtime calculation and fail the build if the size from the frontend is illegal. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-16	do constant folding for kernel struct args	Guo Yejun	6	-0/+213
	For the following GEN IR, %41 is a kernel argument (struct). The first LOAD will be a mov, and the second LOAD will be an indirect move (see lowerFunctionArguments). This hurts performance, and even impacts the correctness of register liveness for the indirect mov. LOADI.uint64 %1114 72 ADD.int64 %78 %41 %1114 LOAD.int64.private.aligned {%79} %78 bti:255 LOADI.int64 %1115 8 ADD.int64 %1116 %78 %1115 LOAD.int64.private.aligned {%80} %1116 bti:255 This function folds the constants 72 and 8 together, so the second LOAD becomes a direct mov. The GEN IR then looks like: LOADI.int64 %1115 80 ADD.int64 %1116 %41 %1115 Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
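The folding of 72 and 8 into a single offset of 80 from the kernel argument can be sketched as a walk over add-immediate definitions. The C++ below is a hypothetical mini-IR for illustration, not the GEN IR classes; AddImm, Address, and foldAddress are invented names, and registers are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical mini-IR: a register defined as `src + imm`.
struct AddImm {
  int src;
  std::int64_t imm;
};

// A register expressed as a root register plus one folded constant offset.
struct Address {
  int base;
  std::int64_t offset;
};

// Follows the chain of add-immediate definitions back to the root register
// (here, the kernel argument) and sums the immediates along the way.
Address foldAddress(const std::map<int, AddImm>& defs, int reg) {
  std::int64_t total = 0;
  std::size_t steps = 0;
  for (auto it = defs.find(reg); it != defs.end(); it = defs.find(reg)) {
    ++steps;
    assert(steps <= defs.size());  // SSA form implies the chain has no cycle
    total += it->second.imm;
    reg = it->second.src;
  }
  return Address{reg, total};
}
```

For the IR in the commit message, %78 = %41 + 72 and %1116 = %78 + 8, so folding %1116 gives base %41 with offset 80, matching the rewritten ADD.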
2017-06-09	Backend: Add optimization for negative modifier	rander	1	-4/+28
	LLVM transforms Mad(a, -b, c) into: Add b, -b, 0; Mad val, a, b, c. pow(a, -b) and other built-in math functions are transformed into the same kind of instruction sequence. Since Gen supports the negative modifier, mad(a, -b, c) is natively supported. Treat the Add like a mov b, -b, so that it is a Mov operation as in LocalCopyPropagation. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09	backend: add sqrt-div pattern to instruction select	rander	1	-0/+69
	There are some patterns like: sqrt r1, r2; load r4, 1.0; div r3, r4, r1; ===> rqrt r3, r2 Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
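The sqrt-div pattern, 1.0 divided by a square root, can be matched on a tiny instruction list as sketched below. This is an illustrative C++ model with hypothetical types (Op, Insn, selectRsqrt), not the instruction selection code of the backend; the reciprocal square root instruction is called Rsqrt here.

```cpp
#include <cassert>
#include <vector>

enum class Op { LoadImm, Sqrt, Div, Rsqrt };

// Hypothetical instruction: unused fields are -1 (registers) or 0.0f (imm).
struct Insn {
  Op op;
  int dst;
  int src0;
  int src1;
  float imm;
};

// Returns the index of the instruction that defines reg, or -1 if none does.
int findDef(const std::vector<Insn>& code, int reg) {
  for (int i = 0; i < static_cast<int>(code.size()); ++i)
    if (code[i].dst == reg) return i;
  return -1;
}

// Rewrites `div d, one, s` into `rsqrt d, x` when one is the immediate 1.0
// and s = sqrt(x). Returns the number of rewritten instructions.
int selectRsqrt(std::vector<Insn>& code) {
  int rewritten = 0;
  for (Insn& insn : code) {
    if (insn.op != Op::Div) continue;
    assert(insn.src0 >= 0 && insn.src1 >= 0);  // a Div has two register sources
    const int num = findDef(code, insn.src0);
    const int den = findDef(code, insn.src1);
    if (num < 0 || den < 0) continue;
    if (code[num].op != Op::LoadImm || code[num].imm != 1.0f) continue;
    if (code[den].op != Op::Sqrt) continue;
    insn = Insn{Op::Rsqrt, insn.dst, code[den].src0, -1, 0.0f};
    ++rewritten;
  }
  return rewritten;
}
```

The sketch leaves the now-unused sqrt and load in place; removing them is assumed to be the job of a later dead-code elimination step.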
2017-06-09	keep GEN IR as SSA style	Guo Yejun	1	-3/+5
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-06-09	backend: refine exp function with float input	rander	1	-2/+56
	Remove some corner-case checks, since these paths cannot be reached, and convert branch code to select. These improvements give a 20% performance gain, and the performance of OCL_ExpFixture_Exp in OpenCV can match that of other Gen drivers. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-06-09	backend: refine hypot function	rander	1	-14/+60
	The OCL_Magnitude test of OpenCV is slow on beignet because of hypot. Refine hypot: change the algorithm and remove unnecessary code to get a 30% speedup. Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-05-23	refresh DAG when an arg has both direct and indirect read	Guo Yejun	1	-1/+16
	When the return value is ARG_INDIRECT_READ, it is still possible that some IRs read the argument directly; these are handled in buildConstantPush(), so we need to refresh the DAG after buildConstantPush. Another method is to update the DAG accordingly, but I don't think that is easy compared with the refresh method, so I did not choose it. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-05-23	Backend: Add sel ir output for MATH function	Pan Xiuli	1	-0/+42
	Previously we only output MATH for the function; now we can tell which math function it is. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-05-23	backend: fix tgamma error after restructure	rander	1	-25/+31
	Signed-off-by: rander.wang <rander.wang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-05-18	Backend: Fix performance regression with sampler refine fro LLVM40	Pan Xiuli	2	-9/+41
	After the refine we cannot know whether a sampler is constant-initialized or not. The compiler optimization for constant samplers then breaks, and we decide at runtime which SAMPLE instruction to use. Now fix the sampler refine for LLVM40 to enable the constant check. V2: Fix a typo in the type of function __gen_ocl_sampler_to_int. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Yang Rong <rong.r.yang@intel.com>
2017-05-18	Backend: Fix llvm40 assert about literal structs	Pan Xiuli	1	-1/+2
	In llvm literal structs have no name, so check it first. Signed-off-by: Pan Xiuli <xiuli.pan@intel.com> Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
2017-05-17	backend: refine asin function	rander.wang	1	-21/+7
	refine the algorithm to remove unnecessary operations Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine atan	rander.wang	1	-53/+58
	remove private array and convert if to select Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine acos	rander.wang	1	-4/+9
	refine algorithm to remove branch Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine sincos	rander.wang	2	-13/+277
	Remove redundant operations to get more performance Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine tan function	rander.wang	1	-16/+45
	get it from crlibm and refine it for gen Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine cos function	rander.wang	1	-26/+25
	do it like sin function Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine sin function	rander.wang	1	-20/+22
	(1)refine the NAN check (2)using sqrt to get cos (3)remove small range check Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine the argue reduce	rander.wang	1	-24/+14
	using a simple algorithm to get it Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine pow function	rander.wang	1	-134/+141
	remove private array and some unnecessary if check. convert some if to select. improve about 50% performance Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-17	backend: refine the code structure of math	rander.wang	9	-7538/+4073
	Move all the common math functions to match_common.cl, which makes them easier to maintain. Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-15	GLK: add geminilake backend support.	Yang Rong	5	-2/+47
	Geminilake's backend is the same as bxt. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-05-15	GBE: set memcpy and memset functions's linkage to LinkOnceAnyLinkage at last call.	Yang, Rong R	3	-7/+14
	LLVM IR passes will produce memcpy and memset; if LinkOnceAnyLinkage is set earlier, memcpy and memset are deleted beforehand, which causes a failure. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Pan Xiuli <xiuli.pan@intel.com>
2017-05-04	backend: refine normalize function to pass cft	rander.wang	1	-16/+33
	Remove the call to length in order to (1) make it faster and (2) pass with parameters such as 0x1.xxxp1023. Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-04	backend: refine powr to pass cft	rander	1	-77/+25
	refine corner case according to spec Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-04	backend: refine modf to pass the cft	rander	2	-8/+84
	Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-04	backend: refine min|max mag	rander	1	-20/+6
	do it according to spec to make cft pass Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>
2017-05-04	backend: refine remquo to pass cft	rander	2	-65/+396
	do the thing according to spec Signed-off-by: rander.wang <rander.wang@intel.com> Tested-by: Yang Rong <rong.r.yang@intel.com>