~hejunyan/beignet - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2015-02-04	Fix a bug of 1d image array test case.	Junyan He	1	-6/+8
	Because of a HW limitation, the vertical stride is aligned to at least 2. For a 1D image array, the data therefore has gaps, and the image is twice as big as the buffer size we assumed. Using clEnqueueWriteImage is safe and fixes this bug. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-28	check the predication in case of endless loop.	Luo	1	-0/+5
	v2: Add comment from ruiling: for a dead loop, it has an unconditional branch at its end. Simply not treating it as a loop is also acceptable. I ran into this problem when executing ./opencv_test_imgproc --gtest_filter=OCL_Imgproc/HoughLines.RealImage/0, and this patch fixes the problem. Signed-off-by: Luo <xionghu.luo@intel.com> Reviewed-by: Ruiling Song <ruiling.song@intel.com>
2015-01-26	GBE: add GEN_TYPE_HF to getTypeSize.	Zhigang Gong	1	-0/+1
	Gen8 uses GEN_TYPE_HF, so getTypeSize needs to support it. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: Zhu Bingbing <bingbingx.zhu@intel.com>
2015-01-26	Fix bug for bitcast test case because of long type.	Junyan He	1	-5/+5
	ulong and uint64_t have different sizes on i386 and x86_64, which causes the test case failure. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
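The size mismatch described above can be illustrated on the host side. This is a minimal sketch, assuming the test compared device-side `ulong` data (always 64 bits in OpenCL C) against a host type whose width depends on the ABI; `kClUlongBytes` and `host_type_matches_cl_ulong` are hypothetical names, not taken from the beignet sources.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// OpenCL C defines ulong as a 64-bit type regardless of the host ABI.
constexpr std::size_t kClUlongBytes = 8;

// Hypothetical helper: true if host type T has the same size as OpenCL's ulong.
template <typename T>
constexpr bool host_type_matches_cl_ulong() {
  return sizeof(T) == kClUlongBytes;
}

// A fixed-width type matches on every host; `unsigned long` does not on
// 32-bit Linux, where it is only 4 bytes wide.
static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");
static_assert(host_type_matches_cl_ulong<std::uint64_t>(),
              "uint64_t must match the device-side ulong");
```

Using a fixed-width host type for the reference data keeps the comparison independent of whether the test is built for a 32-bit or a 64-bit host.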
2015-01-23	Add the check for src and dst span different registers.	Junyan He	1	-2/+41
	On IVB and HSW, when dst spans two registers, the source MUST span two registers. So the following instructions: mov (16) r104.0<2>:uw r126.0<8;8,1>:uw { Align1, H1 } mov (16) r104.1<2>:uw r111.0<8;8,1>:uw { Align1, H1 } mov (16) r106.0<2>:uw r110.0<8;8,1>:uw { Align1, H1 } mov (16) r106.1<2>:uw r109.0<8;8,1>:uw { Align1, H1 } are illegal. Add a check here that splits such an instruction into 2 SIMD8 instructions. TODO: These instructions are allowed on BDW; this needs to be improved. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
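The restriction above can be checked with simple byte arithmetic. The following is a rough sketch, not the check added by the patch: it assumes a 32-byte general register and models a region only by its starting byte offset, element count, horizontal stride and element size. `spans_two_registers` and `needs_simd8_split` are hypothetical names.

```cpp
#include <cassert>

// Assumption: one Gen general register (GRF) holds 32 bytes.
const int kGrfBytes = 32;

// Hypothetical helper: does a region that starts `subreg_byte_offset` bytes
// into a register, with `exec_size` elements of `type_bytes` bytes spaced
// `hstride` elements apart, reach into the following register?
bool spans_two_registers(int subreg_byte_offset, int exec_size,
                         int hstride, int type_bytes) {
  assert(exec_size > 0 && hstride > 0 && type_bytes > 0);
  assert(subreg_byte_offset >= 0 && subreg_byte_offset < kGrfBytes);
  const int last_byte = subreg_byte_offset +
                        (exec_size - 1) * hstride * type_bytes +
                        type_bytes - 1;
  return last_byte >= kGrfBytes;
}

// True when an instruction hits the IVB/HSW rule quoted above:
// dst covers two registers while src covers only one.
bool needs_simd8_split(bool dst_spans_two, bool src_spans_two) {
  return dst_spans_two && !src_spans_two;
}
```

For `mov (16) r104.0<2>:uw r126.0<8;8,1>:uw`, the destination has 16 two-byte elements with stride 2, so it covers 64 bytes and spans two registers. The source is treated here as 16 contiguous two-byte elements (a simplification of the `<8;8,1>` region), which fit in one register, so the instruction would need to be split.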
2015-01-23	Add test case for long bitcast.	Junyan He	3	-0/+275
	Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-23	update utest to loosen userptr limitation	Guo Yejun	2	-2/+2
	The limitation inside the driver is loosened from page size to cache line size alignment; update utest accordingly. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-23	loosen the alignment limitation for host_ptr of CL_MEM_USE_HOST_PTR	Guo Yejun	3	-4/+22
	The current limitation is that both host_ptr and the buffer size must be page aligned. Loosen the limitation on host_ptr to cache line size (64 byte) alignment, and remove the limitation on the size. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
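An application can verify the relaxed host_ptr requirement before creating a buffer with CL_MEM_USE_HOST_PTR. This is a minimal sketch assuming the 64-byte cache line size stated above; the helper names are hypothetical and are not part of beignet or the OpenCL API.

```cpp
#include <cstddef>
#include <cstdint>

// Cache line size as stated in the commit message.
const std::size_t kCacheLineBytes = 64;

// Hypothetical helper: true if `ptr` sits on a cache line boundary.
bool is_cache_line_aligned(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kCacheLineBytes == 0;
}

// Hypothetical helper: round `addr` up to the next cache line boundary.
std::uintptr_t align_up_to_cache_line(std::uintptr_t addr) {
  return (addr + kCacheLineBytes - 1) / kCacheLineBytes * kCacheLineBytes;
}
```

A pointer that fails the check can be replaced by one obtained from an aligned allocator, or the application can over-allocate and round the start address up with the second helper.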
2015-01-23	correct the cache line size to be 64	Guo Yejun	2	-2/+2
	the correct value of cache line size is 64 bytes, not 128. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-22	GBE: fix popcount bugs.	Zhigang Gong	4	-10/+20
	We need to pass correct popcount source type to backend. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Luo, Xionghu" <xionghu.luo@intel.com>
2015-01-21	GBE: fix an ACC register related instruction scheduling bug	Zhigang Gong	3	-2/+18
	Some instructions modify the ACC register in the gen_context stage, which is not recognized by the current instruction scheduling algorithm. This patch fixes the bug by checking all the possible SEL_OPs which may change the ACC implicitly. The corresponding bugzilla link is: https://bugs.freedesktop.org/show_bug.cgi?id=88587 Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-19	Bump version to 1.0.1. (tag: Release_v1.0.1)	Zhigang Gong	2	-2/+5
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-16	fix the wrong implementation of popcount.	Luo Xionghu	2	-7/+4
	add disassembly for cbit. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-13	Fix the printf buffer size bug.	Junyan He	9	-17/+26
	We cannot know the exact size of the printf buffer before running the kernel. Sometimes, especially when the global work size is huge, the output buffer is not large enough and the print message logic causes a segmentation fault. We increase the printf buffer to at most 16M and add an out-of-range check to avoid the crash. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
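The two measures described above, a capped buffer size and a range check before each access, can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: "16M" is read as 16 MiB, and both helper names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Assumption: "16M at most" means 16 MiB.
const std::size_t kMaxPrintfBufBytes = 16u * 1024 * 1024;

// Hypothetical helper: clamp the requested printf buffer size to the cap.
std::size_t clamp_printf_buf_size(std::size_t requested) {
  return std::min(requested, kMaxPrintfBufBytes);
}

// Hypothetical helper: true if [offset, offset + len) lies inside a buffer
// of `buf_size` bytes. Written so that offset + len cannot overflow.
bool printf_slot_in_range(std::size_t offset, std::size_t len,
                          std::size_t buf_size) {
  return offset <= buf_size && len <= buf_size - offset;
}
```

With the cap in place, a kernel with a very large global size may produce more output than fits, so every read of a message slot is guarded by the range check instead of trusting the computed offset.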
2015-01-12	GBE: Fix a disassembly bug.	Ruiling Song	1	-2/+2
	It looks like a typo, which wrongly interprets the bti/msg_type field. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	GBE: disable spill register under simd16 mode.	Zhigang Gong	1	-3/+2
	Register spilling always costs much more than falling back to simd8, which can avoid register spilling or at least reduce the number of spilled registers. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2015-01-12	add the reduced self loop node detection.	Luo Xionghu	1	-11/+26
	If the self loop node is reduced, the llvm loop info can't detect this kind of self loop; handle it by checking whether the compacted node has a successor pointing to itself. v2: differentiate the compacted node from the basic node to make the logic clearer, and comment out the while node as it is not enabled now. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	reuse the loop info from llvm.	Luo Xionghu	2	-36/+21
	The original loop detection algorithm caused a 10x regression in luxmark build performance; this patch reuses the loop info from llvm to handle SelfLoopNode. The trimmed path can't recognize nested while structures (if nodes in while caused a performance regression). Also, the simple while loop node is still not handled. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	utests: Add const private array initialization test.	Ruiling Song	3	-0/+37
	Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	Change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16.	Chuanbo Weng	1	-1/+1
	Because accessing global memory with uchar16/char16 fully utilizes memory bandwidth, change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from 8 to 16. Three OpenCV cases speed up with this patch: OCL_ThreshFixture_Threshold, 25% improvement; OCL_MaxFixture_Max, 105% improvement; OCL_MinFixture_Min, 105% improvement. Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	add howto for old gcc version	Guo Yejun	1	-0/+58
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	only build tests that do not need compiler when standalone compiler is provided	Guo Yejun	2	-9/+23
	The test case that is built is load_program_from_bin_file. It demonstrates how to generate the binary kernel compiler_ceil.bin from the source kernel compiler_ceil.cl with the standalone compiler for a specific gen pci id, and how to load and execute the binary kernel when the compiler is not available on the running system. Also, please make sure compiler_ceil.bin is really updated if one already exists; the safe way is to delete it first. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	add utest of CL_MEM_ALLOC_HOST_PTR	Guo Yejun	3	-0/+32
	Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-12	add CMake option USE_STANDALONE_GBE_COMPILER and STANDALONE_GBE_COMPILER_DIR	Guo Yejun	7	-16/+110
	On some platforms with an old c/c++ environment, C++11 features are not supported. This makes the gbe compiler part fail to build, because it depends on LLVM/clang, which uses C++11 features. The way to resolve this is to build a standalone gbe compiler on another suitable system, and then build beignet with that standalone gbe compiler by setting USE_STANDALONE_GBE_COMPILER=true. The path of the standalone compiler is /usr/local/lib/beignet by default, or can be specified by STANDALONE_GBE_COMPILER_DIR. Once USE_STANDALONE_GBE_COMPILER is given, none of the gbe compiler related code is built any longer; only libcl.so and libgbeinterp.so are built. libcl.so is specific to GEN_PCI_ID, which is queried from the build machine or can be specified as a CMake option. v2: separate the CMake option name. update the commit comments. add back the script for gen pci id, and build driver with it. v3: add file FindStandaloneGbeCompiler.cmake to make the main cmakefile clean. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12	add option BUILD_STANDALONE_GBE_COMPILER to build static compiler	Guo Yejun	1	-10/+29
	The standalone compiler (gbe_bin_generater), which depends on LLVM/clang, can only be built with C++11 features. To make it usable in an old c/c++ environment, add a CMake option to link against all static libraries, and also pack the compiler and necessary files into a tarball. v2: change the option name to BUILD_STANDALONE_GBE_COMPILER. zip necessary files into a tar ball. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	Add read buffer/image benchmark.	Yang Rong	5	-1/+159
	Add these two benchmarks to compare the buffer and image performance. V2: init the coord before read image. V3: Correct the image's width and buffer's read index. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	CL/Driver/HSW: Convert L3 cycle for texture to uncachable.	Zhigang Gong	1	-1/+1
	This is to work around a bug we found with darktable. After this patch, darktable works fine on HSW. Based on the test results, most of the benchmarks are not affected much by this patch. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2015-01-09	Change the IVB/HSW L3 SQC credit setting.	Yang Rong	1	-2/+2
	Set the L3SQ General Priority Credit to max and the L3SQ High Priority Credit to zero. This slightly improves performance, by about 2% on luxmark. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	utests: skip one test when it fail to open XDisplay.	Zhigang Gong	1	-0/+4
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-09	Fix loop condition of PrintfSet constructor.	Yan Wang	1	-1/+1
	Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	remove useless dependency libocl	Guo Yejun	1	-2/+0
	libocl is the name of the sub directory and of the project in that sub directory; it is not something that others can depend on. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	CL/Driver: quick fix regression caused by remove MI_FLUSH.	Zhigang Gong	1	-0/+2
	On Gen8, we also need an extra pipe control after the MEDIA_STATE_FLUSH. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-09	refine gbe_bin_generater usage to add -t option	Guo Yejun	1	-1/+1
	-t option specifies the gen target pci id, it tells gbe_bin_generater the target platform that it compiles for. The compile result is llvm level binary if this option is not given. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	libocl: Reimplement trigonometric functions.	Ruiling Song	1	-378/+172
	The previous version was ported from msun, which is derived from fdlibm. It is good for CPUs, with lots of if-condition checks that try to optimize for different input data, but it is really bad for GPUs. So I reimplemented these functions based on the well-known Payne & Hanek algorithm. Compared with the previous version, this reduces the static ASM instruction count of sin/cos from about 1700 to 400. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	CL/Driver: enable atomics in L3 for HSW.	Zhigang Gong	2	-1/+14
	This could get more than 10x boost for some atomic stress workloads. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-09	libocl: remove useless code.	Ruiling Song	1	-57/+0
	This kind of logic is already handled by atan2(). Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	fix utest build for some old gcc version	Guo Yejun	5	-26/+26
	change the keyword from constexpr to const, update the code for explicit type conversion and std::map's iterator. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	do not use C++11 features inside libgbeinterp	Guo Yejun	12	-87/+111
	Some embedded systems have not upgraded their c/c++ environment, which leads to the request to remove the C++11 features. Doing so is possible for CL_EMBEDDED_PROFILE with some more changes (to be done later). This change replaces the keywords auto and nullptr. Note that new C++ features are a must for libgbe (the OpenCL compiler), which depends on LLVM/clang. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	change Immediate::operator= from private to public	Guo Yejun	1	-1/+2
	Change the access of "Immediate & operator= (const Immediate &)" in class Immediate from private to public; otherwise, a compile error appears when building with old gcc versions for the following code in function.hpp: INLINE ImmediateIndex newImmediate(const Immediate &imm) { const ImmediateIndex index(this->immediateNum()); this->immediates.push_back(imm); return index; } Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	do not include llvm/clang headers for libgbeinterp	Guo Yejun	2	-1/+12
	libgbeinterp does not depend on llvm/clang, so remove these header files to clean up the code. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	Remove obsolete MI_FLUSH	Zhenyu Wang	4	-11/+3
	Emulator debugging caught that MI_FLUSH is obsolete from IVB/HSW on and that beignet used the wrong flush bit too, so don't take the risk and remove it. The current kernel takes care of flushing the ring after each request, so an extra flush shouldn't be needed. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-09	Don't check some edge condition in non-strict mode.	Zhigang Gong	1	-2/+2
	Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-09	add edge case detection for powr in utests	Meng Mengmeng	2	-3/+6
	powr(x,y) returns NaN for x<0 in the spec, so add that check for powr. Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
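A host-side reference for this edge case might look like the sketch below. It covers only the x < 0 rule mentioned above and defers everything else to `std::pow`, so the other powr edge cases in the OpenCL specification are not modeled. `powr_reference` is a hypothetical name, not the function used in the utests.

```cpp
#include <cmath>
#include <limits>

// Hypothetical host-side reference for powr, covering only the x < 0 rule.
// All other inputs are passed straight to std::pow.
float powr_reference(float x, float y) {
  if (x < 0.0f) {
    return std::numeric_limits<float>::quiet_NaN();
  }
  return std::pow(x, y);
}
```

The difference from pow shows up for a negative base with an integral exponent: pow(-2, 2) is 4, while powr(-2, 2) is NaN.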
2015-01-09	utests: make utests maths ULP values consistent with specification	Meng Mengmeng	3	-8/+96
	Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	runtime: fix max work group size for IVBGT1.	Zhigang Gong	1	-2/+2
	If the kernel is compiled under simd8 mode, the maximum work group size should be 8 * 6 * 6 = 288. The original 512 is too large for it. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
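The arithmetic above can be written as a small helper. The commit message only gives the product 8 * 6 * 6; reading the factors as SIMD width, hardware threads per execution unit and execution unit count is an assumption made here for illustration, and `max_work_group_size` is a hypothetical name.

```cpp
// Hypothetical helper. The meaning of the second and third factor is an
// assumption; the commit message only states the product 8 * 6 * 6.
unsigned max_work_group_size(unsigned simd_width,
                             unsigned threads_per_eu,
                             unsigned eu_count) {
  return simd_width * threads_per_eu * eu_count;
}
```

With the values from the message, `max_work_group_size(8, 6, 6)` gives 288, which is below the previous limit of 512.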
2015-01-09	runtime: tweak max memory allocation size.	Zhigang Gong	2	-2/+12
	Increase the maximum memory allocation size to at least 512MB, and set it larger if the system has more total memory. This tweak makes darktable happy to handle big pictures. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> v2: reduce max constant buffer to 128MB. v3: fix the sysinfo usage. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2015-01-09	utests: reduce test count.	Zhigang Gong	1	-4/+5
	No need to iterate so many times. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2015-01-09	Fix PrintfState copying.	Yan Wang	1	-4/+29
	PrintfState includes a std::string object and shouldn't be copied by malloc/memcpy. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: He Junyan <junyan.he@inbox.com>
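The problem with byte-wise copies of such an object can be shown with a small stand-in type. This is only a sketch: `PrintfStateSketch` and `clone_state` are hypothetical and do not reflect the real PrintfState layout or the actual fix.

```cpp
#include <string>

// Hypothetical stand-in for a state record that owns a std::string.
struct PrintfStateSketch {
  int conversion;
  std::string text;
};

// Copy through the copy constructor so the string is duplicated properly.
// A malloc/memcpy copy would only duplicate the string's internal bytes,
// leaving two objects that believe they own the same storage.
PrintfStateSketch* clone_state(const PrintfStateSketch& src) {
  return new PrintfStateSketch(src);
}
```

The clone is independent of the original: changing the text of one does not affect the other, and each can be destroyed safely with `delete`.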
2015-01-09	Separate flush and invalidate in function intel_gpgpu_pipe_control.	Yang, Rong	2	-2/+36
	HSW has a limitation for PIPECONTROL with RO Cache Invalidation: Prior to programming a PIPECONTROL command with any of the RO cache invalidation bit set, program a PIPECONTROL flush command with CS stall bit and HDC Flush bit set. So two PIPECONTROL commands must be used to flush and invalidate the L3 cache on HSW. This patch fixes some random failures in cases with very heavy DC read/write on HSW. Signed-off-by: Yang, Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09	Use libdrm interface to get device id	Zhenyu Wang	2	-22/+2
	Remove our own ioctl call for the device id and use the libdrm interface instead. This not only saves one extra ioctl call, as the id has already been read when the gem bufmgr initializes, but also allows overriding the device id with the libdrm helper environment variable 'INTEL_DEVID_OVERRIDE'. Combined with aub dump, you can do device debugging with the fulsim emulator by choosing any device you want, without needing real hardware at all. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>