Age | Commit message | Author | Files | Lines
2015-01-20 | Add functions for conversion between native and fake long. | Junyan He | 2 | -0/+61
On BDW, storing and loading a native long requires the A64 instruction set. This instruction set was added specifically to support 64-bit reads and writes, but it has a big limitation: the BTI parameter must refer to a stateless surface. This is unacceptable for us because it can cause overwrite problems. We fall back to the old approach, which reads and writes the long as a vec2 of int, split into top and bottom halves. The pack/unpack functions here disassemble the data before a write and assemble it after a read.
Unpack before writing like this:
    mov(8) g108<1>:UD g112<8,4,2>:UD { align1 WE_normal 1Q };
    mov(8) g110<1>:UD g112.1<8,4,2>:UD { align1 WE_normal 1Q };
    mov(8) g109<1>:UD g114<8,4,2>:UD { align1 WE_normal 2Q };
    mov(8) g111<1>:UD g114.1<8,4,2>:UD { align1 WE_normal 2Q };
    send(16) null:UW g106<8,8,1>:UD
and pack after reading like this:
    send(16) g120<1>:UW g124<8,8,1>:UD
    mov(8) g116<2>:UD g120<4,4,1>:UD { align1 WE_normal 1Q };
    mov(8) g118<2>:UD g121<4,4,1>:UD { align1 WE_normal 1Q };
    mov(8) g116.1<2>:UD g122<4,4,1>:UD { align1 WE_normal 1Q };
    mov(8) g118.1<2>:UD g123<4,4,1>:UD { align1 WE_normal 1Q };
Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
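The unpack/pack idea can be sketched on the host side as follows. This is only an illustration, not the backend code: the helper names `unpackLongs` and `packLongs` are hypothetical, and the layout (all bottom halves first, then all top halves) is an assumption read off the register listing in the message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split each 64-bit value into two 32-bit halves.
// Assumed layout: n bottom halves followed by n top halves.
std::vector<std::uint32_t> unpackLongs(const std::vector<std::uint64_t> &src) {
  const std::size_t n = src.size();
  std::vector<std::uint32_t> halves(2 * n);
  for (std::size_t i = 0; i < n; ++i) {
    halves[i] = static_cast<std::uint32_t>(src[i] & 0xffffffffu);  // bottom half
    halves[n + i] = static_cast<std::uint32_t>(src[i] >> 32);      // top half
  }
  return halves;
}

// Hypothetical helper: the inverse of unpackLongs, used after a read.
std::vector<std::uint64_t> packLongs(const std::vector<std::uint32_t> &halves) {
  const std::size_t n = halves.size() / 2;
  std::vector<std::uint64_t> dst(n);
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = (static_cast<std::uint64_t>(halves[n + i]) << 32) | halves[i];
  }
  return dst;
}
```

Packing the result of an unpack returns the original values, which is the property the write/read path relies on.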
2015-01-20 | Disasm supports to print long imm value in instruction. | Junyan He | 1 | -0/+13
Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-20 | Modify the load IMM 64 function. | Junyan He | 4 | -5/+5
We split the load imm 64 into int64 and uint64. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-20 | Add long support flag into gen selection | Junyan He | 1 | -2/+12
Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-20 | Add the u64 imm type in register | Junyan He | 1 | -0/+7
Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-20 | Modify the split logic in encoder | Junyan He | 1 | -11/+58
For an instruction like: MOV(16) rxx<4,4:1>:UQ ryy<4,4:1>:UQ the src or dst will span 4 lines, which is illegal. The src and dst cannot cross more than 2 adjacent lines. We need to split this kind of instruction into two SIMD8 instructions here. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
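The arithmetic behind the split can be sketched as below. It is a simplified model, not the encoder's logic: it assumes a 32-byte GRF register per "line", an operand that starts on a register boundary, and a uniform element stride. The function names are hypothetical.

```cpp
// Size of one GRF register ("line") in bytes on Gen hardware.
const unsigned kGrfBytes = 32;

// Number of registers an operand touches, assuming it starts at a register
// boundary and elements are laid out with a uniform stride (in elements).
unsigned registersSpanned(unsigned execWidth, unsigned elemBytes, unsigned hstride) {
  const unsigned bytes = execWidth * elemBytes * hstride;
  return (bytes + kGrfBytes - 1) / kGrfBytes;
}

// An operand may not cross more than 2 adjacent registers.
bool needsSplit(unsigned execWidth, unsigned elemBytes, unsigned hstride) {
  return registersSpanned(execWidth, elemBytes, hstride) > 2;
}
```

For a packed UQ operand, SIMD16 touches 16 * 8 = 128 bytes (4 registers) and has to be split, while each SIMD8 half touches 64 bytes (2 registers) and is legal.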
2015-01-20 | Add the long unpacked ud uw into the instruction schedule consideration | Junyan He | 2 | -3/+12
Besides long and double, unpacked ud for long <8,4:2> and unpacked uw for long <16,4:4> can also span several registers. We need to add the second half to the dependencies when scheduling instructions. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-20 | Add unpacked ud and unpacked uw for long type. | Junyan He | 1 | -7/+41
The unpacked ud and uw are used for type conversion. If the src type and dst type are different, the hstride in bytes must be the same. So conversion for long needs to be: MOV r1<2>:UD r2<4,4:1>:UQ for ulong to ud and MOV r1<4>:UW r2<4,4:1>:UQ for ulong to uw Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
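The stride rule can be checked with a little arithmetic. The helper below is hypothetical and assumes the source is packed (hstride 1), so the destination stride in elements is simply the ratio of the element sizes.

```cpp
// Destination hstride (in elements) that keeps the byte stride equal to the
// byte stride of a packed source. Assumes srcElemBytes is a multiple of
// dstElemBytes, as with UQ -> UD or UQ -> UW.
unsigned unpackedDstHStride(unsigned srcElemBytes, unsigned dstElemBytes) {
  return srcElemBytes / dstElemBytes;
}
```

With an 8-byte UQ source this gives stride 2 for UD (2 * 4 = 8 bytes) and stride 4 for UW (4 * 2 = 8 bytes), matching the `<2>` and `<4>` destinations in the message.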
2015-01-20 | Add long imm value in gen8 instruction. | Junyan He | 2 | -5/+11
gen8 now supports 64-bit immediate values for single-source instructions. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-20 | Add long type support for disasm. | Junyan He | 1 | -4/+8
Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-20 | Import the native long type of ul1 ul8 and ul16 | Junyan He | 1 | -0/+51
The native long type is supported on BDW and later, so we need to import it as a native register type. We declare it using vec4, which makes it look like: rxx<4,4:1>:UQ and rxx<4,4:1>:Q. We have the restriction that the reg vstride cannot cross lines, so with rxxx<8,8:1>:UQ the vstride would cross two adjacent lines and be illegal. We can just fall back to width 4 to satisfy the restriction. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-19 | update document. | Zhigang Gong | 2 | -0/+4
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-16 | fix the wrong implementation of popcount. | Luo Xionghu | 2 | -7/+4
add disassembly for cbit. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-16 | fix llvm.trunc.float instruction bug. | Luo | 1 | -3/+1
float to float trunc should use RNDU IR instruction. v2: fix typo. should be RNDD instead of RNDU. v3: use RNDZ rather than RNDD. Signed-off-by: Luo <xionghu.luo@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
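Why the fix ended up at RNDZ can be seen by comparing the three rounding directions on the host. The sketch assumes RNDU, RNDD and RNDZ mean round up, round down and round toward zero, and models them with `std::ceil`, `std::floor` and `std::trunc`; the struct and function names are hypothetical.

```cpp
#include <cmath>

// Results of the three rounding directions for one input.
struct RoundingResults {
  float up;    // models RNDU (assumed: round toward +infinity)
  float down;  // models RNDD (assumed: round toward -infinity)
  float zero;  // models RNDZ (assumed: round toward zero)
};

RoundingResults roundAllWays(float x) {
  RoundingResults r;
  r.up = std::ceil(x);
  r.down = std::floor(x);
  r.zero = std::trunc(x);
  return r;
}
```

Rounding up agrees with truncation only for negative inputs and rounding down only for positive ones; rounding toward zero agrees for both signs.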
2015-01-15 | add llvm intrinsic call translate. | Luo | 2 | -4/+228
add sqrt, ceil, ctlz, fma, trunc, copysign intrinsicID to handle llvm call functions; the copysignf is from libFun. Signed-off-by: Luo <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-15 | add clz(count leading zero) utest. | Luo Xionghu | 3 | -0/+80
this kernel calls the llvm __builtin_clz to generate the llvm.clz function, which then uses the gen clz instruction; this differs from the test compiler_clz_int, which uses fbh to implement it. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
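For reference, a portable host-side count-leading-zeros that such a test could compare against might look like this. It is a sketch, not the utest code; the name `leadingZeros32` is hypothetical, and returning 32 for a zero input follows the OpenCL `clz` convention.

```cpp
#include <cstdint>

// Count leading zero bits of a 32-bit value by scanning from the top bit.
// Returns 32 when x is 0.
unsigned leadingZeros32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
    ++n;
  }
  return n;
}
```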
2015-01-15 | add LZD IR instruction. | Luo Xionghu | 6 | -1/+16
the LZD IR instruction was missing; it should be enabled to generate the hardware-supported instruction. v2: add gen backend implementation. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-13 | Fix the printf buffer size bug. | Junyan He | 9 | -17/+26
We cannot know the exact size of the printf buffer before running the kernel. Sometimes, especially when the global work item count is huge, the output buffer is not large enough and the print message logic causes a segmentation fault. We increase the printf buffer to at most 16M and add an out-of-range check to avoid the crash. Signed-off-by: Junyan He <junyan.he@linux.intel.com> Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
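The capped buffer plus range check can be sketched as below. This is not the driver's implementation: the class `BoundedBuffer` is hypothetical, and "16M" is assumed to mean 16 * 1024 * 1024 bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed cap: 16M taken as 16 * 1024 * 1024 bytes.
const std::size_t kMaxPrintfBufferBytes = 16u * 1024u * 1024u;

// Hypothetical buffer that clamps its size to the cap and refuses writes
// that would run past the end instead of overflowing.
class BoundedBuffer {
 public:
  explicit BoundedBuffer(std::size_t requested)
      : data_(requested < kMaxPrintfBufferBytes ? requested : kMaxPrintfBufferBytes),
        used_(0) {}

  // Returns false (and writes nothing) if len bytes do not fit.
  bool append(const void *src, std::size_t len) {
    if (len > data_.size() - used_) return false;  // out-of-range check
    if (len != 0) std::memcpy(data_.data() + used_, src, len);
    used_ += len;
    return true;
  }

  std::size_t capacity() const { return data_.size(); }
  std::size_t used() const { return used_; }

 private:
  std::vector<char> data_;
  std::size_t used_;
};
```

A write that does not fit is dropped and reported to the caller, which trades lost output for not crashing.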
2015-01-12 | add howto for old gcc version | Guo Yejun | 1 | -0/+58
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12 | only build tests that do not need compiler when standalone compiler is provided | Guo Yejun | 2 | -9/+23
the built test case is load_program_from_bin_file; it demos how to generate the binary kernel compiler_ceil.bin from the source kernel compiler_ceil.cl with the standalone compiler for a specific gen pci id, and also demos how to load and execute the binary kernel when the compiler is not available on the running system. btw, please make sure compiler_ceil.bin is really updated if one already exists; the safe way is to delete it first. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12 | add CMake option USE_STANDALONE_GBE_COMPILER and STANDALONE_GBE_COMPILER_DIR | Guo Yejun | 7 | -16/+110
On some platforms with an old c/c++ environment, C++11 features are not supported, which makes it impossible to build the gbe compiler part, since it depends on LLVM/clang and uses C++11 features. The way to resolve this is to build a standalone gbe compiler on another suitable system, and then build beignet with the already built standalone gbe compiler by setting USE_STANDALONE_GBE_COMPILER=true. The path of the standalone compiler is /usr/local/lib/beignet by default, or can be specified by STANDALONE_GBE_COMPILER_DIR. Once USE_STANDALONE_GBE_COMPILER is given, all the gbe compiler related code is no longer built; only libcl.so and libgbeinterp.so are built. And libcl.so is specific to GEN_PCI_ID, which is queried from the build machine or can be specified as a CMake option. v2: separate the CMake option name. update the commit comments. add back the script for gen pci id, and build driver with it. v3: add file FindStandaloneGbeCompiler.cmake to make the main cmakefile clean. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-12 | add option BUILD_STANDALONE_GBE_COMPILER to build static compiler | Guo Yejun | 1 | -10/+29
The standalone compiler (gbe_bin_generater), depending on LLVM/clang, can only be built with C++11 features. To make it usable in an old c/c++ environment, add one CMake option to link against all static libraries. Also pack the compiler and the necessary files into a tarball. v2: change the option name to BUILD_STANDALONE_GBE_COMPILER. pack necessary files into a tarball. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09 | Add read buffer/image benchmark. | Yang Rong | 5 | -1/+159
Add these two benchmarks to compare the buffer and image performance. V2: init the coord before read image. V3: Correct the image's width and buffer's read index. Signed-off-by: Yang Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-09 | CL/Driver/HSW: Convert L3 cycle for texture to uncachable. | Zhigang Gong | 1 | -1/+1
This is to work around a bug we found with darktable. After this patch, darktable works fine on HSW. And based on the test results, most of the benchmarks are not affected much by this patch. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2015-01-09 | utests: skip one test when it fail to open XDisplay. | Zhigang Gong | 1 | -0/+4
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-08 | Fix loop condition of PrintfSet constructor. | Yan Wang | 1 | -1/+1
Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-07 | remove useless dependency libocl | Guo Yejun | 1 | -2/+0
libocl is the name of the subdirectory and of the project in that subdirectory; it is not something that others can depend on. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-07 | CL/Driver: quick fix regression caused by remove MI_FLUSH. | Zhigang Gong | 1 | -0/+2
On Gen8, we also need an extra pipe control after the MEDIA_STATE_FLUSH. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-07 | refine gbe_bin_generater usage to add -t option | Guo Yejun | 1 | -1/+1
The -t option specifies the gen target pci id; it tells gbe_bin_generater which target platform to compile for. The compile result is an llvm-level binary if this option is not given. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-07 | libocl: Reimplement trigonometric functions. | Ruiling Song | 1 | -378/+172
The previous version was ported from msun, which is derived from fdlibm. That code is good for a cpu, with lots of if-condition checks that try to optimize for different input data, but it is really bad for a gpu. So I reimplemented these functions based on the well-known Payne & Hanek algorithm. Compared with the previous version, this reduces the static ASM instruction count of sin/cos from about 1700 to 400. Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-07 | CL/Driver: enable atomics in L3 for HSW. | Zhigang Gong | 2 | -1/+14
This could get more than 10x boost for some atomic stress workloads. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
2015-01-07 | libocl: remove useless code. | Ruiling Song | 1 | -57/+0
This kind of logic is already handled by atan2(). Signed-off-by: Ruiling Song <ruiling.song@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-06 | fix utest build for some old gcc version | Guo Yejun | 5 | -26/+26
change the keyword from constexpr to const, update the code for explicit type conversion and std::map's iterator. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-06 | do not use C++11 features inside libgbeinterp | Guo Yejun | 12 | -87/+111
some embedded systems have not upgraded their c/c++ environment, which leads to the request to remove the C++11 features. This makes CL_EMBEDDED_PROFILE possible with some more changes (to be done later). This change replaces the keywords auto and nullptr. btw, C++11 features are a must for libgbe (the OpenCL compiler), which depends on LLVM/clang. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-06 | change Immediate::operator= from private to public | Guo Yejun | 1 | -1/+2
change the access of "Immediate & operator= (const Immediate &)" in class Immediate from private to public; otherwise, a compile error appears when building with old gcc versions for the following code in function.hpp:
    INLINE ImmediateIndex newImmediate(const Immediate &imm) {
      const ImmediateIndex index(this->immediateNum());
      this->immediates.push_back(imm);
      return index;
    }
Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
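The requirement can be illustrated with a stand-in class. `Imm` and `addImmediate` below are hypothetical and much simpler than the real `Immediate`; the point is only that `push_back` on a vector of the type compiles on older toolchains when copy assignment is publicly accessible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a value class stored in a std::vector.
class Imm {
 public:
  explicit Imm(int v) : value_(v) {}
  Imm(const Imm &other) : value_(other.value_) {}
  // Public copy assignment: pre-C++11 std::vector may need it inside
  // push_back when it has to shift or reallocate elements.
  Imm &operator=(const Imm &other) {
    value_ = other.value_;
    return *this;
  }
  int value() const { return value_; }

 private:
  int value_;
};

// Mirrors the shape of newImmediate: append and return the new index.
std::size_t addImmediate(std::vector<Imm> &pool, const Imm &imm) {
  const std::size_t index = pool.size();
  pool.push_back(imm);
  return index;
}
```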
2015-01-06 | do not include llvm/clang headers for libgbeinterp | Guo Yejun | 2 | -1/+12
libgbeinterp does not depend on llvm/clang, so remove these header files to clean up the code. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2015-01-06 | Remove obsolete MI_FLUSH | Zhenyu Wang | 4 | -11/+3
This was caught in emulator debugging: MI_FLUSH is obsolete since IVB/HSW, and beignet used the wrong flush bit too, so don't take the risk and just remove it. The current kernel takes care of flushing the ring after each request, so no extra flush should be needed. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-06 | Don't check some edge condtion in non-strict mode. | Zhigang Gong | 1 | -2/+2
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
2015-01-04 | add edge case detection for powr in utests | Meng Mengmeng | 2 | -3/+6
Per the spec, powr(x,y) returns NaN for x<0, so add that case for powr. Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
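A host-side reference for this one edge case could look like the sketch below. The function `powrReference` is hypothetical and deliberately covers only the x<0 rule from the message; the other powr edge cases in the spec are not modeled.

```cpp
#include <cmath>
#include <limits>

// Reference value for powr that adds only the negative-base rule:
// a negative x yields NaN, everything else is delegated to std::pow.
double powrReference(double x, double y) {
  if (x < 0.0) return std::numeric_limits<double>::quiet_NaN();
  return std::pow(x, y);
}
```

Note the contrast with plain `pow`, which returns 4.0 for `pow(-2.0, 2.0)`.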
2015-01-04 | runtime: fix max work group size for IVBGT1. | Zhigang Gong | 1 | -2/+2
If the kernel is compiled under simd8 mode, the maximum work group size should be 8 * 6 * 6 = 288. The original 512 is too large for it. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
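The bound is a plain product. In the sketch below the meaning of the two 6s (EU count and hardware threads per EU) is an assumption, since the message only gives the numbers; the function name is hypothetical.

```cpp
// Upper bound on work items that can be resident at once.
// Assumption: the factors in "8 * 6 * 6" are SIMD width, EU count and
// hardware threads per EU; the message does not name them.
unsigned maxWorkGroupSize(unsigned simdWidth, unsigned euCount, unsigned threadsPerEu) {
  return simdWidth * euCount * threadsPerEu;
}
```

With SIMD8 this gives 8 * 6 * 6 = 288, which is below the previous limit of 512.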
2015-01-04 | runtime: tweak max memory allocation size. | Zhigang Gong | 2 | -2/+12
Increase the maximum memory allocation size to at least 512MB, and set it larger if the system has more total memory. This tweak makes darktable happy to handle big pictures. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> v2: reduce max constant buffer to 128MB. v3: fix the sysinfo usage. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2015-01-04 | utests: reduce test count. | Zhigang Gong | 1 | -4/+5
No need to iterate so many times. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com> Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
2014-12-29 | Fix PrintfState copying. | Yan Wang | 1 | -4/+29
PrintfState includes a std::string object and shouldn't be copied by malloc/memcpy. Signed-off-by: Yan Wang <yan.wang@linux.intel.com> Reviewed-by: He Junyan <junyan.he@inbox.com>
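The safe alternative is to let the copy constructor do the work. The struct `State` below is a hypothetical stand-in for `PrintfState`, and `copyStates` is an illustrative helper, not the function changed by the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a state record that owns a std::string.
struct State {
  int index;
  std::string text;
};

// Copy n records element by element through State's copy constructor,
// so every std::string gets its own properly constructed copy.
// A raw malloc + memcpy would duplicate the string's internal pointers.
std::vector<State> copyStates(const State *src, std::size_t n) {
  return std::vector<State>(src, src + n);
}
```

Because each copy owns its string, changing a copy leaves the original untouched.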
2014-12-29 | Separate flush and invalidate in function intel_gpgpu_pipe_control. | Yang, Rong | 2 | -2/+36
HSW has a limitation for PIPECONTROL with RO Cache Invalidation: Prior to programming a PIPECONTROL command with any of the RO cache invalidation bit set, program a PIPECONTROL flush command with CS stall bit and HDC Flush bit set. So two PIPECONTROL commands must be used to flush and invalidate the L3 cache on HSW. This patch fixes some random failures in workloads with very heavy DC read/write on HSW. Signed-off-by: Yang, Rong <rong.r.yang@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-29 | Use libdrm interface to get device id | Zhenyu Wang | 2 | -22/+2
Remove our own ioctl call for the device id and use the libdrm interface instead. This not only saves one extra ioctl call, as the id has already been read when the gem bufmgr inits, but also allows overriding the device id with the libdrm helper environment variable 'INTEL_DEVID_OVERRIDE'. Combined with aub dump, you can do device debugging with the fulsim emulator by choosing any device you want, without needing real hardware at all. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-29 | Add aub dump support | Zhenyu Wang | 1 | -1/+16
Use the current libdrm interface to dump an aub file for debugging in the emulator. This adds a new driver environment variable, OCL_DUMP_AUB=1, to enable this. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-25 | Remove deprecated fulsim code | Zhenyu Wang | 7 | -274/+2
Remove pretty old fulsim code, which seems to have no users, uses interfaces that are not in open source libdrm, and calls the windows fulsim binary instead of the linux one. We will use the current libdrm interface instead. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
2014-12-25 | replace hash_map with map | Guo Yejun | 5 | -91/+5
there is no strong evidence that hash_map gives better performance for beignet, and hash_map requires std::hash, which is not supported in some old g++ versions, so replace hash_map with map. Signed-off-by: Guo Yejun <yejun.guo@intel.com> Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
2014-12-25 | add collectImageArgs to handle image count limitations. | Luo Xionghu | 1 | -0/+28
the number of read-only images in a kernel should be less than or equal to MAX_READ_IMAGE_ARS; the number of write-only images in a kernel should be less than or equal to MAX_WRITE_IMAGE_ARS. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
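The check amounts to counting image arguments per access qualifier and comparing against the two limits. The sketch below is not `collectImageArgs` itself: the enum, the function name and passing the limits as parameters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical access qualifier of an image kernel argument.
enum ImageAccess { IMAGE_READ_ONLY, IMAGE_WRITE_ONLY };

// Returns true when both the read-only and the write-only image counts
// are less than or equal to their respective limits.
bool imageArgsWithinLimits(const std::vector<ImageAccess> &args,
                           std::size_t maxReadImages,
                           std::size_t maxWriteImages) {
  std::size_t readCount = 0;
  std::size_t writeCount = 0;
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (args[i] == IMAGE_READ_ONLY) {
      ++readCount;
    } else {
      ++writeCount;
    }
  }
  return readCount <= maxReadImages && writeCount <= maxWriteImages;
}
```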
2014-12-25 | fix min_max_read_image_args and min_max_parameter_size issue. | Luo Xionghu | 8 | -10/+13
this patch reverts fb4bced99b7c08d0d43386abf33448860fb7fc41, as the spec defines the minimum value of min_max_parameter_size as 1024; BTI_MAX_NUM and btiBase could be 130 because of 128 images with 1 const surface and 1 private surface. v2: add BTI_MAX_READ_IMAGE_ARGS and BTI_MAX_WRITE_IMAGE_ARGS in backend. change the BTI_MAX_ID to 253. the image counts will be calculated in a later patch and checked against the limits. Signed-off-by: Luo Xionghu <xionghu.luo@intel.com> Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>