Age | Commit message | Author | Files | Lines |
|
SKL's qpitch differs from BDW's. For SURFTYPE_1D, qpitch is the distance in pixels between array slices.
So add two parameters, slice_pitch and bpp, to calculate it.
|
|
On SKL, the cache control field in the surface state is now an index into the pre-defined registers.
Index 9 is what beignet needs, so use it directly.
SKL's select_pipeline command needs the mask; add intel_gpgpu_select_pipeline_gen9 for it.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
|
|
Correct struct gen8_interface_descriptor.
Add function intel_gpgpu_build_idrt_gen9 for the different SLM size setting.
Disable SKL's global barrier for now.
|
|
Starting with BDW, pipe control needs 6 DWs; correct it. This also affects BDW.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
|
|
3D images can't use TILE_X on SKL, so change the default tiling mode to TILE_Y.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
|
|
Add intel_gpgpu_set_base_address_gen9 for SKL; the other functions in intel_GPGPU are the same as for BDW.
The SKL backend is also the same as BDW's. It should derive from GEN8 later.
With this commit, some utests pass.
|
|
SKL adds a new GT4 device type.
|
|
The current limitation is that both host_ptr and the buffer size must be
page aligned. Loosen the limitation on host_ptr to cache line
size (64-byte) alignment, and remove the limitation on the size.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
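
The relaxed rule above can be sketched as a simple pointer check. This is only an illustration, not the driver's code: the helper name host_ptr_is_userptr_candidate and the CACHELINE_SIZE macro are hypothetical, and the 64-byte value is taken from the commit text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value from the commit text: cache line size is 64 bytes. */
#define CACHELINE_SIZE 64u

/* Hypothetical helper: returns true if host_ptr satisfies the relaxed
 * rule (cache-line aligned; the buffer size is no longer checked). */
static bool host_ptr_is_userptr_candidate(const void *host_ptr)
{
    return host_ptr != NULL &&
           ((uintptr_t)host_ptr % CACHELINE_SIZE) == 0;
}
```

Under the old rule the same check would have used the page size and would also have tested the buffer size.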
|
|
The correct value of the cache line size is 64 bytes, not 128.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
We cannot know the exact size needed for the printf buffer
before running the kernel. Sometimes, especially when
the global work size is huge, the output buffer
is not large enough and the print message logic causes a
segmentation fault.
Increase the printf buffer to at most 16M and add an
out-of-range check to avoid the crash.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
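
A minimal sketch of the two ideas in this commit, a capped buffer size and a range check before each read. The function names and the record layout are assumptions made for illustration; only the 16M cap comes from the commit text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Upper bound from the commit text: at most 16M. */
#define PRINTF_BUF_MAX ((size_t)16 * 1024 * 1024)

/* Hypothetical helper: clamp a requested printf buffer size to the cap. */
static size_t clamp_printf_buf_size(size_t requested)
{
    return requested > PRINTF_BUF_MAX ? PRINTF_BUF_MAX : requested;
}

/* Hypothetical reader: copy len bytes at offset out of the buffer, but
 * only if the whole range lies inside it. Returns false instead of
 * reading out of range. */
static bool read_printf_slot(const unsigned char *buf, size_t buf_size,
                             size_t offset, void *out, size_t len)
{
    if (offset > buf_size || len > buf_size - offset)
        return false;
    memcpy(out, buf + offset, len);
    return true;
}
```

The comparison is written as `len > buf_size - offset` rather than `offset + len > buf_size` so that it cannot overflow.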
|
|
On some platforms with an old C/C++ environment, C++11 features are not
supported, which causes the build of the gbe compiler part to fail, since it
depends on LLVM/clang, which uses C++11 features.
The way to resolve this is to build a standalone gbe compiler on another
suitable system, and then build beignet with the already built standalone
gbe compiler by setting USE_STANDALONE_GBE_COMPILER=true. The path of
the standalone compiler is /usr/local/lib/beignet by default, or can
be specified with STANDALONE_GBE_COMPILER_DIR.
Once USE_STANDALONE_GBE_COMPILER is given, none of the gbe compiler related
code is built any longer; only libcl.so and libgbeinterp.so are
built. libcl.so is specific to GEN_PCI_ID, which is queried from the
build machine or can be specified as a CMake option.
v2: separate the CMake option name.
update the commit comments.
add back the script for gen pci id, and build driver with it.
v3: add file FindStandaloneGbeCompiler.cmake to make the main cmakefile clean.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
This works around a bug we found with darktable. After this patch,
darktable works fine on HSW. Based on the test results, most
of the benchmarks are not affected much by this patch.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
|
|
On Gen8, we also need an extra pipe control after the
MEDIA_STATE_FLUSH.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
This could get more than 10x boost for some atomic stress workloads.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Emulator debugging showed that MI_FLUSH has been obsolete since
IVB/HSW and that beignet also used the wrong flush bit, so rather than
take the risk, remove it. The current kernel takes care of flushing the ring
after each request, so no extra flush should be needed.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
If the kernel is compiled in simd8 mode, the maximum work group
size should be 8 * 6 * 6 = 288. The original 512 is too large for it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
|
|
Increase the maximum memory allocation size to at least 512MB, and
set it larger if the system has more total memory.
This tweak lets darktable handle big pictures.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
v2:
reduce max constant buffer to 128MB.
v3:
fix the sysinfo usage.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Tested-by: "Meng, Mengmeng" <mengmeng.meng@intel.com>
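
A rough sketch of how such a limit could be derived from sysinfo() on Linux. The 512MB floor comes from the commit text; the "half of total RAM" policy and both function names are assumptions for illustration, not what the driver necessarily does.

```c
#include <stdint.h>
#include <sys/sysinfo.h>

/* Floor from the commit text: at least 512MB. */
#define MIN_MAX_ALLOC (512ull * 1024 * 1024)

/* Hypothetical policy: half of total RAM, but never below the floor. */
static uint64_t pick_max_alloc(uint64_t total_ram_bytes)
{
    uint64_t half = total_ram_bytes / 2;
    return half > MIN_MAX_ALLOC ? half : MIN_MAX_ALLOC;
}

/* Query total RAM with sysinfo(). totalram is expressed in units of
 * mem_unit bytes, so the two must be multiplied. */
static uint64_t query_max_alloc(void)
{
    struct sysinfo info;
    if (sysinfo(&info) != 0)
        return MIN_MAX_ALLOC;   /* fall back to the floor on error */
    return pick_max_alloc((uint64_t)info.totalram * info.mem_unit);
}
```

Forgetting the mem_unit factor is an easy mistake to make when using this interface.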
|
|
HSW has a restriction on PIPECONTROL with RO cache invalidation:
Prior to programming a PIPECONTROL command with any of the RO cache invalidation bits set,
program a PIPECONTROL flush command with the CS stall bit and HDC Flush bit set.
So two PIPECONTROL commands must be used to flush and invalidate the L3 cache on HSW.
This patch fixes some random failures in cases with very heavy DC read/write on HSW.
Signed-off-by: Yang, Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Remove our own ioctl call for the device id and use the libdrm interface instead.
This not only saves one extra ioctl call, as the id has already been read
when the gem bufmgr is initialized, but also allows overriding the device id with
the libdrm helper environment variable 'INTEL_DEVID_OVERRIDE'.
Combined with aub dump, this allows device debugging with the fulsim
emulator for any device you choose, without needing real hardware at
all.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Use the current libdrm interface to dump an aub file for debugging in the emulator.
This adds a new driver environment variable, OCL_DUMP_AUB=1, to enable it.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Remove the pretty old fulsim code, which seems to have no users,
uses interfaces that are not in open source libdrm, and calls
the Windows fulsim binary instead of the Linux one.
We will use the current libdrm interface instead.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
This patch reverts fb4bced99b7c08d0d43386abf33448860fb7fc41, as the spec
defines the minimum value of min_max_parameter_size as 1024.
BTI_MAX_NUM and btiBase could be 130 because of 128 images plus 1
const surface and 1 private surface.
v2: add BTI_MAX_READ_IMAGE_ARGS and BTI_MAX_WRITE_IMAGE_ARGS in the backend.
Change BTI_MAX_ID to 253. The number of images will be calculated in a
later patch and checked against its limit.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
This value should depend on the pointer size of the system.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Per the OpenCL spec, the minimum CL_DEVICE_IMAGE_MAX_BUFFER_SIZE is 65536,
which is too large for a 1D surface on Gen platforms.
So we have to use a 2D surface to implement it. As the OpenCL spec only allows
the image1d_t to be accessed via the default sampler, this is doable: it
will never use float coordinates and never use linear (non-nearest)
filters.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
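
The idea of backing a long 1D image with a 2D surface can be sketched as an index split. The row width of 512 texels and the names below are hypothetical, chosen only to make the arithmetic concrete; the commit does not state the actual layout.

```c
#include <stddef.h>

/* Hypothetical row width of the backing 2D surface, in texels. */
#define ROW_WIDTH 512u

typedef struct {
    size_t x;
    size_t y;
} coord2d;

/* Map an integer 1D texel index to a coordinate in the 2D surface. */
static coord2d image1d_index_to_2d(size_t index)
{
    coord2d c = { index % ROW_WIDTH, index / ROW_WIDTH };
    return c;
}

/* Number of rows needed to hold the given number of texels. */
static size_t rows_needed(size_t texels)
{
    return (texels + ROW_WIDTH - 1) / ROW_WIDTH;
}
```

This split is exact only because the coordinates are integers and no filtering happens across texels, which is the property the commit message relies on.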
|
|
Clang has had sampler_t support since LLVM 3.3, so let's switch to that type
rather than the old hacky way. One major problem is the static checking of
samplers. Gen platforms have some hardware restrictions, and if the
sampler value is a constant defined on the kernel side, we need to use the
value to optimize the code path. Now that sampler_t has become an opaque
type, Clang doesn't support any arithmetic operations on it.
So we have to introduce a new pass to do this optimization.
v2:
fix comments.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Accessing global memory with uchar16/char16 fully utilizes
memory bandwidth, so change CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR from
8 to 16. Three OpenCV cases speed up with this patch:
OCL_ThreshFixture_Threshold, 25% improvement
OCL_MaxFixture_Max, 105% improvement
OCL_MinFixture_Min, 105% improvement.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
When userptr is enabled, allocate page-aligned system memory for
CL_MEM_ALLOC_HOST_PTR inside the driver and wrap it as GPU memory
to avoid the copy between GPU and CPU.
Also refine the related user_ptr code.
Tests verified: beignet/utest, conformance/basic, buffers, mem_host_flags
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
1. Return the expected error code.
2. Don't destroy the cl_program object after a compile error, because it
may still be used in the future.
Signed-off-by: Yan Wang <yan.wang@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
To create a CL image from a bo name with an offset, the offset needs to
be added to surface_base_addr_lo/hi.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Tested-by: "Zhu, BingbingX" <bingbingx.zhu@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Beignet accepts a buffer object name to share data with libva.
It supports creating a CL image from the bo name with a non-zero
offset, but this does not work on some platforms.
The driver calls intel_bo_gem_create_from_name to retrieve the
dri_bo, and the offset of the dri_bo is changed by the non-zero offset.
On some platforms, the change of the offset has a side effect when
the kernel is executed again and intel_bo_gem_create_from_name
is therefore called a second time.
So, do not change the offset of the dri_bo; instead keep the non-zero
offset in cl_image, and carry it along until we write
the surface state into the batch buffer.
V2: correct the offset parameter passed to dri_bo_emit_reloc
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
|
|
To decide a kernel's work group size, an application should first query
CL_DEVICE_MAX_WORK_GROUP_SIZE, and then query CL_KERNEL_WORK_GROUP_SIZE
after clBuildProgram.
But some applications only check CL_DEVICE_MAX_WORK_GROUP_SIZE, and if the kernel runs
in simd8 mode, or for other reasons, they may exceed CL_KERNEL_WORK_GROUP_SIZE.
So change CL_DEVICE_MAX_WORK_GROUP_SIZE to the minimum CL_KERNEL_WORK_GROUP_SIZE.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
If cl_event_delete is called before the callback, the event will be deleted if
the application releases the event in the callback. So cl_event_delete must be moved to the end.
V2: V1 did not delete the event if it was not a user event; it needs to be deleted too.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
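
The ordering problem can be illustrated with a small reference-counted event. Everything below (the event struct, event_new, event_retain, event_release, event_complete, release_in_callback, and the freed_events counter) is a hypothetical stand-in for the real cl_event code; it only shows why the runtime's own reference must be dropped after the callback returns.

```c
#include <stdlib.h>

static int freed_events = 0;   /* counts frees, for demonstration only */

typedef struct event event;
typedef void (*event_cb)(event *e, void *user_data);

struct event {
    int refcount;
    event_cb callback;
    void *user_data;
};

static event *event_new(event_cb cb, void *user_data)
{
    event *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->refcount = 1;           /* the application's reference */
    e->callback = cb;
    e->user_data = user_data;
    return e;
}

static void event_retain(event *e) { e->refcount++; }

static void event_release(event *e)
{
    if (--e->refcount == 0) {
        free(e);
        freed_events++;
    }
}

/* Run the callback first, then drop the runtime's reference. If the
 * order were swapped, a release inside the callback would free the
 * event while the runtime is still using it. */
static void event_complete(event *e)
{
    if (e->callback != NULL)
        e->callback(e, e->user_data);
    event_release(e);
}

/* Example callback: counts the call and releases the application's
 * reference, as an application is allowed to do. */
static void release_in_callback(event *e, void *user_data)
{
    ++*(int *)user_data;
    event_release(e);
}
```

With one application reference and one runtime reference, event_complete leaves the event freed exactly once, after the callback has finished.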
|
|
TILING_Y performs better than TILING_X on BDW, but they are almost the same
on IVB/HSW. Use TILING_Y as the default tiling mode temporarily; we still need
to find the root cause of the different behavior between BDW and IVB/HSW.
V2: still use a static variable and only initialize it once.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
The height also needs to be aligned when CL_NO_TILING is used. This patch fixes some tiling_y errors.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
1. the wait logic is integrated into function cl_mem_map/unmap_auto
2. use cl_mem_map/unmap_auto for userptr inside clEnqueueRead/WriteBuffer
3. do not use cl_buffer_subdata for userptr, use cl_mem_map/memcpy instead
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Set the L3SQ General Priority Credit to max and the L3SQ High Priority Credit
to zero. This slightly improves performance, by about 2% on luxmark.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Master branch is for the next major release. 1.0.x series will be
maintained on Release_v1.0 branch.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
The cl_thread has a potential problem.
If threads are created and destroyed very quickly
while the queue remains available, the resources of a
destroyed thread will not be freed correctly and will
be wrongly reused by a thread created later.
V2:
Use an easy way to handle this case. We do not clear
the resources and just keep them, so a later thread will
not wrongly reuse them. The number of threads will not be
very large, so it is reasonable to clear all the
resources when the command queue is destroyed.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
If the buffer is a userptr buffer, we should copy it directly.
Otherwise, it fails in libdrm, as drm_intel_gem_bo_subdata() refuses
to read a userptr buffer object.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Guo, Yejun" <yejun.guo@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
As we still have the image 1d array workaround, we need to
fix it for BDW as well.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Junyan He <junyan.he@linux.intel.com>
|
|
This reverts commit f2c57a46de4f51fa5d4c8e02cc751fce7ff417c8.
|
|
To make the license statements consistent with each other, adjust
all license versions to v2.1+. Thus beignet should have a pure
LGPL v2.1+ license.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
We found that this patch caused some serious regressions. Considering it is not
part of the OCL standard API, we chose to revert it for the 1.0 release.
This reverts commit b6660fa343e4e80231123695834cc24e3fc5487b.
|
|
On some systems, the function aligned_alloc is not supported.
From Linux Programmer's Manual:
The function aligned_alloc() was added to glibc in version 2.16.
The function posix_memalign() is available since glibc 2.1.91.
V2: add check for return value of posix_memalign
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
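
A minimal sketch of an aligned allocation built on posix_memalign() with its return value checked, as V2 describes. The wrapper name alloc_aligned is hypothetical; posix_memalign itself is the standard POSIX call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns memory aligned to `alignment` bytes, or
 * NULL on failure. posix_memalign reports errors through its return
 * value (not errno), so that value is what gets checked. The alignment
 * must be a power of two and a multiple of sizeof(void *). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
}
```

Memory obtained this way is released with free(), the same as memory from aligned_alloc().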
|
|
TILING_Y performs better than TILING_X on BDW, but they are almost the same
on IVB/HSW. Use TILING_Y as the default tiling mode temporarily; we still need
to find the root cause of the different behavior between BDW and IVB/HSW.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Beignet accepts a buffer object name to share data with libva.
It is supposed to support creating a CL image from the bo name
with a non-zero offset, but this does not work on some platforms.
The driver calls intel_bo_gem_create_from_name to retrieve the
dri_bo, and the offset of the dri_bo is changed by the non-zero offset.
On some platforms, the change of the offset has a side effect when
the kernel is executed again and intel_bo_gem_create_from_name
is therefore called a second time.
So, do not change the offset of the dri_bo; instead keep the non-zero
offset in cl_image, and use it when we fill the
surface state.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Tested-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
Passing a binary program to clCompileProgram() should return
CL_INVALID_OPERATION.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|
|
The program should be deserialized and loaded when created from an
EXECUTABLE binary.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Zhigang Gong <zhigang.gong@linux.intel.com>
|