Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
We use SLM to store the value that will be broadcast
to the whole work group.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
This is a basic image test for uchar, ushort and uint.
v2:
Convert uint to float before doing the median filter and
use an intermediate variable in the if statement.
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
This is a basic buffer test for uchar, ushort and uint.
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Report FPS for the two benchmarks instead of GB/s.
v2:
Process 1000 frames instead of 100.
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Benchmarks report results in different units, e.g. GB/s, FPS, score and
so on, so we need to choose a unit for each benchmark.
Signed-off-by: Meng Mengmeng <mengmeng.meng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Move the printfs of PrintfParser into ir::Unit to make gbe
thread safe. The old static printfs could be cleared by another thread
when running multithreaded.
V2:
Rebase the patch
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
SIMD_WIDTH.
It makes sense to set CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to the
corresponding SIMD size. This gives Intel OCL applications a way
to get the SIMD width at runtime and makes some SIMD-width-dependent
optimizations possible.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
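
As an illustration of a SIMD-width-dependent optimization, an application could round its global work size up to the reported multiple. This is only a sketch: `round_up_to_multiple` is a hypothetical helper, and the multiple is assumed to have been obtained from `clGetKernelWorkGroupInfo` with `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`.

```c
#include <stddef.h>

/* Hypothetical helper: round global_size up to the next multiple of
   `multiple` (e.g. the SIMD width reported for the kernel). */
static size_t round_up_to_multiple(size_t global_size, size_t multiple)
{
    if (multiple == 0)
        return global_size;            /* nothing sensible to round to */
    size_t rem = global_size % multiple;
    return rem == 0 ? global_size : global_size + (multiple - rem);
}
```

With a reported multiple of 16, a request for 1000 work items would be padded to 1008; the kernel then has to guard against the extra items.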
|
|
1024 is somewhat too large for some kernels and may cause
them to fail to build due to a lack of scratch
space.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
We should not assert even if the application triggers an internal limitation
such as lack of scratch space. We should return an error to the application and
let the application decide what to do next.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
The range of the scratch size exceeds int16_t's maximum
value, so we have to extend these elements to 32 bits.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
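
A minimal sketch of the range problem, assuming nothing about the actual backend fields: `fits_in_int16` is a hypothetical helper, and the sizes in the usage note are illustrative rather than taken from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: can `bytes` be stored in an int16_t without loss? */
static bool fits_in_int16(int32_t bytes)
{
    return bytes >= INT16_MIN && bytes <= INT16_MAX;
}
```

Any value above 32767, for example an illustrative 64 KiB scratch size, fails this check, which is why a 32-bit element is needed.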
|
|
We do not have native half type support on X86 platforms.
The half math functions on the CPU side are only used in utests,
so we do not want to import software emulation code or add a
dependency on some math lib for half. We just use float to
calculate the reference value. This causes a diff between the
CPU results and the GPU results. We used a random func to generate the src
value, but when this src value is very close to pi or pi/2,
the truncation diff introduced by float -> half is magnified
a lot in the result of some math functions, e.g. sin, cos and tan.
We now just use a float table as src to fix this.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Gen9 has a different context to emit the BarrierInst, which contains a
wait instruction, and the wait instruction must not be predicated.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Use the wait function to add a debug function:
void debugwait(void)
This function hangs the gpu until the gpu is reset
or the host sends something to let it go.
EXTREMELY DANGEROUS for machines that have hangcheck turned off.
v2:
Fix some bugs, add setting of predicate and execwidth,
and modify some inst scheduling.
v3:
Add push and pop in instruction selection, and set nomask
with execwidth.
v4:
Fix barrier predicate setting bugs, and rebase the patch.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
There are 3 notification registers that can be used by wait, so we
should be able to choose which one we'd like to use.
The 3 registers are n0.0, n0.1 and n0.2, so also change
the function name.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Return CL_INVALID_CONTEXT if the context associated with
command_queue and events in event_wait_list are not the same.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Luo Xionghu <xionghu.luo@intel.com>
|
|
The mergeable field was defined as a uint32_t, but MAX_SRC_NUM is now
40, so we need at least a uint64_t.
Signed-off-by: Giuseppe Bilotta <giuseppe.bilotta@gmail.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
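
To illustrate why 40 sources do not fit in 32 bits, here is a small sketch of a one-bit-per-source mask. `set_mergeable` and `is_mergeable` are hypothetical helpers written for this note, not the backend's actual code; only the value 40 comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the commit message; the real definition lives elsewhere. */
#define MAX_SRC_NUM 40

/* Hypothetical helper: mark source `idx` as mergeable in a 64-bit mask. */
static inline uint64_t set_mergeable(uint64_t mask, unsigned idx)
{
    /* UINT64_C(1) keeps the shift in 64 bits; a plain `1 << idx` is
       evaluated as int and is undefined once idx reaches the width of int. */
    return mask | (UINT64_C(1) << idx);
}

/* Hypothetical helper: test whether source `idx` is marked. */
static inline bool is_mergeable(uint64_t mask, unsigned idx)
{
    return ((mask >> idx) & 1u) != 0;
}
```

Setting bit `MAX_SRC_NUM - 1` (bit 39) lands entirely in the upper half of the mask, so truncating the result to 32 bits would silently drop it.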
|
|
The work dim information related registers and timestamp registers
are always needed in curbe. We need to set the correct life interval
for them.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
The offset 0 of the profiling buffer contains the log number.
We will use atomic instruction to inc it every time a log
is generated.
We will generate one log for each HW gpu thread. The log
contains the XYZ range of global work items which are executed
on this thread, the EU id, the Sub Slice id, thread number,
and 20 points' timestamp which we are interested in.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
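
The counter at offset 0 can be modelled on the host with C11 atomics. This is only an analogy for the GPU atomic instruction described above: `reserve_log_index` is a hypothetical helper, and the actual buffer layout beyond "offset 0 holds the log number" is not assumed here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical host-side model: bump the log counter and return the
   index of the log entry the caller may now fill in. */
static uint32_t reserve_log_index(atomic_uint *log_count)
{
    /* atomic_fetch_add returns the value before the increment, so each
       caller gets a distinct index even when calls race. */
    return (uint32_t)atomic_fetch_add(log_count, 1u);
}
```

After all threads have finished, the counter equals the number of logs written, which is what a reader of the buffer needs in order to know how many entries to parse.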
|
|
We will maintain a real clock to record the real execution time
of the original code. We do not want the added profiling
instructions to introduce overhead, so every time
we enter a profiling instruction block, we calculate the
real time clock value and update the real clock, and when we leave
the profiling instruction block, we record the time
stamp of that leave point.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
The timestamps are calculated using the Long type. Before BDW,
there is no Long type support, so we use i32 operations
to implement them.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
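
The general technique of building 64-bit arithmetic from 32-bit operations can be sketched in C as follows. This is not the generated GPU code: `u64_pair`, `add64_via_u32` and `sub64_via_u32` are hypothetical names used only to show the carry and borrow handling.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* 64-bit addition using only 32-bit operations. */
static u64_pair add64_via_u32(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                      /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u; /* wrapped iff sum < operand */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* 64-bit subtraction (a - b) using only 32-bit operations. */
static u64_pair sub64_via_u32(u64_pair a, u64_pair b)
{
    u64_pair r;
    uint32_t borrow = (a.lo < b.lo) ? 1u : 0u; /* low half underflows */
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - borrow;
    return r;
}
```

Subtraction is the operation a timestamp delta needs; the borrow from the low half is what makes a naive per-half subtraction wrong when the low word has wrapped between two samples.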
|
|
We do not want CALC_TIMESTAMP and STORE_PROFILING to be scheduled
with other instructions, because it will get the wrong timestamps.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
1. Rename __gen_ocl_timestamp_buf to __gen_ocl_profiling_buf.
2. printfbptr, printfiptr and profilingbptr should be 64 bits
on BDW and later platforms, so just set them to QWORD.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
The profilingProlog will collect useful information
for profiling, including XYZ global range and prolog
timestamp.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
We add OCL_PROFILING_LOG as an int type, because there may be
different types of profiling format in the future.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
When profiling, the profiling inserter function will
insert calc_timestamp for each point which we are interested
in. At the end of the kernel, just before return, we will
insert a store_profiling function call. The function will
hold a reference to the global val profiling_buf and keep
it from being released when optimization passes run.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
After the lowering return pass, a new block which just
has one RET instruction will be generated, and all RET
INSTs in the middle will be replaced by BRA INST.
We want our store_profiling instruction to be inserted
just before that return instruction and out of any
condition blocks. So we postpone the STORE_PROFILING
here.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
The Unit will hold profiling information. The profiling
information may be needed throughout the whole backend
processing, so it is suitable to add it to unit.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Add five timestamp registers and one pointer register
into curbe. The five timestamp registers will hold
all the information of profiling timestamps, including
20 uint timestamps for each point, 1 ulong prolog holding
the start time and 1 ulong epilog holding the
end time of that kernel. The pointer register will hold
the log buffer address.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
When the user enables the profiling feature, we need to insert
extra instructions to record and store the timestamps.
For now, the function pass will just insert the required
instructions at the head of the first 20 blocks. Later, we
will support inserting timestamps at any point in the code.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Signed-off-by: Bai Yannan <yannan.bai@intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
Add two instructions for profiling usage. CalcTimestamp will
calculate the timestamps and update the timestamp in the
corresponding slot. StoreProfiling will store the information
to the buffer and generate logs.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
ProfilingInfo will play an important role in outputting
the profiling log. It will record the profiling
information and generate the logs after clFinish.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: Yang Rong <rong.r.yang@intel.com>
|
|
This is to fix a build error when a new Intel extension is added
to beignet. The current cmake rule uses the old system-installed
CL headers instead of beignet's own, which leads to a compile failure
because the new extension definition is not found. So this simply
always prefers beignet's CL headers in build/install.
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
clock_gettime causes a linkage error on some
versions of GCC, so we need to add -lrt at the end of the
link command line.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
The following items are supported in this commit:
1. Return residuals.
2. All types of mb_block_type, subpixel_mode, sad_adjust_mode in
cl_motion_estimation_desc_intel.
After this commit, cl_intel_motion_estimation is fully supported.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
LLVM 3.7 may generate cast instructions such as "%13 = uitofp i1 %12 to float".
When the dst type is float or double, we should call the corresponding
newXXXimmediate function.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
define a MACRO to hold the value.
v2: use same MACRO in cl_extensions.h; add header file protection for
cl_extension.h.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
v3:
Fix two typos.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
If the CL device does not support this builtin kernel, the test returns
PASS.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|
|
v2:
1. Just upload the first vme_state.
2. Remove duplicated code in check_opt1_extension.
3. Check image format before cl_gpgpu_bind_image_for_vme.
4. Fix an error in getting mv. Since we assume this kernel runs in SIMD16
mode, dword 0 of grf 1 should be
__gen_ocl_region(8,vme_result.s0), not
__gen_ocl_region(0,vme_result.s1).
v3:
Return CL_IMAGE_FORMAT_NOT_SUPPORTED if image format is not the required
one.
v4:
Fix two conflicts after code rebase and work around a curbe related bug.
v6:
Treat simd8 and simd16 differently when getting mv.
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>
Reviewed-by: Ruiling Song <ruiling.song@intel.com>
|