Age | Commit message (Collapse) | Author | Files | Lines |
|
Finding an instruction scheduling policy that achieves the theoretical minimum
number of registers required in a basic block is an NP-hard problem, so we have
to use a heuristic to simplify the algorithm. Many studies indicate that
bottom-up list scheduling is much better than the top-down method in terms of
register pressure. I chose one such research paper as our target:
"Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
It uses bottom-up list scheduling with a Sethi-Ullman label as the
heuristic number. As we will do cycle-aware scheduling after register
allocation, we don't need to bother with cycle-related heuristics here,
so I skipped the EST computation and usage part of the algorithm.
It turns out this algorithm works well. It reduces the register spilling
in clBlas's sgemmBlock kernel from 83+ to only 20.
This scheduling method seems to lower the ILP (instruction-level parallelism),
but that is not a big issue: the following register allocation stage allocates
as many different registers as possible, and a post-allocation instruction
scheduling pass then tries to recover as much ILP as possible.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
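As a rough illustration of the Sethi-Ullman label mentioned above, here is a minimal sketch for a plain binary expression tree. The Node type and the helper names are hypothetical and not taken from Beignet, every leaf is assumed to cost one register, and the scheduler in the paper works on a dependence DAG rather than a tree, so this shows only the labelling idea.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical expression-tree node (not a Beignet type).
// A node with no children is a leaf.
struct Node {
  std::unique_ptr<Node> left, right;
};

// Simplified Sethi-Ullman label: an estimate of how many registers are
// needed to evaluate the subtree without spilling. Assumption: each leaf
// needs one register.
int sethiUllman(const Node *n) {
  assert(n != nullptr);
  if (!n->left && !n->right) return 1;                   // leaf
  if (!n->right) return sethiUllman(n->left.get());      // unary node
  if (!n->left) return sethiUllman(n->right.get());      // unary node
  const int l = sethiUllman(n->left.get());
  const int r = sethiUllman(n->right.get());
  // Equal needs: one extra register holds the first result while the
  // second subtree is evaluated. Otherwise evaluate the larger side first.
  return l == r ? l + 1 : std::max(l, r);
}

// Small helpers to build trees for experiments.
std::unique_ptr<Node> leaf() { return std::unique_ptr<Node>(new Node()); }

std::unique_ptr<Node> op(std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
  std::unique_ptr<Node> n(new Node());
  n->left = std::move(l);
  n->right = std::move(r);
  return n;
}
```

A bottom-up list scheduler can use such a label as a priority when picking the next instruction, which is the role the commit message describes for it.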
|
|
This is a long-standing bug that was exposed by my latest register
allocation refinement patchset. ir::ocl::zero and ir::ocl::one are
global registers, so we have to compute their liveness information carefully
rather than just getting a local interval ID.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Using liveness information, we can allocate them only
on demand, and they can be treated as non-curbe-payload
registers.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Btiutil should be just a normal temporary register that is only
alive for those specific load/store instructions with mixed
BTI used.
Although btiutil only takes one DW of register space, in
practice it may waste an entire 32-byte register
because it has a very long live range.
This patch fixes this issue completely.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
v2:
simplify the logic in function.hpp. Let the user
prepare the correct start and end points. Fix the incorrect
start/end point for the case of one forward jump and one backward
jump.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
The major motivation is to normalize the curbe payload's
allocation and to prepare for using liveness information
to avoid unnecessary payload register allocation and to avoid
fragmentation when allocating curbe registers. For example,
many one-dimensional kernels don't need
GBE_CURBE_LOCAL_ID_Y/Z. But the previous curbe allocation
occurred before the liveness interval computation, so it
allocated that curbe anyway. Although it expired
soon, we still needed to prepare that payload on the
host side. After this patch, this type of overhead
is eliminated.
Another purpose is to eliminate the ugly curbe patch list
handling in the backend. After this patch, the curbe register
handling is much cleaner than before.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
If reservedSpillRegs is not zero, it indicates we are under
very high register pressure. Using a register vector will likely
increase that pressure and cause a significant performance
problem, which is much worse than using a short-lived temporary
vector register with several additional MOVs.
So let's simply avoid using vector registers and just use a
temporary vector with a short live interval.
v2:
remove out-of-date comments.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
As we want to avoid updating liveness all the time, we maintain the liveness
information dynamically during the phi mov optimization. Removing self-copy
instructions brings unnecessary complexity here. Let's avoid doing that here and do
the self-copy removal later in removeMOVs().
v2:
forgot to remove the incorrect liveness checking for special registers.
Now remove it.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
More aggressive interference check: even if both registers are in the
Livein set or the Liveout set, they may still not interfere
with each other.
v2:
The Liveout interference check needs to take care of those BBs in which only one
of the two registers is defined.
For example:
BBn:
...
MOV %r1, %src
...
Both %r1 and %r2 are in BBn's liveout set, but %r2 is not defined or used
in BBn. The previous implementation ignored this BB, which is incorrect. As %r1
was modified to a different value, %r1 cannot be replaced with %r2
in this case.
v3:
Add comments and assertions to restrict the usage of the interference
check functions of the DAG class.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
analysis.
The previous phi mov optimization tried to merge the phi copy source register
and the phi copy register if the phi copy source register was a normal SSA value.
But in some cases, many phi copy source registers are also phi copy values, which
have multiple definitions. They can all be reduced to one phi copy register
if there is no interference in any BB. This patch, together with the previous patches,
reduces the total number of spilled registers from 200+ to only 70 for an SGEMM kernel,
and performance improves by about 10 times.
v2:
Add one FIXME tag to indicate one more optimization opportunity missed in the current
implementation. It could be solved in the future.
v3:
Disable the postPhi mov optimization for now, as there is a liveness bug
that needs to be fixed.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
These helper functions will be used in further phi mov optimization.
v2:
remove the useless debug message code.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
We don't need to recompute the entire liveness information for
all cases. This is a preparation patch for further phi copy
optimization.
v2:
also need to update varKill set.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
Only in the Gen backend stage do we need to take care of the
special extra liveout and uniform analysis. In the IR stage,
we don't need to handle them.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
|
|
If the PHI source register's definition instruction uses the
phi register, it is not an interference. For example:
MOV %phi, %phicopy
...
ADD %phiSrcDef, %phi, tmp
...
MOV %phicopy, %phiSrcDef
...
%phi and %phiSrcDef do not interfere with each other.
Simply advancing the start of the check to the next instruction is
enough to get a better result. In some special cases, this patch
gives a significant performance boost.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
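The idea of starting the check at the next instruction can be sketched as follows. This is only an illustration under simplifying assumptions: a single basic block, registers identified by name, and no liveout handling. The Insn struct and the interferes() function are hypothetical and do not reflect the Beignet implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified instruction: one destination and some sources.
struct Insn {
  std::string dst;
  std::vector<std::string> srcs;
};

// Returns true if `phi` is still read after `srcDef` has been defined.
// The scan starts at the instruction AFTER the definition of srcDef, so a
// read of phi by the defining instruction itself is not counted.
bool interferes(const std::vector<Insn> &bb, const std::string &phi,
                const std::string &srcDef) {
  std::size_t def = 0;
  while (def < bb.size() && bb[def].dst != srcDef) ++def;
  assert(def < bb.size() && "srcDef must be defined in this block");
  for (std::size_t i = def + 1; i < bb.size(); ++i) {
    // Sources are read before the destination is written.
    for (const std::string &s : bb[i].srcs)
      if (s == phi) return true;
    if (bb[i].dst == phi) return false;  // phi is overwritten, old value dead
  }
  return false;
}
```

With the three instructions from the example above, the scan begins at the final MOV, finds no later read of %phi, and reports no interference.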
|
|
It is a drm-related bug. As the drm driver changed the time at which it frees its test
userptr to bufmgr destroy (30921483c70c6939f017476eac13da6aa26b3b3c), we need
another order for releasing our driver to make sure the test userptr can be freed
with a valid fd.
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Fix to calculate the current CPU monotonic raw timestamp in nanoseconds
for enqueued, submitted, start and finished, and return it to the application
based on the queried parameter.
Signed-off-by: Midhun Kodiyath <midhunchandra.kodiyath@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
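For illustration, a monotonic raw timestamp in nanoseconds can be read on Linux roughly as below. The commit message does not say which call is used, so the choice of clock_gettime() with CLOCK_MONOTONIC_RAW and the helper name monotonicRawNs() are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Hypothetical helper: current monotonic raw time in nanoseconds.
// CLOCK_MONOTONIC_RAW is Linux-specific.
std::uint64_t monotonicRawNs() {
  struct timespec ts;
  const int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  assert(rc == 0);
  (void)rc;  // silence the unused-variable warning when NDEBUG is defined
  return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ull +
         static_cast<std::uint64_t>(ts.tv_nsec);
}
```

A runtime could sample such a helper at each event state change and return the stored value when the corresponding profiling parameter is queried.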
|
|
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
as the long type data layout is not contiguous on platforms gen7/gen75,
the indirect address access pattern is a bit different from gen8.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
if the source is uniform and dst is non-uniform, no need to add the
indirect address index.
v2: missing a uniform check in gen8 context UD bswap.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
v2: check cl_khr_image2d_from_buffer support first;
use CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT to allocate memory.
v3: fix clGetDeviceInfo use.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
|
|
this patch allows creating a 2d image from a cl buffer with zero copy.
v2: use reference counting to manage the release of the buffer and image.
After being created, the buffer reference count is 2, and the image reference
count is 1.
if the image is released first, decrease both the image reference count and
the buffer reference count, and release the bo when the buffer is finally
released;
if the buffer is released first, decrease the buffer reference count only, and
release the buffer when the image is released.
add CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT in cl_device_info.
v3: move is_image_from_buffer to _cl_mem_image; return
CL_INVALID_IMAGE_SIZE if image size is larger than the buffer.
v4: pitchalignment set to 2.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
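The reference counting rules listed under v2 can be sketched with two plain structs. The Buffer and Image types and the function names here are hypothetical stand-ins, not the Beignet cl_mem structures, and the boFreed flag only marks the point where the underlying bo would be released.

```cpp
#include <cassert>

// Hypothetical stand-ins for the buffer and image objects.
struct Buffer {
  int ref = 1;           // one reference held by the application
  bool boFreed = false;  // set when the underlying bo would be released
};

struct Image {
  int ref = 1;
  Buffer *buffer = nullptr;  // the buffer this image was created from
};

void releaseBuffer(Buffer &b) {
  assert(b.ref > 0);
  if (--b.ref == 0) b.boFreed = true;
}

// Creating the image takes an extra reference on the buffer,
// so the buffer count becomes 2 and the image count is 1.
Image createImageFromBuffer(Buffer &b) {
  ++b.ref;
  Image img;
  img.buffer = &b;
  return img;
}

// Releasing the image drops its own count and, once that reaches zero,
// the reference it holds on the buffer.
void releaseImage(Image &img) {
  assert(img.ref > 0);
  if (--img.ref == 0) releaseBuffer(*img.buffer);
}
```

In either release order the bo is freed only after both the image and the application's buffer reference are gone.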
|
|
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
|
|
catch the error: out of host memory.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
let's just keep things simple.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
suboffset() will not set .subnr correctly, as vec1() will get a horizontal
stride 0 register.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Either all programs or none of the programs specified by input_programs must contain a compiled binary or library
for the device. Otherwise return CL_INVALID_OPERATION.
Correct this condition check.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Luo, Xionghu <xionghu.luo@intel.com>
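The all-or-none condition can be expressed as a small predicate. This is a sketch only: the function name and the vector-of-flags input are hypothetical, and a real implementation would inspect each cl_program for the given device instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hasBinary[i] is true when input program i contains a compiled binary or
// library for the device. Returns true when the inputs are consistent
// (all or none); a false result corresponds to CL_INVALID_OPERATION.
bool inputProgramsConsistent(const std::vector<bool> &hasBinary) {
  assert(!hasBinary.empty());  // an empty input list is a separate error
  std::size_t with = 0;
  for (bool b : hasBinary)
    if (b) ++with;
  return with == 0 || with == hasBinary.size();
}
```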
|
|
cl_buffer_get_subdata is sometimes extremely slow in the Linux kernel on skl and chv,
and the slowdown is random. So temporarily disable it and use map/copy/unmap to read.
It should be re-enabled once the root cause is found.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Luo, Xionghu <xionghu.luo@intel.com>
|
|
1. return CL_INVALID_LINKER_OPTIONS when the options are invalid, using clang to check the options.
2. return CL_INVALID_OPERATION when the binary types are not the same.
3. When the link fails, CL_LINK_PROGRAM_FAILURE was not returned; fix it.
4. Should not delete the program in genProgramBuildFromLLVM; the program is allocated with new and deleted by the runtime.
Signed-off-by: Yang Rong <rong.r.yang@intel.com>
Reviewed-by: Luo, Xionghu <xionghu.luo@intel.com>
|
|
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
LLVM provides a powerful string remapping feature which can be used
to map a string to an input file name, so we don't need to create
a temporary cl source file any more.
This patch not only makes things much clearer and avoids the unnecessary
file creation, it also fixes some weird directory-related problems.
Beignet created the temporary file in the /tmp directory, so
clang searched for include files in that directory by
default, but the developer expects it to search the working directory
first. This caused two weird things:
1. If a .cl file includes a .h file in the current directory, beignet
will not find it.
2. Even if the program adds a "-I." option manually, beignet will search /tmp
first, and if there is a .h file in /tmp/ with the exact same file
name, beignet will use the file located in /tmp.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Luo, Xionghu <xionghu.luo@intel.com>
|
|
This patch adds 2 new tests to the unit tests. It uses the existing
framework and data structures and tests the llvm/asm dump generation
when these flags (-dump-opt-llvm, -dump-opt-asm) are passed as build
options along with the dump file names.
Methods added:
1) get_build_llvm_info() tests LLVM dump generation
2) get_build_asm_info() tests ASM dump generation
Signed-off-by: Sirisha Gandikota <sirisha.gandikota@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
|
|
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
There is no NULL pointer check for kernel->program->build_opts.
This causes the utest test_get_arg_info to crash.
In fact, we add the -cl-kernel-arg-info flag for compiling
every time, so the arg info is always available.
But some test cases deliberately unset this flag and expect the error
return value, so we really need a check here.
Signed-off-by: Junyan He <junyan.he@linux.intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
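A guarded lookup of the flag might look like the sketch below. The helper name hasKernelArgInfo() is hypothetical; only the option string -cl-kernel-arg-info comes from the commit message.

```cpp
#include <cassert>
#include <cstring>

// Returns true only when build options are present and contain the
// -cl-kernel-arg-info flag. A NULL options string is treated as "flag not
// set" instead of being passed to strstr(), which would crash.
bool hasKernelArgInfo(const char *buildOpts) {
  return buildOpts != nullptr &&
         std::strstr(buildOpts, "-cl-kernel-arg-info") != nullptr;
}
```

A caller can then return the expected error code when the function reports false, rather than dereferencing a NULL build_opts pointer.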
|
|
1.Change the code for null param_value
2.Add the return value check for build option "-cl-kernel-arg-info"
3.Correct one return value typo
Signed-off-by: Pan Xiuli <xiuli.pan@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
ENDIF should be treated as barrier-like instruction
in instruction scheduling.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: Luo, Xionghu <xionghu.luo@intel.com>
|
|
Need to take care of the uniform cases.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
We need to test reading and writing of large image 1d buffers.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
We should treat it as a 2D image, as an image 1d buffer may
exceed the 1D image size restriction.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
originally, the dst of simd_shuffle is not uniform, but if it is
optimized as scalar, just use simd_width=1 to generate sel_op/asm
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
initialize the data inside kernel with packed integer vector
V2: call functions from ctx, instead of ctx.registerAllocator
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Yang, Rong R" <rong.r.yang@intel.com>
|
|
clBuildProgram OpenCL API
This is a resubmission of the patch with support for LLVM 3.4
Allows the user to request a dump of the LLVM-generated IR to the file
specified in [PATH] through clCompileProgram options
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Reviewed-by: Guo, Yejun <yejun.guo@intel.com>
|
|
this could fix the bug: https://bugs.freedesktop.org/show_bug.cgi?id=90472
v2: the vector "slots" stores pointers to the PrintfSlot elements of vector "fmts",
but a push_back on "fmts" will cause a reallocation if the capacity is not
enough and call the copy constructor and destructor of each PrintfSlot,
leaving an invalid pointer in "slots", so this patch changes it to store the
value instead of a pointer.
update the destructor of PrintfSlot according to the SLOT_TYPE.
Signed-off-by: Luo Xionghu <xionghu.luo@intel.com>
Reviewed-by: Junyan He <junyan.he@inbox.com>
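The pointer invalidation described in v2 is a general property of std::vector: growing the vector may move its elements, so addresses taken earlier become invalid. A minimal sketch of the value-storing approach follows. The PrintfSlot layout and the PrintfSet wrapper are simplified, hypothetical versions, not the Beignet classes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical slot: a type tag plus some text.
struct PrintfSlot {
  int type;
  std::string text;
};

struct PrintfSet {
  std::vector<PrintfSlot> fmts;
  // Stores copies rather than PrintfSlot* into fmts. A pointer into fmts
  // would dangle as soon as a later push_back reallocates the storage.
  std::vector<PrintfSlot> slots;

  void append(const PrintfSlot &s) {
    fmts.push_back(s);             // may reallocate and move every element
    slots.push_back(fmts.back());  // keep an independent copy
  }
};
```

Storing an index into fmts would be another way to stay valid across reallocation. The commit itself chose to store the value.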
|
|
llvm 3.3 has a different constructor for llvm::raw_fd_ostream
V2: refine the code
Signed-off-by: Guo Yejun <yejun.guo@intel.com>
Reviewed-by: "Song, Ruiling" <ruiling.song@intel.com>
|
|
Open the file specified for the ASM dump and write the assembly to it.
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Signed-off-by: Laura Ekstrand <laura.d.ekstrand@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
|
|
Part of the plumbing that passes the ASM file name from the compiler options
level down to the emitCode level so that the assembly can be written to that
file.
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Signed-off-by: Laura Ekstrand <laura.d.ekstrand@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
|
|
Part of the plumbing that passes the ASM file name from the compiler options
level down to the emitCode level so that the assembly can be written to that
file.
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Signed-off-by: Laura Ekstrand <laura.d.ekstrand@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
|
|
Part of the plumbing that passes the ASM file name from the compiler options
level down to the emitCode level so that the assembly can be written to that
file
Signed-off-by: Manasi Navare <manasi.d.navare@intel.com>
Signed-off-by: Laura Ekstrand <laura.d.ekstrand@intel.com>
Reviewed-by: Song, Ruiling <ruiling.song@intel.com>
|