docs/optimization-guide.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106

Optimization Guide
====================

All the SIMD optimization principles such as avoid branching and don't waste 
SIMD lanes are also applied to Beignet optimization on Gen platform. Furthermore,
there are some special tips for Beignet optimization.

1. It is recommended to choose multiple of 16 work group size. Too much SLM usage may reduce parallelism at group level. 
   If kernel uses large amount SLM, it's better to choose large work group size. Please refer the following table for recommendations
   with some SLM usage.

| Amount of SLM | 0  | 4K | 8K  | 16K | 32K |  
| WorkGroup size| 16 | 64 | 128 | 256 | 512 |  

   Actually, a good method is to pass in a NULL local work size parameter to let the driver to determine the best work group size for you.

1. Use shorter data type could get better performance. There are also some detail tips as below.
  1. Use uchar16/ushort8/uint4 as much as possible.
  1. If the data has to be DWORD(4 bytes) unaligned, it's better to use vload16(for char), vload8(for short) to load the data.
  1. Read/write scalar char/short will be very slow and may lead to be even worse performance than use DW type.

1. Avoid too strided global/constant memory access.
  
  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
  `uint x = data[get_global_id(0)];      //best, only read one cache line, no bandwidth waste`  
  `uint x = data[get_global_id(0) + 1]; //bad, cross 2 cache lines, not good, waste half of the bandwidth`  
  `uint x = data[get_global_id(0) * 16]; //worst, cross 16 cache lines, waste 15/16 bandwidth.`  

1. Avoid dynamic indexed private buffer if possible.
  Currently, private buffer access in beignet backend is very slow. Many small private buffer could be optimized by the compiler.
  But the following type of dynamic indexed private buffer could not be optimized:

`
     uint private_buffer[32];
     for (i = 0; i < xid; i++)    {
       int dynamic_idx = src[xid];
       private_buffer[dynamic_idx % 10] = src1[xid];
       ...
     }
`

   The following case is OK.

`
     ...
     uint private_buffer[32];
     for (i = 0; i < xid; i++)    {
       private_buffer[xid % 32] = src1[xid];
       ...
     }
`


1. Use SLM to reduce the memory bandwidth requirement if possible.
  
  On Gen platform, SLM is in GPU's L3 cache, if it could be used to  
  share data between work items, it could reduce the memory bandwidth  
  on the system memory bus. This will be a big win for many I/O intensity  
  kernels.

1. Avoid SLM bank conflicts.

  SLM is banked at a DWORD granularity, totally 16 banks. Access on the same
  bank with different addresses will lead to a conflicts. It should be avoided.
  The worst case is:

  Some examples are as below (assume the data is a cache line aligned global/constant uint buffer, and the work group size is 16 with SIMD16 mode):  
  `uint x = data[get_global_id(0)];      //best, no bank conflicts, no bandwidth waste`  
  `uint x = data[get_global_id(0) + 1]; //best, no bank conflicts, no bandwidth waste`  
  `uint x = data[get_global_id(0) * 2]; //bad, work item (id) and (id + 8) conflict to each other, waste half of the bandwidth`  
  `uint x = data[get_global_id(0) * 16]; //worst, all work items conflicts on the zero bank, waste 15/16 bandwidth.`  

1. Zero copy on buffer creation. (Only avaliable in git master branch and Release\_v1.0 branch).

  Use CL\_MEM\_USE\_HOST\_PTR to create buffer, and pass in a page  
  aligned host pointer which also has multiple page size. Beignet  
  will leverage userptr to create a buffer object by using that  
  host buffer directly. If possible, you can also use CL\_MEM\_ALLOC\_HOST\_PTR  
  flag to let the driver to allocate a userptr qualified buffer which could  
  guarantee zero copy on the buffer.

  Please be noted, this feature requires the kernel is newer than 3.16 and the libdrm version is newer than 2.4.57.

1. Use float data type as much as possible.

  The two ALUs of one EU could both handle float data,but only one of them could handle non-float type data.

1. Avoid using long.

  GEN7 and Gen7.5 doesn't support long natively. And Gen8's native long support is still under development.

1. Declare small constant buffer with content in the kernel if possible.

  For a small constant buffer, it's better to declare it in the kernel directly with "const \_\_constant". The compiler may optimize it if the buffer is defined inside kernel.

1. Avoid unnecessary synchronizations.

  Both in the runtime and in the kernel. For examples, clFinish and clWaitForEvents in runtime and barrier() in the kernel.

1. Consider native version of math built-ins, such as native\_sin, native\_cos, if your kernel is not precision sensitive.

1. Use fma()/mad() as much as possible.

1. Try to eliminate branching as much as possible.

  For example using min, max, clamp or select built-ins instead of if/else if possible.