# Shared Memory Transport for XFree86

Rickard E. Faith, Precision Insight, Inc. 

$Date: 2000/03/01 21:09:09 $, $Revision: 1.9 $ 

Shared Memory Transport (SMT) is a mechanism for using shared memory to communicate X protocol information between the client application and the X server. This paper reviews existing SMT implementations, defines design criteria for SMT in XFree86, outlines a staged implementation of SMT for XFree86, and analyzes the performance of this implementation. On workstations, SMT has historically provided a significant improvement in performance. However, on modern workstations and on modern PC-class hardware, SMT improves overall performance by less than 10%. On modern hardware, the performance of the host CPU (including the X server and operating system implementations) is well-matched to the performance of the graphics hardware. Because of this, the performance of the typical X operation is almost completely limited by the performance of the graphics hardware, and the improvement in transport speed provided by SMT cannot provide large gains in rendering performance. Because of these observations, I do not recommend devoting more engineering time to the active improvement of the current SMT implementation for XFree86. 

[[!toc ]] 


## Preamble


### Copyright

Copyright © 2000 by Precision Insight, Inc., Cedar Park, Texas. 

Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies. 


### Trademarks

Unix is a registered trademark of The Open Group. The 'X' device and X Window System are trademarks of The Open Group. XFree86 is a trademark of The XFree86 Project. Linux is a registered trademark of Linus Torvalds. Intel is a trademark of Intel Corporation. SGI and Indigo2 High Impact are trademarks of Silicon Graphics, Inc. HP is a registered trademark of Hewlett-Packard Company. All other trademarks mentioned are the property of their respective owners. 


## Introduction


### X11 Transport

X11 supports various methods for transporting information between the client application and the X server. For example, if the client and server are on different Unix machines, INET Domain Socket transport (i.e., via a common TCP/IP socket) might be used for the connection. Machines that support DECnet might use an alternative DECnet-based transport. 

If the client and server are on the same machine, however, Unix Domain Socket (UDS) transport (i.e., via a named pipe) may have lower overhead than a socket-based transport (because of lower operating-system implementation overhead). If it is more efficient, an X server will use UDS transport when the client and server are on the same machine. Some vendors may have special kernel-level interfaces which are even more efficient than UDS. 


### Shared-Memory Transport

Shared memory is memory that can be accessed by more than one process. For Unix-like operating systems, shared memory is usually managed via the System V shm system calls. Although early BSD systems did not have these calls, most modern Unix-like operating systems have implemented them in a portable fashion. 
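
For readers unfamiliar with the System V interface, the following minimal sketch shows the `shmget`, `shmat`, `shmdt`, and `shmctl` calls mentioned above. It is purely illustrative and is not taken from any X server or Xlib source.

    /* Minimal System V shared-memory example (illustration only). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create a private 64kB segment, readable and writable by the owner. */
        int shmid = shmget(IPC_PRIVATE, 64 * 1024, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        /* Map the segment into this process's address space. */
        char *buf = shmat(shmid, NULL, 0);
        if (buf == (char *)-1) { perror("shmat"); return 1; }

        strcpy(buf, "hello from shared memory");
        printf("%s\n", buf);

        /* Detach, then mark the segment for removal so the system reclaims it. */
        shmdt(buf);
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }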

Shared-memory transport (SMT) is an enhancement (or alternative) to pipe-based transport that uses shared memory in lieu of the pipe for the transport of X11 protocol between the client and the X server. XFree86 does not currently have a shared memory transport implementation because one was never included in the X11 sample server upon which XFree86 is based, and no one has subsequently donated an implementation. However, some vendors (e.g., DEC, HP, SGI, Sun) have implemented proprietary shared-memory transports that use shared-memory segments to communicate some or all of the X11 protocol data between the client and the server. 

The next section discusses key design issues regarding SMT. This discussion is followed by a review of current shared-memory transports. 


### Design Issues

Pipes (and local sockets) are implemented in the operating system kernel as FIFO buffers. The writer makes a call that causes the kernel to copy data from a user-space buffer into a kernel-space buffer. The reader makes another call that causes the kernel to copy the data from the kernel-space buffer into another user-space buffer. Therefore, two memory-to-memory copies are involved. When using shared memory, the client's user-space buffer and the server's user-space buffer can be the same memory region. Hence, one of the potential performance gains obtained by using SMT is the elimination of two memory-to-memory copies. Other issues that complicate the design will be discussed below. 


#### Synchronization

The most basic complicating issue is that of synchronization: when a process writes to a pipe that is full, the kernel puts the process to sleep until the reader has read part of the kernel-side buffer. Similarly, when a reader reads from an empty pipe, the reader is put to sleep until the writer writes more data. The X server and client both put the pipes in non-blocking mode so that they can perform other processing (e.g., reading from one pipe while waiting to write to another) while using `select(2)` or `poll(2)` to determine when the pipe is ready. 

In marked contrast, a shared memory segment is written and read from user-space and does not provide any sort of kernel-mediated synchronization. So, any SMT design must solve the synchronization problem. As we will see later in this paper, this problem is substantial. 
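
As a point of comparison, the following sketch shows the kernel-mediated pattern just described for a non-blocking pipe; it is a simplification of what the X server and Xlib actually do (both multiplex many descriptors).

    /* Sketch of the synchronization that pipes provide for free: the
     * reader blocks in select(2) until the writer has made data
     * available.  A shared-memory segment offers no analogous
     * kernel-mediated mechanism, so an SMT design must build its own. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/select.h>

    static void wait_and_read(int fd)
    {
        char buf[4096];
        fd_set readfds;

        /* Non-blocking mode, as the X server and Xlib use for their pipes. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        /* Sleep until the kernel reports that data is waiting. */
        if (select(fd + 1, &readfds, NULL, NULL, NULL) > 0) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0)
                printf("read %zd bytes\n", n);
        }
    }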


#### Size of X Requests

When using pipes, the client assembles 1-3 different buffers into requests that are written to the pipe. The server then reads the pipe into a contiguous buffer. Since an X request (with the big requests extension [BIGREQ]) can be 4MB in length, an SMT implementation must either use a very large buffer (to assemble contiguous requests for use on the server side) or must provide fall-back involving memory-to-memory copies for large requests (thereby losing some fraction of the speedup provided by eliminating the copy). 

If an X request cannot fit contiguously in the shared memory buffer, other synchronization problems must be solved: 

1. If the request does not fit in the space remaining in the buffer, but would fit if the buffer was empty, then the client may request notification from the server when the buffer is empty. 
1. If the request does not fit in the buffer at all, the client must write pieces into the buffer and wait for notification from the server that another piece can be written. 
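
A sketch of how a client might handle these two cases follows. All of the `smt_*` helpers are hypothetical names introduced for illustration; they are not part of any real SMT interface.

    /* Hypothetical client-side handling of the two cases listed above. */
    #include <stddef.h>

    extern size_t smt_space_left(void);              /* free bytes in the segment */
    extern void   smt_wait_for_empty(void);          /* server says buffer has drained */
    extern void   smt_wait_for_space(size_t bytes);  /* server says a piece now fits */
    extern void   smt_write(const char *data, size_t len);

    static void smt_send_request(const char *req, size_t len, size_t buf_size)
    {
        if (len <= smt_space_left()) {
            smt_write(req, len);              /* fits in the space remaining */
        } else if (len <= buf_size) {
            smt_wait_for_empty();             /* case 1: fits once the buffer is empty */
            smt_write(req, len);
        } else {
            size_t off = 0;
            while (off < len) {               /* case 2: send the request in pieces */
                size_t piece = len - off;
                if (piece > buf_size)
                    piece = buf_size;
                smt_wait_for_space(piece);
                smt_write(req + off, piece);
                off += piece;
            }
        }
    }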

#### Security and Resource Management

Since shared memory is a finite system resource, sharing a pool of shared-memory segments among several clients may be a reasonable resource management approach. However, if this means that one client can write into the shared-memory segment of another client, there is a potential security risk. 

Since X lacks any notion of privileges and all clients are treated equally [EP91], a rogue client cannot use access to the SMT buffer of another client to do anything that could not already be done via the rogue client's own pipe. In this sense, the use of a shared pool of shared-memory segments is acceptable. From a practical debugging standpoint, however, poorly written application-level code in one application should not be able to crash another application by writing into the wrong shared memory segment. 

From a resource management standpoint, using a shared pool of memory requires careful implementation so that the X server can always tell when a client is no longer using a segment. 


### Related Work


#### The MIT-SHM Extension

The MIT-SHM extension [MITSHM] can provide some of the benefits of SMT. The MIT-SHM extension improves client performance by allowing for the creation of pixmaps in shared memory. In contrast to SMT, which is transparent to the client and has the potential to impact all protocol requests, MIT-SHM: 

1. requires special code in the application, and 
1. only improves code which uses XPutImage and XGetImage (and then only when certain restrictions apply to the pixmap format -- see [MITSHM] for details). 
The MIT-SHM extension is compatible with SMT, although its use will increase overall shared-memory resource utilization. In implementations of SMT which do not optimize [[GetImage|GetImage]], the MIT-SHM extension may still be an important tool for implementation of efficient clients. 


#### Other Shared-Memory Transport Implementations

There is no example implementation of SMT available in the freely-available X11 source tree, and all vendor-specific SMT implementations are proprietary. Hence, implementation details are not readily available. 

Information on the web [DEC, HP] and in man pages (e.g., for SGI and Sun) can provide some insight into implementation details, as discussed below. Note, however, that this information may be out of date, and our inferences may be incorrect regarding the details of a specific vendor's implementation. This discussion, however, will outline some of the major design decisions faced by any implementation of shared-memory transport. 


##### Use of the Unix Domain Socket

Vendor implementations of SMT [DEC, HP] use the usual UDS transport to provide: 

* SMT Flow control: UDS transport is used to notify the server when a request (or set of requests) is available for processing. This implies that a special X protocol extension must be used. 
* Responses: Minimally, all events and errors are sent from the server to the client using UDS transport. Many vendors also send all replies via the UDS transport, reserving SMT for requests. 

##### Use of the Shared Memory Segment

Vendors [DEC, HP] generally use the shared-memory segment only for protocol requests, although [[GetImage|GetImage]] has the potential to be dramatically improved by using shared memory for replies. For some implementations, synchronization overhead can be dramatic (e.g., XNoOp may take significantly longer with SMT than with UDS [DEC]). 


##### Activation of SMT

Most vendors key off the `DISPLAY` environment variable to determine if SMT should be used. For example, if `DISPLAY=local:0` [DEC] or `DISPLAY=shmlink:0` [HP], then SMT will be used. If `DISPLAY=unix:0`, then UDS transport will be used. And, if `DISPLAY=:0`, then the best possible transport will be used (e.g., SMT until the limit of SMT connections is reached, and UDS transport thereafter). 

Other vendors use the presence of an environment variable to alert the client that shared-memory transport should be requested. 
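
The following sketch illustrates this style of transport selection. The `local:` and `shmlink:` prefixes are the DEC and HP conventions cited above; the function itself is illustrative and is not actual vendor code.

    #include <stdlib.h>
    #include <string.h>

    enum transport { TRANSPORT_SMT, TRANSPORT_UDS, TRANSPORT_BEST };

    static enum transport pick_transport(void)
    {
        const char *display = getenv("DISPLAY");
        if (display == NULL)
            display = ":0";

        if (strncmp(display, "local:", 6) == 0 ||    /* DEC convention */
            strncmp(display, "shmlink:", 8) == 0)    /* HP convention */
            return TRANSPORT_SMT;
        if (strncmp(display, "unix:", 5) == 0)
            return TRANSPORT_UDS;
        return TRANSPORT_BEST;  /* ":0": best available, e.g. SMT until the
                                   connection limit is reached, then UDS */
    }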


##### Tuning SMT

SGI/IRIX man pages note that the number of SMT clients allowed by a particular server invocation can be tuned with an X server command-line parameter. 

Sun/Solaris man pages note that the size of the shared-memory segment can be tuned with an environment variable. This allows for per-client adjustment of the shared-memory segment size. 


#### Profile of an HP X Server

`x11perf` data was collected for HP's SMT-aware X server running under HP-UX 10.20 on an HP 9000/735. Highlights of the output of `x11perfcomp` are presented here. (The means were computed from the raw rate (operations per second) data.) 

       No SMT         SMT    Ratio   Operation
    2390000.0   3320000.0     1.39   Dot 
    1260000.0   1640000.0     1.30   1x1 rectangle 
     280000.0    293000.0     1.05   10x10 rectangle 
       9150.0      9150.0     1.00   100x100 rectangle 
        437.0       437.0     1.00   500x500 rectangle
    1860000.0   2300000.0     1.24   1-pixel line 
     936000.0   1020000.0     1.09   10-pixel line 
     123000.0    124000.0     1.01   100-pixel line 
      25300.0     25300.0     1.00   500-pixel line 
      16900.0     20900.0     1.24   PutImage 10x10 square 
        988.0      1620.0     1.64   PutImage 100x100 square 
         40.0        33.9     0.85   PutImage 500x500 square 
      20100.0     23300.0     1.16   ShmPutImage 10x10 square 
       3390.0      3540.0     1.04   ShmPutImage 100x100 square 
        131.0       131.0     1.00   ShmPutImage 500x500 square 
     247000.0    252000.0     1.02   X protocol NoOperation
    
    Geometric mean for all operations:  1.06
    Maximum ratio:                      1.79 (1-pixel solid circle)
    Minimum ratio:                      0.80 (GetProperty)
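
For reference, an aggregate ratio such as the geometric mean reported above can be computed from the per-operation rate ratios as in the sketch below. This shows only the arithmetic; it is not the `x11perfcomp` source, and the sample ratios are just a few of the entries from the table.

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean: the nth root of the product of the ratios,
     * computed via logarithms to avoid overflow. */
    static double geometric_mean(const double *ratios, int n)
    {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratios[i]);
        return exp(log_sum / n);
    }

    int main(void)
    {
        /* A few of the SMT/no-SMT ratios from the table above. */
        double ratios[] = { 1.39, 1.30, 1.05, 1.00, 1.00, 1.24, 1.09, 1.01 };
        printf("geometric mean: %.2f\n",
               geometric_mean(ratios, (int)(sizeof(ratios) / sizeof(ratios[0]))));
        return 0;
    }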

Note the trend over the rectangle and line operations: each of these operations requires the same amount of protocol to be transmitted, but the operations that can be rendered faster are accelerated more by SMT. Here, rendering time is the limiting factor on SMT performance improvement. Other operations with small protocol footprints showed a similar profile. 

For the [[PutImage|PutImage]] operations, images that will fit in the shared-memory area are accelerated by SMT, with larger images being accelerated more, presumably because of the copy elimination. When the image no longer fits in the shared-memory area, the synchronization overhead of SMT slows the operation. 


#### Profile of an SGI X Server

`x11perf` data was collected for SGI's SMT-aware X server running under IRIX 6.5 on an SGI Indigo2 High Impact. Highlights of the output of `x11perfcomp` are presented here. (The means were computed from the raw rate (operations per second) data. The SMT run did not complete, so all operations are not represented in the aggregate results.) 

       No SMT         SMT    Ratio   Operation
    1940000.0   5270000.0    2.72    Dot 
    1400000.0   2370000.0    1.69    1x1 rectangle 
    1070000.0   1510000.0    1.41    10x10 rectangle 
      18900.0     18900.0    1.00    100x100 rectangle 
       1110.0      1110.0    1.00    500x500 rectangle 
    1450000.0   3180000.0    2.19    1-pixel line 
    1450000.0   3110000.0    2.14    10-pixel line 
     409000.0    446000.0    1.09    100-pixel line 
      92000.0     91900.0    1.00    500-pixel line 
      34900.0     53100.0    1.52    PutImage 10x10 square 
        601.0      1530.0    2.55    PutImage 100x100 square 
         41.5        75.1    1.81    PutImage 500x500 square 
      30400.0     40600.0    1.34    ShmPutImage 10x10 square 
       1990.0      1940.0    0.97    ShmPutImage 100x100 square 
        303.0       298.0    0.98    ShmPutImage 500x500 square 
     529000.0    771000.0    1.46    X protocol NoOperation
    
    [See comment above: these aggregate values are based on incomplete data.]
    Geometric mean for executed operations:  1.08
    Maximum ratio:                           2.72 (Dot)
    Minimum ratio:                           0.62 (10-pixel wide partial circle)

Again, there is a trend over the rectangle and line operations: operations that take less time to render show a greater improvement when SMT is used. 


## Design Criteria

The previous section outlines some of the possible trade-offs involved in the design of an SMT system. This section outlines the design choices that were made for the XFree86 SMT implementation: 


### Performance Goals

The SMT implementation must have the following performance characteristics (performance will be measured using x11perf): 

* When SMT is not in use, the SMT-aware X server should have performance identical to that of a non-SMT-aware X server. This ensures that enabling the SMT patches does not impact the performance of any non-SMT clients. 
* The performance of an SMT client should, on average, be better than that of a non-SMT client. 
* The performance of every `x11perf` test should not be significantly worse with SMT than without it (e.g., a 50% drop in XNoOp performance is not acceptable). 
* The addition of support for SMT should not impact the performance of other performance-related extensions, such as MIT-SHM [MITSHM] and the big requests extension [BIGREQ]. Indeed, big requests should transparently use the capabilities of the SMT implementation. 

### Security and Stability


#### Authenticated Connection

A client can only initiate SMT after another local transport (e.g., UDS) has been connected. Any client authentication (i.e., via xauth) will be performed using the other transport, so the SMT implementation does not have to do any additional authentication. 


#### No Segment Sharing

Some SMT implementations have used a shared pool of shared-memory segments to which all clients have access. This style of implementation is difficult because of the complexity of resource sharing issues, and should be avoided for an initial implementation. Therefore, the initial implementation of SMT for XFree86 will use a separate shared-memory segment for each client. 

If additional segments are required, they will either be private (per-client) read-write segments, as described in this section; or they will be public segments that allow read-only access by all of the clients and read-write access only by the X server. 


#### Server Creates and Destroys

The X server will create all shared-memory segments. At the request of the client, the X server will create a segment, change the owner of the segment to the user ID of the client, and provide the segment identifier to the client. This ensures that a rogue client cannot attach to arbitrary shared-memory segments. A rogue client could attach to the segments of other clients that share its user ID, but there is no advantage to this type of attack: even without SMT, the rogue could open a standard connection to the X server and issue commands that impact the other clients [EP91]. 

After the client attaches to the segment, the client will request that the X server start using the segment for transport. At this point, the X server can mark the segment as destroyed. This will prevent any other clients from attaching to the segment, and will ensure that the segment is returned to the system when the client and the X server detach. The server will also destroy and detach the segment if the client closes the non-SMT X protocol connection. 
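
A sketch of this create/change-owner/destroy sequence, using the System V calls, might look as follows. The function names are hypothetical and error handling is abbreviated.

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Create a segment, hand ownership to the client's user ID, and
     * return the segment identifier to pass back to the client. */
    static int smt_create_segment(size_t size, uid_t client_uid)
    {
        struct shmid_ds ds;
        int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (shmid < 0)
            return -1;

        /* Change the owner so only the requesting client can attach. */
        shmctl(shmid, IPC_STAT, &ds);
        ds.shm_perm.uid = client_uid;
        shmctl(shmid, IPC_SET, &ds);
        return shmid;
    }

    /* Once the client has attached and asked the server to start using
     * the segment, mark it destroyed: no further attaches are possible,
     * and the memory is reclaimed when both sides have detached. */
    static void smt_seal_segment(int shmid)
    {
        shmctl(shmid, IPC_RMID, NULL);
    }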


### Resource Management

Shared memory is a precious system-wide resource. The `XF86Config` file should specify: 

* minimum and maximum values for the amount of shared memory that may be used by each client, and 
* a maximum value for the amount of shared memory that may be used by all clients. 
When the shared memory limits are exceeded, a request for SMT should fall back to a non-SMT connection. 

A client will request SMT if the `XF86SMT` environment variable is set to a non-zero value that specifies the number of bytes requested for the shared memory segment. The shared memory segment actually used will be the lesser of the requested size and the minimum specified in the config file. SMT will not be activated if there is no more shared memory available. 
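
A minimal sketch of the client-side check for the `XF86SMT` variable might look as follows. The helper name is hypothetical; the clamping against the `XF86Config` limits (and the refusal when shared memory is exhausted) happens on the server side.

    #include <stdlib.h>

    /* Return the number of bytes to request for the SMT segment,
     * or 0 if SMT should not be requested at all. */
    static size_t smt_requested_bytes(void)
    {
        const char *value = getenv("XF86SMT");
        if (value == NULL)
            return 0;                /* variable unset: use normal transport */

        long bytes = strtol(value, NULL, 10);
        if (bytes <= 0)
            return 0;                /* zero or malformed: use normal transport */

        return (size_t)bytes;        /* the server applies its configured limits */
    }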


## Staged Implementation

This section discusses a staged development process for adding SMT to XFree86. Early stages will lead to the rapid implementation of a usable shared-memory transport for XFree86. Later stages will require more implementation work, but will increase the performance improvement provided by SMT. 


### Stage 1: Simple Two-Copy Implementation

The first stage of SMT implementation will implement: 

* New Shared-Memory Transport: The X transport interface [XTRANS] has been encapsulated to simplify the addition of new transports. The SMT transport is a special case, however, since it relies on another transport being available for synchronization and, possibly, events, errors, and replies. 
* New X Protocol Extension: A protocol extension is required to establish the SMT connection. 
Stage 1 will demonstrate the feasibility of a simple, straightforward SMT implementation that touches a minimum amount of code in the [[X11R6|X11R6]] source tree. This stage of implementation will send all X protocol commands through the shared-memory segment, using two user-space memory-to-memory copies (one on the client side, and one on the server side). The shared-memory segment is divided into two parts. The first part contains read and write pointers into the second part. The second part is treated as a large circular buffer. 
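
A sketch of this segment layout follows. The structure and field names are illustrative rather than the names used in the actual patches; the 256kB size matches the buffer used for the measurements reported later in this paper.

    #include <stddef.h>

    #define SMT_BUFFER_SIZE (256 * 1024)    /* 256kB circular buffer */

    typedef struct {
        /* First part: read and write offsets into the circular buffer.
         * The client advances 'write' after copying a request in; the
         * server advances 'read' after consuming requests. */
        volatile size_t read;
        volatile size_t write;

        /* Second part: the circular buffer holding X protocol data. */
        char data[SMT_BUFFER_SIZE];
    } SMTSegment;

    /* Bytes currently available for the client to write.  One byte is
     * kept unused so that read == write unambiguously means "empty". */
    static size_t smt_bytes_free(const SMTSegment *seg)
    {
        if (seg->write >= seg->read)
            return SMT_BUFFER_SIZE - (seg->write - seg->read) - 1;
        return seg->read - seg->write - 1;
    }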

This implementation stage avoids moving data through the kernel, but still uses two copies to move the data between the client and the server. Since these two copies approximate the amount of work performed by a UDS-based implementation, this stage is not expected to be faster than pure UDS transport. Indeed, because of the simplistic synchronization used for this stage of implementation, performance will suffer. 


#### Performance Evaluation

The initial implementation used the UDS (pipe) transport to send the size of the SMT data to the server after each write, and polled when the pipe was full. Because of this polling, this stage had extremely poor performance: 

       No SMT         SMT    Ratio   Operation
    5720000.0   1320000.0    0.23    Dot 
    2060000.0    497000.0    0.24    1x1 rectangle 
     808000.0    311000.0    0.38    10x10 rectangle 
      24300.0     24200.0    1.00    100x100 rectangle 
       1070.0      1070.0    1.00    500x500 rectangle 
      57100.0      8020.0    0.14    PutImage 10x10 square 
       1110.0       488.0    0.44    PutImage 100x100 square 
         32.0        22.6    0.71    PutImage 500x500 square 
    2150000.0   2180000.0    1.01    X protocol NoOperation

After the polling was eliminated, the performance improved: 

       No SMT         SMT    Ratio   Operation
    5720000.0   4240000.0    0.74    Dot 
    2060000.0   1810000.0    0.88    1x1 rectangle 
     808000.0    768000.0    0.95    10x10 rectangle 
      24300.0     24200.0    1.00    100x100 rectangle 
       1070.0      1070.0    1.00    500x500 rectangle 
      57100.0     52400.0    0.92    PutImage 10x10 square 
       1110.0       818.0    0.74    PutImage 100x100 square 
         32.0        28.4    0.89    PutImage 500x500 square 
    2150000.0   2160000.0    1.00    X protocol NoOperation

Without polling, operations that take longer to render perform at a speed similar to pipes. Operations that render quickly are slower because the overhead of copying and synchronization is starving the rendering engine. The Stage 2 implementation will concentrate on removing this overhead. 


### Stage 2: Copy and Synchronization Elimination

One of the well-known ways that SMT improves X performance is via copy-elimination: without SMT, the client does a copy from its buffer into the kernel-side pipe, and the server makes a copy of the pipe contents into its own buffer. However, use of the pipe has some constant overhead. Experiments that I've done suggest that pipe overhead can dominate copy time (especially when each buffer is already in the L2 cache, which may be the case a significant amount of the time). Hence, elimination of synchronization via the pipe is an extremely important part of SMT implementation. 

Stage 2 contains several sub-stages which are relatively independent: 

* Stage 2a: Elimination of server-side copy 
* Stage 2b: Partial elimination of synchronization 
* Stage 2c: Elimination of client-side copy 
* Stage 2d: Further elimination of synchronization 
Note that the elimination of client-side copies (Stage 2c) requires the introduction of additional synchronization. For the current implementation, Stage 2d is a replacement for Stage 2c. In this implementation, both the Stage 2c and the Stage 2d implementations sit on top of Stages 2a and 2b. 


### Stage 2a: Elimination of Server-Side Copy

Elimination of server-side copies requires solving two problems: 

* handling requests that are too big to fit in the shared-memory segment, or that are non-contiguous in the segment, and 
* notifying the client when the request is finished so that that part of the segment can be reused. 
Any request that falls off the end of the shared-memory segment causes the server to revert to a copy-based buffer. This same method also handles the case of a request that is too big for the buffer. 

Whenever the server asks for a new request, it is guaranteed that the server has finished processing all previous requests. At this time, the read pointer for the circular buffer in the shared-memory segment can be advanced. Currently, the pointer is advanced only after a complete subsequent request is received (or immediately, if reversion to a copy-based buffer has taken place). 
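
The server-side logic just described might be sketched as follows. The names are illustrative; requests larger than the whole segment arrive in pieces and are assembled in the same copy-based buffer, which the sketch does not show.

    #include <stddef.h>
    #include <string.h>

    extern char *smt_copy_buffer;    /* fall-back, copy-based request buffer */

    /* Return a pointer to the next request of 'len' bytes at offset 'off'
     * in the circular buffer 'data' of size 'size' (len <= size here). */
    static const char *smt_next_request(const char *data, size_t size,
                                        size_t off, size_t len)
    {
        if (off + len <= size)
            return data + off;       /* contiguous: dispatch directly, no copy */

        /* Request wraps off the end of the buffer: revert to the
         * copy-based buffer and reassemble the request there. */
        size_t first = size - off;
        memcpy(smt_copy_buffer, data + off, first);
        memcpy(smt_copy_buffer + first, data, len - first);
        return smt_copy_buffer;
    }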


### Stage 2b: Partial Elimination of Synchronization

In the Stage 2a implementation, the pipe was used to notify the server whenever a copy into the shared-memory area was performed. This use of the pipe approximates the frequency at which the pipe would be used by a non-SMT implementation. In order to reduce use of the pipe in these situations, the following changes had to be made: 

* The client should use the pipe only when the server is sleeping. 
* Before doing a `select(2)`, the server should check all shared-memory areas to determine if there is pending input. 
The first problem was solved by creating another shared-memory segment that was writable by the server and readable by all of the clients. This segment contains a flag that the server sets before going to sleep (i.e., before calling `select(2)`). After copying information to the shared-memory segment, the client examines this flag and uses the pipe only if the server is asleep. Also, before the client goes to sleep (i.e., before calling `select(2)` or `poll(2)`), it makes sure the server is awake. 

Obviously, this scheme could introduce a race condition that would create deadlock with both the client and server sleeping. This race condition is avoided using strict client-side and server-side ordering. The client will: 

1. Update the write pointer for the shared-memory segment. 
1. Check the flag and write to the pipe if the server is sleeping. 
1. Go to sleep if the client must wait on the server. 
The server will: 

1. Set the flag. 
1. Check all shared-memory areas for data (this can be optimized using a bitmap). If data is available, reset the flag and process the data, skipping the next step. 
1. Go to sleep. 
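
The ordering described in the two lists above can be sketched as follows. The flag and helper names are illustrative, and the sketch ignores memory-ordering subtleties.

    #include <stddef.h>
    #include <unistd.h>

    extern volatile int server_asleep;      /* flag in the server-writable segment */
    extern int          client_pipe_fd;     /* the ordinary UDS/pipe connection */
    extern int          smt_any_data_pending(void);  /* scans all clients' segments */

    /* Client side: run after copying data into the shared-memory segment. */
    static void client_notify(volatile size_t *write_ptr, size_t new_write)
    {
        *write_ptr = new_write;              /* 1. publish the new write pointer */
        if (server_asleep) {                 /* 2. wake the server only if it sleeps */
            char byte = 0;
            (void)write(client_pipe_fd, &byte, 1);
        }
        /* 3. only now may the client itself block in select() or poll() */
    }

    /* Server side: run immediately before blocking in select(2). */
    static int server_may_sleep(void)
    {
        server_asleep = 1;                   /* 1. announce the intent to sleep */
        if (smt_any_data_pending()) {        /* 2. re-check *after* setting the flag */
            server_asleep = 0;               /*    data arrived: stay awake, process it */
            return 0;
        }
        return 1;                            /* 3. nothing pending: safe to select(2) */
    }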

#### Performance Evaluation

       No SMT         SMT    Ratio   Operation
    5720000.0   5650000.0    0.99    Dot 
    2060000.0   2270000.0    1.10    1x1 rectangle 
     808000.0    838000.0    1.04    10x10 rectangle 
      24300.0     24300.0    1.00    100x100 rectangle 
       1070.0      1070.0    1.00    500x500 rectangle 
      57100.0     63900.0    1.12    PutImage 10x10 square 
       1110.0      1200.0    1.08    PutImage 100x100 square 
         32.0        28.4    0.89    PutImage 500x500 square 
    2150000.0   2460000.0    1.14    X protocol NoOperation

### Stage 2c: Elimination of Client-Side Copy

The client-side library uses several buffers for protocol requests to the server. The primary buffer is used for protocol headers and for many complete protocol requests. Elimination of the copy is easy for this buffer, since it can be directly replaced with the shared-memory segment. The only difficult problem is dealing with protocol requests that would wrap off the end of the buffer. These are handled by detecting when a wrap would have occurred, writing a special protocol request to the buffer that tells the server to wrap the pointers, and either writing into the first part of the buffer (if it is available) or waiting for the server to free up space in the buffer. 

The client also mallocs auxiliary buffers on-the-fly to handle the data portion of larger requests. Copies involving these buffers were not eliminated in this implementation, but would be a potential area for future improvement. 
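
A sketch of the wrap handling described above follows. The request layout and names are hypothetical, and the sketch omits the case in which the client must wait for the server to free space at the front of the buffer.

    #include <stddef.h>
    #include <stdint.h>

    #define SMT_REQ_WRAP 0xff                /* hypothetical wrap opcode */

    typedef struct {
        uint8_t  reqType;                    /* X requests begin with an opcode... */
        uint8_t  pad;
        uint16_t length;                     /* ...and a length in 4-byte units */
    } SMTWrapReq;

    extern char  *smt_data;                  /* start of the circular buffer */
    extern size_t smt_size;                  /* size of the circular buffer */
    extern size_t smt_write;                 /* client-side write offset */

    /* Ensure 'need' contiguous bytes are available at the write offset.
     * (Assumes at least sizeof(SMTWrapReq) bytes always remain at the
     * end of the buffer when a wrap is needed.) */
    static void smt_reserve(size_t need)
    {
        if (smt_write + need > smt_size) {
            SMTWrapReq *wrap = (SMTWrapReq *)(smt_data + smt_write);
            wrap->reqType = SMT_REQ_WRAP;    /* tell the server to wrap its pointers */
            wrap->pad     = 0;
            wrap->length  = sizeof(SMTWrapReq) / 4;
            smt_write     = 0;               /* continue writing at the buffer start */
        }
    }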


#### Performance Evaluation

       No SMT         SMT    Ratio   Operation
    5720000.0   5940000.0    1.04    Dot 
    2060000.0   2260000.0    1.10    1x1 rectangle 
     808000.0    837000.0    1.04    10x10 rectangle 
      24300.0     24300.0    1.00    100x100 rectangle 
       1070.0      1070.0    1.00    500x500 rectangle 
      57100.0     77700.0    1.36    PutImage 10x10 square 
       1110.0      1430.0    1.29    PutImage 100x100 square 
         32.0        28.5    0.89    PutImage 500x500 square 
    2150000.0   2410000.0    1.12    X protocol NoOperation

### Stage 2d: Further Elimination of Synchronization

Since Stage 2c added additional synchronization points, and since the use of the pipe appears to dominate any potential performance gain obtained from copy-elimination, Stage 2d will focus on elimination of synchronization without the elimination of client-side copies: 

* Optimize server-side reversion to copy-based buffer: only copy the current request, not all available bytes in the shared-memory segment. This should allow a rapid return to the copy-free state. 
* Clean up code to avoid unnecessary function calls (e.g., to test if the client is using SMT). 

#### Performance Evaluation

       No SMT         SMT    Ratio   Operation
    5720000.0   6590000.0 (  1.15)   Dot 
    2060000.0   2290000.0 (  1.11)   1x1 rectangle 
     808000.0    842000.0 (  1.04)   10x10 rectangle 
      24300.0     24300.0 (  1.00)   100x100 rectangle 
       1070.0      1070.0 (  1.00)   500x500 rectangle 
      57100.0     79900.0 (  1.40)   PutImage 10x10 square 
       1110.0      1260.0 (  1.14)   PutImage 100x100 square 
         32.0        28.4 (  0.89)   PutImage 500x500 square 
    2150000.0   2480000.0 (  1.15)   X protocol NoOperation

These results are as good as or better than the Stage 2c results for most operations, without the additional complexity required for reduction of client-side copies. 


## Evaluation

The data above were collected during the implementation phase using versions of XFree86 prior to 3.9.18. The following data were collected using XFree86 3.9.18 after a reboot on a 350MHz Pentium II with an ATI Rage 128 Pro card and 64MB of PC-100 SDRAM running Linux 2.3.42. 

The following table shows the means of `x11perf` output with the SMT patches, for clients that do or do not use SMT as the transport mechanism, compared with `x11perf` output for an X server built from a pristine XFree86 3.9.18 tree. Comparisons are made for 8 bit (8 bits per pixel) and 24 bit (32 bits per pixel) color depths. 

                            No SMT      SMT
    8bit geometric mean:      1.01     1.03
    24bit geometric mean:     1.01     1.03

While some operations are more than 40% faster with SMT enabled, enabling SMT makes other operations up to 20% slower. The overall gain provided by the current SMT implementation is less than 5%. 

Clearly, SMT cannot improve operations that are render-bound, so the amount of overall improvement will depend on the speed of the graphics card compared with the speed of the host CPU: as the mismatch in relative speed increases, the improvement provided by SMT will also increase. 

Data from a small sample of mature proprietary SMT implementations support this conclusion, showing an overall 6-8% performance improvement when using SMT, with some operations up to 80-170% faster and other operations 20-40% slower. Workstation graphics subsystems often cost more, relative to the rest of the machine, than do PC-class graphics solutions. Since the relative mismatch in speed is greater, the relative improvement in SMT is also greater. This was demonstrated by gathering test data using a slower graphics card: the results, as expected, showed that the performance gain from SMT was under 3%. 

A fairer evaluation of SMT would examine only those operations which are not render bound. To illustrate, the following data were computed from `x11perf` operations that involve fewer than 101 pixels or fewer than 5 kids (child windows): 

    8bit geometric mean:      1.06
    24bit geometric mean:     1.05

These data suggest that applications that frequently use small rendering operations (e.g., window managers) would see a larger performance improvement than other applications. However, because the improvement is so small, it may be dominated by other factors. 


## Recommendations

SMT improves X11 performance, as measured by `x11perf`, by improving the speed of operations that are transport bound but not render bound. Operations that render tens or hundreds of pixels can be improved, but operations that render larger numbers of pixels (and are, therefore, already limited by the speed of the graphics adapter) are not improved. The amount of improvement for transport-bound operations is a function of the mismatch in speed between the host CPU and the graphics subsystem. When considering the host CPU, the quality of the X server and operating system implementations must also be taken into account, since these determine how much work the host CPU must do. 

On older workstations, the graphics subsystem often cost more than the rest of the machine, and was relatively much faster than the host CPU. On machines like this, marketing numbers for the improvement provided by SMT were often in the 15-25% range. Unfortunately, none of these machines were available for testing. 

A very small sample of newer workstations was tested, and the overall performance improvement provided by SMT was found to be in the 6-8% range. 

On typical PC-class hardware, the cost of the graphics adapter usually represents 5-15% of the cost of the rest of the machine and the rendering engine is relatively well matched to the speed of the host CPU. For systems like this, the current SMT implementation for XFree86 provided an overall speed improvement of less than 5%. This improvement will vary: 

* as a function of the mismatch in speeds between the host CPU and the graphics renderer, and 
* as a function of the optimization of the SMT implementation. 
However, on modern PC-class hardware, an extremely optimized SMT implementation will probably improve performance by less than 10% (based on observations of workstation implementations and current graphics hardware), and a significant portion of this improvement can be obtained by building well-matched machines (e.g., using a faster host CPU when using a faster graphics adapter). 

Therefore, I do not recommend continued work on the current SMT project. The current SMT implementation will be submitted to XFree86, either for archival purposes or for inclusion in the distribution. SMT may be of significant benefit to those users who have mismatched hardware (e.g., an older CPU with a newer graphics card), but will probably not benefit high-end graphics users. 


## Future Work

As noted above, I do not recommend pursuing further SMT optimizations since they are not expected to provide a large performance win for the vast majority of PC-class systems. However, there may be some niche where further development of SMT is justified, so some possible avenues of exploration are discussed in this section. 


### Elimination of Client-Side Copies

Elimination of client-side copies is difficult and adds additional synchronization points between the client and the X server (because the request must be contiguous in the buffer and cannot wrap). Further, many operations use an auxiliary buffer that is allocated on-the-fly. Elimination of client-side copies should address the synchronization problem and should eliminate copying of the auxiliary buffer. These problems were not addressed fully in the Stage 2c prototype discussed above. 


### PutImage and GetImage

[[PutImage|PutImage]] and [[GetImage|GetImage]] are commonly-used operations that transport large amounts of protocol to or from the X server. These operations are candidates for performance improvements that may be orthogonal to a general SMT implementation (i.e., they could be transparently improved using shared memory without using a full shared-memory transport). However, there are several factors that may limit the usefulness of these optimizations: 

* Any performance improvement would depend on the size of the shared-memory region that was used for these operations. For example, a 500x500 color pixmap can require nearly 1MB of memory. All of the SMT data in this paper that was collected using XFree86 used a 256kB shared-memory buffer. Using a buffer that was smaller or larger degraded performance (perhaps because of L2 cache issues). So, using a 1MB (or larger) buffer for specialized [[PutImage|PutImage]] and [[GetImage|GetImage]] transport may not provide the expected gains. (Note, however, that if 256kB of image transport memory was typically available, applications programmers would be willing to break larger images up into pieces if there was a significant performance boost.) 
* The expected gains, based on observed SMT data, are in the 10-40% range. Improvements in this range may not justify the cost of the initial implementation. 
* Applications that are [[PutImage|PutImage]] intensive probably already use the MIT shared-memory extension [MITSHM]. 

## Acknowledgements

Special thanks to Jens Owen and Kevin Martin for discussing SMT implementation issues and for reviewing early versions of this document. Thanks to Red Hat for funding this project. 


## References Cited

[BIGREQ] Bob Scheifler. Big Requests Extension, Version 2.0. Available from xc/doc/specs/Xext/bigreq.ms and xc/doc/hardcopy/Xext/bigreq.PS. 

[DEC] X Window System Environment (Digital UNIX Version 4.0 or higher, March 1996). Digital Equipment Corporation, 1996. Available from [[http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/HTML/AA-Q7RNB-TE_html/TITLE.html|http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/HTML/AA-Q7RNB-TE_html/TITLE.html]]. Section 4.1.10, "SMT - Shared Memory Transport Extension (Digital provided)" is available from [[http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/HTML/AA-Q7RNB-TE_html/xenvCH4.html|http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/HTML/AA-Q7RNB-TE_html/xenvCH4.html]]. 

[EP91] Jeremy Epstein and Jeffrey Picciotto. Trusting X: Issues in Building Trusted X Window Systems -or- What's not Trusted About X? In Proceedings of the 14th Annual National Computer Security Conference, Washington, DC, October 1991. 

[HP] [[http://hpcc940.external.hp.com/xwindow/noFrames/features/smt.html|http://hpcc940.external.hp.com/xwindow/noFrames/features/smt.html]]. 

[MITSHM] Keith Packard. MIT-SHM -- The MIT Shared Memory Extension. Available from xc/doc/specs/Xext/mit-shm.ms and xc/doc/hardcopy/Xext/mit-shm.PS. 

[XTRANS] Stuart Anderson and Ralph Mor. X Transport Interface. Available from xc/doc/specs/xtrans/xtrans.mm and xc/doc/hardcopy/xtrans/Xtrans.PS. 

[XPROTO] Adrian Nye, editor. X Protocol Reference Manual for X11 Version 4, Release 6 (X Volume Zero). Sebastopol, California: O'Reilly & Associates, 1995.