## Issues

X and GL applications need to control how their data is displayed for several reasons:  

   * preventing distracting artifacts - ugly "tearing" or partial drawing for example 
   * smoothness - constant, predictable frame rates are a requirement for animation, video and high end simulation programs 
   * performance - apps need to know how well they're keeping up with their desired frame rate so they can throttle back or simplify drawing as needed 

Plain X doesn't have many ways of doing the above (it mostly assumes immediate mode drawing to the visible frame buffer), but GLX provides many extensions for controlling how buffers are displayed on the user-visible screen. These extensions need to be supported in DRI2 for the Linux graphics stack to really shine. 

There are also issues related to memory consumption, swap behavior, and performance that can be addressed: 


### Memory savings

   * memory consumption could be reduced if the private back/front pair were reduced to just a private back buffer, with the compositor copy acting as front (though there are issues with front buffer rendering in this case). 
         * This can be implemented purely client side through the addition of compositor<->client protocols. 
   * it could also be reduced by throwing away the private back buffer in between frame rendering. 
         * This can happen automatically with some additions to the DRI2 protocol and client/server behavior. 

### Performance

   * performance could be significantly improved on low bandwidth platforms if buffer swaps could be simple pointer exchanges when windowed (similar to the way page flips work for full screen applications) 
         * requires window managers to draw decorations independent of application window (i.e. exchange is only possible if front & back window pixmaps are the same size) 
   * page flipping for full screen swaps is also a significant win on bandwidth limited platforms 

### Behavior

   * some applications want to control how buffer swaps occur (e.g. to preserve back buffer contents after a swap) 
   * triple buffering should be available to applications that need it, with configurable behavior for discard vs. queue if the buffers are rendered faster than can be displayed 

## OpenGL and GLX swap and throttling related extensions

Under a compositor, the behavior of these routines could change, or additional compositor<->client protocol added to support similar behavior.  Changes noted below, though in general "video frame" should be thought of as a virtualized compositor frame rate rather than a monitor refresh (e.g. the compositor may report 60fps to applications even though it only updates the screen when recompositing is needed): 

   * [[SGI_swap_control|http://www.opengl.org/registry/specs/SGI/swap_control.txt]] - controls how frequently glXSwapBuffers swaps occur (in frames) 
         1. for redirected windows, interval is in compositor frames rather than monitor video frames 
   * [[SGI_video_sync|http://www.opengl.org/registry/specs/SGI/video_sync.txt]] - allows clients to query frame counts and wait on specific counts or divisor/remainders thereof 
         1. for redirected windows, glXGetVideoSyncSGI returns the compositor frame count rather than the monitor frame count 
         1. for redirected windows, glXWaitVideoSyncSGI will block until the compositor frame count satisfies the specified conditions 
   * [[OML_sync_control|http://www.opengl.org/registry/specs/OML/glx_sync_control.txt]] - combines the above and adds the notion of a "swap count" and frame timestamp, allowing applications to closely monitor their performance and swap frequency 
         1. for redirected windows, MSC represents the compositor frame count; likewise UST indicates the time when the compositor last generated a frame 
         1. for redirected windows, SBC could be incremented when the compositor copies a buffer rather than when the buffer is copied from back to front 
   * [[OML_swap_method|http://www.opengl.org/registry/specs/OML/glx_swap_method.txt]] - exposes the swap method (copy, exchange or undefined) in the FBconfig 
         1. the way X and compositors work now makes it hard to report anything other than 'undefined' as the swap method; page flipping is opportunistic, and exchange is difficult for windows (redirected or not) due to reparenting by window managers 
   * [[SGIX_swap_group|http://www.opengl.org/registry/specs/SGIX/swap_group.txt]]/[[NV_swap_group|http://www.opengl.org/registry/specs/NV/glx_swap_group.txt]] - allow clients to swap in a synchronized manner 
         1. for redirected windows, swap groups could depend on compositor copies (see SBC count above) 
   * [[SGIX_swap_barrier|http://www.opengl.org/registry/specs/SGIX/swap_barrier.txt]] - controls swap group behavior 
         1. should be unaffected 
   * TBD_swap_control - allow selection of triple/N buffering 
         1. needs to be defined, can fail if driver can't handle requested number of buffers 
   * [[INTEL_swap_event|http://people.freedesktop.org/~jbarnes/swapbufferevent.txt]] - deliver swap information to clients after a swap completes, useful for integrating swap based throttling into client event loops 

## Compositor extensions

To support the above, feedback from the currently active compositor is necessary.  Compositor<->client protocol is needed for: 

   * [[CompositeNotifyPixmapCopied|CompositeNotifyPixmapCopied]](pixmap) - compositor notifies server that pixmap has been copied, unblocks clients blocked on swapping or rendering to front 
         * should be doable with xsync already, as a defined protocol between the compositor and clients 
   * [[NotifyPixmapReady|NotifyPixmapReady]](pixmap) - server notifies compositor that an application has a new frame ready (e.g. after a glXSwapBuffers) 
         * should be similar to a damage event, maybe damage is sufficient? 
   * [[CompositeNotifyFrameDone|CompositeNotifyFrameDone]] - compositor notifies client that the compositor has finished drawing a new frame; clients blocked on frame related events can continue. 
         * again, should be doable with xsync 

## Triple buffering

Triple buffering means different things to different people.  For convenience, we list the types we're concerned about for this discussion here.  All have a high memory cost and should generally only be enabled for a small number of clients at a time. 

   1. compositor based - compositor keeps a private copy of each application's front buffer; this means it always has a consistent, fully drawn pixmap to use for creating screen frames.  glXSwapBuffers updates the redirected front and notifies the compositor a new one is ready to pick up as its private copy. 
            * good for avoiding ugly partial drawing artifacts 
   1. server based - server keeps last ready client buffer around and returns available buffers to the client after a glXSwapBuffers occurs 
            * can help keep clients busy at the cost of extra memory (good to keep frame rates up while preserving vblank sync'd swapping) 
   1. client based - server returns 3 buffers to the client, which somehow requests copies between various of them using glXSwapBuffers 
            * like (2) but totally client side. 

Another factor for triple buffering is how to handle extra frames.  If frames are rendered faster than they can be displayed, some applications may want to discard the extra frames, while others may want to queue them (and likely be throttled). 
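
The discard vs. queue choice can be illustrated with a small model.  This is only a sketch of the two policies, not code from any driver or server: the `FrameQueue` class, its `policy` argument and the `spare_buffers` limit are hypothetical names chosen for the example.

```python
from collections import deque


class FrameQueue:
    """Frames that have been rendered but not yet displayed (hypothetical model).

    policy="discard": only the newest pending frame is kept; older ones are dropped.
    policy="queue":   frames are kept in order and the client is throttled
                      once every spare buffer holds a pending frame.
    """

    def __init__(self, policy, spare_buffers=2):
        if policy not in ("discard", "queue"):
            raise ValueError("policy must be 'discard' or 'queue'")
        self.policy = policy
        self.spare_buffers = spare_buffers
        self.pending = deque()
        self.dropped = 0

    def submit(self, frame):
        """Client finished a frame. Returns False if the client must wait."""
        if self.policy == "discard":
            # Anything still waiting for display is superseded by this frame.
            self.dropped += len(self.pending)
            self.pending.clear()
            self.pending.append(frame)
            return True
        if len(self.pending) >= self.spare_buffers:
            return False  # no free buffer: throttle the client
        self.pending.append(frame)
        return True

    def vblank(self):
        """The display takes the next pending frame, or None if there is none."""
        return self.pending.popleft() if self.pending else None
```

With "discard" a fast client never blocks but some of its frames are never shown; with "queue" every frame is shown but the client ends up running at the display rate.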


## New code

All of this means new DRI2, display server and compositor code, but it should be doable.  In particular we may want: 

   * DRI2 proto for waiting on a given swap or frame count 
   * DRI2 swapbuffers support for frame count & divisor/remainder delayed swaps 
   * new server code for handling swap groups 
   * new server code for handling indirect clients doing frame count or swap buffer count waits 
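
As a rough illustration of what "frame count & divisor/remainder delayed swaps" involves, the sketch below computes the frame count (MSC) at which a swap should happen, following the rule described for glXSwapBuffersMscOML in the OML_sync_control specification.  The function name is hypothetical, and returning the current MSC for the "as soon as possible" case is a simplification; a real implementation would still wait for a vblank.

```python
def swap_target_msc(current_msc, target_msc, divisor, remainder):
    """Return the MSC at which the swap should occur (illustrative only).

    Assumes 0 <= remainder < divisor whenever divisor is non-zero.
    """
    if current_msc < target_msc:
        # Target is still in the future: swap exactly at target_msc.
        return target_msc
    if divisor == 0:
        # Target already reached and no divisor given: swap as soon as possible.
        return current_msc
    # Target already reached: swap the next time MSC is incremented to a
    # value with msc % divisor == remainder (strictly after current_msc).
    candidate = current_msc - (current_msc % divisor) + remainder
    if candidate <= current_msc:
        candidate += divisor
    return candidate
```

For example, with the counter at 100, a target of 90, a divisor of 4 and a remainder of 1, the swap lands on frame 101.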

### Implementation: SGI_swap_control for the X server

The SGI_swap_control extension allows applications to control their glXSwapBuffers frequency.  The glXSwapIntervalSGI call lets clients specify how frequently, in frames, their buffer swaps should occur.  Implementing this requires server support, since with DRI2 the server is responsible for performing swaps.  The basic flow is as follows: 

   1. When an application calls glXSwapBuffers, a swap is scheduled in the server for the current frame count plus the interval count (though if the swap is to be a page flip, what gets scheduled is the queueing of the flip, since flips occur at the next vblank after being queued). 
   1. The server calls into the DDX driver's ->[[ScheduleSwap|ScheduleSwap]] routine, which is responsible for requesting a kernel frame event for when the swap should occur 
   1. Control is returned to the client immediately 
         1. Note: any further GLX calls requiring a GLX context to be bound will block until the swap completes 
   1. When the DDX receives the associated frame event, it will perform the scheduled swap 
   1. The swap count will increase 
   1. The client will be unblocked if necessary 
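
The flow above can be modelled in a few lines.  This is a toy simulation rather than server code: `SwapScheduler` and its methods are hypothetical names, vblanks are driven by hand, and the page flip special case is not modelled.

```python
class SwapScheduler:
    """Toy model of interval-delayed swaps (not actual X server code)."""

    def __init__(self):
        self.msc = 0         # frame count
        self.sbc = 0         # number of completed swaps
        self.pending = []    # (target_msc, client) pairs awaiting a frame event
        self.blocked = set()

    def swap_buffers(self, client, interval=1):
        """Schedule a swap for the current frame count plus the interval.

        Returns immediately, mirroring control returning to the client.
        """
        target = self.msc + interval
        self.pending.append((target, client))
        return target

    def glx_call(self, client):
        """A further GLX call: blocks (returns False) while a swap is pending."""
        if any(c == client for _, c in self.pending):
            self.blocked.add(client)
            return False
        return True

    def vblank(self):
        """Frame event: perform due swaps, bump the swap count, unblock clients."""
        self.msc += 1
        due = [c for t, c in self.pending if t <= self.msc]
        self.pending = [(t, c) for t, c in self.pending if t > self.msc]
        for client in due:
            self.sbc += 1
            self.blocked.discard(client)
        return due
```

A client that swaps with an interval of 2 gets control back at once, blocks on its next GLX call, and is released on the second vblank.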

### Implementation: SGI_video_sync for the X server

SGI_video_sync gives clients control over their framerate by exposing the frame count and allowing apps to wait on a given frame count.  glXGetVideoSyncSGI returns the current frame count for the display the drawable is on, and glXWaitVideoSyncSGI allows a client to block until a given frame count is reached on its drawable's display.  Flow in the server & client for glXGetVideoSyncSGI direct rendered case: 

   1. Client calls glXGetVideoSyncSGI 
   1. Mesa code receives the request and turns it into a DRI2GetMSCReq protocol request 
   1. display server receives the request and calls DDX driver's ->GetMSC hook 
   1. DDX driver returns the frame count based on the drawable's location (the count is returned for the CRTC with the greatest intersection with the drawable), or 0 if the drawable is currently offscreen (the GLX spec should be updated to reflect this).  Note that this case could instead return an error (possibly [[BadDrawable|BadDrawable]]). 
   1. display server returns a DRI2GetMSCReply with the current MSC count to the client 
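
The "greatest intersection" rule in the GetMSC step can be sketched as follows.  The `Rect` type and the function names are hypothetical; real drivers work on their own CRTC and drawable structures.

```python
from typing import NamedTuple, Optional, Sequence


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def overlap_area(a: Rect, b: Rect) -> int:
    """Area of the intersection of two rectangles, 0 if they don't overlap."""
    w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return w * h if w > 0 and h > 0 else 0


def pick_crtc(drawable: Rect, crtcs: Sequence[Rect]) -> Optional[int]:
    """Index of the CRTC covering most of the drawable, or None if offscreen."""
    best, best_area = None, 0
    for index, crtc in enumerate(crtcs):
        area = overlap_area(drawable, crtc)
        if area > best_area:
            best, best_area = index, area
    return best


def get_msc(drawable: Rect, crtcs: Sequence[Rect], frame_counts: Sequence[int]) -> int:
    """Frame count for the drawable's CRTC, or 0 when the drawable is offscreen."""
    index = pick_crtc(drawable, crtcs)
    return 0 if index is None else frame_counts[index]
```
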

Flow for the glXWaitVideoSyncSGI direct rendered case: 

   1. Client calls glXWaitVideoSyncSGI 
   1. Mesa code receives the request and turns it into a DRI2WaitMSCReq protocol request 
   1. display server receives the request and calls DDX driver's ->ScheduleWaitMSC hook which is responsible for requesting a kernel frame event for the specified [[WaitVideoSync|WaitVideoSync]] values 
   1. DDX blocks the client until the requested frame is received (which could be immediate if the frame has already passed) 
   1. DDX receives the event and unblocks the client, calling into the server to complete the reply 
   1. display server returns a DRI2WaitMSCReply with current MSC, SBC to the client 

Again, if the window is redirected or the drawable is offscreen, the client won't block; an MSC reply with all zeros will be returned (again, this could also return [[BadDrawable|BadDrawable]]). 
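
The wait condition itself is simple arithmetic: SGI_video_sync defines the wait as lasting until the frame count modulo the divisor equals the remainder.  The sketch below computes the frame count at which a waiter would wake.  The function name and the `on_screen` flag are hypothetical, and returning immediately when the current count already satisfies the condition is an assumption of this example.

```python
def wait_video_sync(current_count, divisor, remainder, on_screen=True):
    """Return (blocks, count_at_wakeup) for a divisor/remainder wait.

    Illustrative only. For a redirected or offscreen drawable the client
    does not block and a zero count is reported, as described above.
    """
    if not on_screen:
        return (False, 0)
    if divisor <= 0 or not 0 <= remainder < divisor:
        raise ValueError("need divisor > 0 and 0 <= remainder < divisor")
    # Frames to wait until count % divisor == remainder (0 if already true).
    delta = (remainder - current_count) % divisor
    return (delta != 0, current_count + delta)
```

For instance, at frame 10 a wait with divisor 3 and remainder 2 wakes at frame 11.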


### Testing

The code for the above is present in several repos and patches: 

   1. kernel - 2.6.33-rc 
   1. libdrm - 2.4.17 or newer 
   1. dri2proto - 2.2 or newer 
   1. glproto - 1.4.11 or newer 
   1. mesa - master branch (will be in 7.8) 
   1. xserver - master branch (will be in 1.9) 
   1. xf86-video-intel - master branch (will be in 2.11) 

See [[Graphics stack git development|http://wiki.x.org/wiki/Development/git]] for information on how to build a stack with the above. 

The direct rendered cases outlined in the implementation notes above are complete, but there's a bug in the async glXSwapBuffers that sometimes causes clients to hang after swapping rather than continue. 


#### Open issues

This implementation does not guarantee tear free drawing in the non-composited, non-fullscreen (flip) case, but does provide the throttling feature implicit in SGI_swap_control.  For a tear-free guarantee, the application must be performing full screen swaps eligible for page flipping, or a compositor using page flipping must be present and the application's window redirected.  Alternatively, the driver can synchronize its blit activity with the scanout position to avoid tearing; however, this approach can negatively affect performance. 

The implementation also needs screen information from the compositor in the case of redirected windows, so it can request a vblank event from the kernel for the correct CRTC. 


# Using the new code

Some Linux applications may assume that glXSwapBuffers blocks until the swap has completed.  With the above code, that's no longer the case (it's also not the case on other GLX implementations, so it's a non-portable assumption). 

Similarly, some code may assume no throttling of swaps occurs.  Now this behavior can be controlled with glXSwapIntervalSGI or by using glXSwapBuffersMscOML. 

See below for specific use cases. 


## Avoiding tearing

Tearing occurs when blits or scanout buffer changes aren't synchronized with vertical retrace, causing two frames to appear adjacent to one another in the vertical (see [[Screen_tearing|http://en.wikipedia.org/wiki/Screen_tearing]] for an example).  Since vertical blank periods are often very short (especially in LCD panels) and CPU scheduling between processes can be highly variable, simply blocking until a vertical retrace completes (e.g. using glXWaitForMscOML or glXWaitVideoSyncSGI) is not a reliable way of avoiding tearing. 

Some DDX drivers provide an option to synchronize DRI2CopyRegion requests (generated by glXCopySubBufferMESA calls and some paths in glXSwapBuffers calls); this can prevent tearing at a potentially significant performance cost since some GPUs will stall until the vertical retrace is outside the region to be copied. 

Another way to avoid tearing, assuming you're running a kernel with page flipping support, is to run your application full screen or under a compositing manager.  If your app or compositing manager uses glXSwapBuffers to display new frames (as opposed to using glXCopySubBufferMESA without driver vertical retrace synchronization), the DDX and server should coordinate to flip whole new scanout buffers through the kernel, which synchronizes the flip to vertical retrace. 


## Throttling rendering

Use SGI_video_sync, SGI_swap_control, OML_sync_control or ARB_sync extensions to render at a constant rate.  Depending on the application, it may be appropriate to throttle your rendering to a factor of the refresh rate (using SGI_swap_control or OML_sync_control) if you can't keep up with it; this avoids a variable frame rate which can be visually distracting.  However, for many animations, especially those simulating physical activities (e.g. a bounce or slide), maintaining refresh rate rendering is critical to visual quality, so reducing quality may be a better option than dropping frames or displaying every other frame in those cases where your application can't keep up with the refresh rate. 
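
Picking a fixed "factor of the refresh rate" amounts to rounding the per-frame render time up to a whole number of refresh periods.  The helper names below are hypothetical and the 60 Hz default is only an example value; the resulting interval is the kind of number one would pass to a swap interval call.

```python
import math


def choose_swap_interval(render_time_ms, refresh_hz=60.0):
    """Smallest whole number of refresh periods that covers one frame's render time."""
    period_ms = 1000.0 / refresh_hz
    return max(1, math.ceil(render_time_ms / period_ms))


def effective_fps(interval, refresh_hz=60.0):
    """Constant frame rate that results from swapping every `interval` refreshes."""
    return refresh_hz / interval
```

An application that needs about 20 ms per frame on a 60 Hz display would settle on an interval of 2 and a steady 30 fps, rather than alternating between 60 and 30.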


## Controlling buffer swap behavior

Use a TBD_swap_method_control to select triple buffering, blit or exchange methods. 


## Mutter

* When memory isn't a concern: 
   * Triple buffer all composited applications: 
      * 1) a buffer busy being used as part of the compositor's next render. 
      * 2) a front buffer for the application to queue render commands against, swapping with 3) when done. 
      * 3) a back buffer waiting to be picked up by the compositor, which swaps it with 1). 
      * Ideally the compositor never has to wait to pick up the application's next front buffer, and no copying is required. 
   * use the SGI_swap_control extension to call glXSwapIntervalSGI(1) in applications and compositor. 
      * compositor "video frame periods" are defined - as normal - so we flip at the first vblank after a render completes.  
      * Allow the compositor to drive the video frame period of redirected applications (i.e. consider the compositor to be a pseudo display for redirected drawables) 
      * (Let's assume the compositor is rendering to multiple windows to cover multiple displays, because the full size of the monitors exceeds the GPU render target limits) 
      * The compositor can use a swap group to ensure each of these windows presents in sync. 
      * A new fence/sync object like extension could be implemented that allows the compositor to say: "when my sync group becomes ready and swaps, please notify all these composited-drawables of a video frame progression". 
      * I'm not sure what the best way to link composited drawables to a compositor is at the moment 
      * when the compositor's group swap completes, I imagine it would be possible to avoid waiting for the compositor to be scheduled just so it can send a DRI2 request to the X server, which then sends events to clients.  I.e. instead of using new DRI2 protocol to send the notification, could it not be handled by the drm driver?  The driver would presumably know when the compositor's group swap completes; it could then increment the video frame period for the associated set of composited drawables and unblock those that have passed their designated swap interval.  (Or, if we have asynchronous swap buffers, see below, an event could be sent by writing to the device file.) 
      * asynchronous glXSwapBuffers and swap-buffers-complete events. 
         * Although the applications shouldn't run ahead of the compositor, since there's no point rendering more frames than the compositor can keep up with, applications also shouldn't be blocked from queueing up commands for the next frame. If we had asynchronous swapping + poll-able event notifications for swap-completion the application could stop itself painting when it knows it is two frames ahead of the compositor.  
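
The "stop painting when two frames ahead" idea needs only two counters on the client side.  The sketch below assumes asynchronous swaps plus some swap-complete notification the client can poll for; the `FramePacer` class and its method names are hypothetical.

```python
class FramePacer:
    """Client-side pacing against swap-complete events (illustrative sketch)."""

    def __init__(self, max_ahead=2):
        self.max_ahead = max_ahead  # frames the client may run ahead
        self.submitted = 0          # swaps issued by the client
        self.completed = 0          # swap-complete events received

    def can_paint(self):
        """True while the client is fewer than max_ahead frames ahead."""
        return self.submitted - self.completed < self.max_ahead

    def swap_issued(self):
        self.submitted += 1

    def swap_complete_event(self):
        self.completed += 1
```

An event loop would check `can_paint()` before drawing and otherwise go back to waiting for events, so the application never blocks inside the swap call.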

### Misc notes (rib Thu Sep  3 20:40:43 BST 2009)

* It seems that composited apps should never need to know about real world screen vblank issues; that's only relevant to non-redirected windows, including the compositor's. When dealing with a redirected window, it seems acceptable to come up with an entirely fake number for all existing extensions that care about vblanks. Somehow tying it to the render/swap-complete of the current compositor seems reasonable. 
* Assuming we have the compositor generating a fake swap interval as above, and the compositor itself is responsible for synchronizing all the windows it manages, it seems like all the swap group related extensions may just work. 
   * Walking through a hypothetical example of a composited flightgear simulator across multiple monitors seems to add up... 
   * Let's say the compositor has two windows across two monitors (assuming it would exceed render target limits to just have one) 
   * Say flightgear also creates two windows for the same reason and wants to use a swap group to ensure they get presented at the same time. 
   * Assume the compositor is itself also using a swap group to ensure all its windows get presented at the same time, and it drives the video frame period according to its own swap group becoming ready and completing. 
   * If the first flightgear window is drawn to and a swap issued, it becomes ready but doesn't actually swap yet (so the compositor won't see it) 
   * The compositor may at this point complete its current frame and swap, and the latest flightgear window won't be shown. 
   * The second flightgear window can be drawn to and a swap issued, which makes the group ready, so now both windows are swapped and become available to the compositor. 
   * The compositor will pick up the new window contents, and since it is itself using a swap group, both windows will be presented in sync. 
* I can't see how GLX_OML_swap_method can be supported at all, given that GLX doesn't know ahead of time if any glx window will be later redirected? 
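
The flightgear walk-through above can be checked with a tiny model of a swap group: a member that issues a swap only becomes "ready", and nothing is presented until every member is ready.  The `SwapGroup` class below is a hypothetical illustration, not an implementation of SGIX_swap_group.

```python
class SwapGroup:
    """Members swap together once all of them have issued a swap (toy model)."""

    def __init__(self, members):
        self.members = set(members)
        self.ready = set()
        self.back = {m: None for m in self.members}       # being drawn to
        self.presented = {m: None for m in self.members}  # visible to the compositor

    def draw(self, member, frame):
        if member not in self.members:
            raise KeyError(member)
        self.back[member] = frame

    def swap(self, member):
        """Mark a member ready. Returns True if this made the whole group swap."""
        if member not in self.members:
            raise KeyError(member)
        self.ready.add(member)
        if self.ready != self.members:
            return False  # still waiting for the other members
        for m in self.members:
            self.presented[m] = self.back[m]
        self.ready.clear()
        return True
```

After the first window swaps, the compositor still sees the old contents of both windows; only the second window's swap makes both new frames available at once.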

## Reference

For reference: 

   * [[Apple GL Programming Guide|http://developer.apple.com/mac/library/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_intro/opengl_intro.html]] - covers best practices for GL programmers in Apple's composited environment 
   * [[Overview of triple buffering from a gamer perspective|http://www.ocworkbench.com/2006/articles/DXtweaker/]] 
   * [[Android Developer's Guide|http://developer.android.com/guide/index.html]] 
   * [[DirectX Programmer's Guide|http://msdn.microsoft.com/en-us/library/bb173024(VS.85).aspx]]