Linpicker-on-Xen Interprocess Communication (IPC)
David A. Wheeler
2008-05-09

This document describes, in great detail, how Linpicker communicates on top of Xen. The goal is to make exactly what happens clear, and to ensure that we have not missed anything regarding security.

Linpicker involves the interaction of two types of virtual machines (VMs):
* Server. This is trusted, runs the Linpicker server, and interacts directly with the user. For now it will run on VM0, though in theory it could run elsewhere.
* Client(s). Each client is responsible for setting up its graphical environment, and communicating with the server to express what it would LIKE to have displayed. The server does NOT trust the clients, and clients typically do not trust each other either. From here on, we will discuss one client at a time, but with the acknowledgement that there often are several.

Clients talk directly with the server: Linpicker does not provide any communication channels between clients, except for timing. The drag-and-drop feature of Nitpicker is not part of the current implementation, and in any case, that would be mediated by the server.

Inside each VM there is a kernel layer and a user layer:
* Kernel: Some operations can only be done (or done efficiently) at kernel level, so they are done here. The current implementation presumes that the Linux kernel is running on both ends.
* User layer: As much as practicable, work is done at the user-process layer.

The server and the clients are split into components; they currently execute with root privileges because they need privileged operations, but we plan to eventually sandbox them to a small set of operations. So there are 4 major components:
* Linpicker-server-module: Linux-kernel-level module on the server VM.
* Linpicker-server-app: Userspace application process on the server VM.
* Linpicker-client-module: Linux-kernel-level module on each client VM.
* Linpicker-client-app: Userspace application process on each client VM, typically fed by an X-windows driver and an X application monitoring user windows.

There may be many user applications and window managers, but these are external for purposes of this document. Generally, they would communicate with X, which communicates with the linpicker-client-app.

These 4 components communicate via shared memory areas and events:
* For each client VM there is a shared memory area called "commo" between that client VM and the server VM. It gets mmapped into user-level memory on both server and client, and is used for all message-passing between server and client. Most kernel drivers would know a lot about the message formats, but in our case that would involve a lot of code duplication between the kernel and the userspace application. So instead, the kernel driver provides a very minimal "shared memory" construct, and most communication is done via shared memory between userspace programs. The userspace programs DO need to implement the specific message formats, of course, but this means that ONLY the sender and receiver at the user level must understand the formats. The kernel driver merely makes shared memory areas available for communication.
* For each display buffer on each client, there is a shared memory (display) "buffer" area between the client and server VMs; these also get mmapped into shared memory at the user level. These store bitmaps of the screen(s) of that VM.
* Events are used to tell clients and servers that data is ready.
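To make the "commo" idea concrete, here is a minimal sketch of what such an area might contain. This is illustrative only: the real message formats live in the userspace programs (see nitpicker.idl, mentioned later), and every name below (lp_commo, lp_ring, lp_msg, LP_RING_SLOTS) is hypothetical.

    /*
     * Hypothetical layout of the "commo" shared page: two rings plus their
     * producer/consumer indexes.  Sized so the whole structure fits in one
     * 4 KiB page (16 slots of 64 bytes per ring).
     */
    #include <stdint.h>

    #define LP_RING_SLOTS 16              /* power of two, so indexes can be masked */

    struct lp_msg {
        uint32_t type;                    /* message type, e.g. "new display buffer" */
        uint32_t len;                     /* payload bytes actually used */
        uint8_t  payload[56];             /* fixed-size slot keeps indexing simple */
    };

    struct lp_ring {
        volatile uint32_t prod;           /* written only by the sender */
        volatile uint32_t cons;           /* written only by the receiver */
        struct lp_msg msg[LP_RING_SLOTS];
    };

    struct lp_commo {
        struct lp_ring c2s;               /* client-to-server messages and their replies */
        struct lp_ring s2c;               /* server-to-client: mouse and keyboard input */
    };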
Xenstore is used to share some information and status, particularly the locations of the shared memory areas. (This simplifies starting up the "commo" area used for message-passing.)

The implementation intentionally uses shared memory for communication, which means that it cannot reasonably be used DIRECTLY over a network. That is not really a limitation, though. For network access, use a local VM (to use the shared memory), and then run an application designed for network transport on that VM (e.g., VNC, X-windows commands). This allows high-speed trusted interaction, while still allowing the use of techniques that reduce network load using complicated compression techniques and protocols. What's more, network communication and data compression involve a lot of code; keeping that code out of the trusted computing base is a good idea.

The userspace and kernelspace components must communicate with each other. The client userspace communicates with the client kernelspace via the following (a usage sketch follows the server kobject list below):

/dev/linpicker_client_co: a character special device representing the "commo" shared memory area (used for message-passing between server and client). The client userspace opens, mmaps, and then reads/writes this file to communicate with the server. This area contains both input and output rings, as well as producer/consumer values. To send a message, write to this memory area and then write to "event". To read a message, read from this memory area.

/dev/linpicker_client_event_co: a character special device representing an event. Write a character to this device to send an event to the server (notifying the server that data is ready).

/dev/linpicker_client_bufferNUM_co: a character special device representing a display buffer memory area. Write to this to change the display contents. Note that you must then send a refresh message before the change will be displayed.

/sys/class/linpicker/client/commo: a sysfs pseudo-file; after the commo area has been mmapped and read, it contains (in hex text form) the starting address of the commo shared memory area, as this VM's physical address.

/sys/class/linpicker/client/numbuffers: a sysfs pseudo-file; read it to determine the number of display buffers, write it to change that number.

/sys/class/linpicker/client/NUM/: where NUM is 0, 1, etc.:
  start: start position of that display buffer's memory area.
  length: length of that display buffer's memory area (in bytes).

TODO: Rewrite this to be consistent with above.

Here are the /sysfs kobjects of linpicker-server-module, which live under /sysfs/linpicker-server/... and are visible to linpicker-server-app (in all cases, readable/writable only by root):

newvm: write a number to it to create the corresponding per-VM entry.

These CAN exist, but do not exist on startup, in /sysfs/linpicker-server/...:
  #/: 1, 2, etc., one for each VM. TODO: Should they be UUIDs?
    commo: memory area shared with the Linpicker client for communication. Same format as the "commo" description for the client.
    event: wait on a read to wait for an event from the client; write to send an event to the client.
    numbuffers: number of buffers.
    #/: where # is 0, 1, 2, etc.; represents buffer 0, buffer 1, etc.
      #/region: that display buffer's memory area; see above for the format.
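Here is a minimal sketch of how a client userspace program might use the device and sysfs files described above. It assumes the commo area is a single page and omits the Xenstore step and most error handling; the file names come from the descriptions above, everything else is illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open and mmap the "commo" message-passing area. */
        int fd = open("/dev/linpicker_client_co", O_RDWR);
        if (fd < 0) { perror("open commo"); return 1; }

        long pagesize = sysconf(_SC_PAGESIZE);
        char *commo = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (commo == MAP_FAILED) { perror("mmap commo"); return 1; }

        /* Touch the area so the underlying page is actually mapped. */
        (void)commo[0];

        /* Learn the kernel's view of the area's start address (hex text). */
        unsigned long start = 0;
        FILE *f = fopen("/sys/class/linpicker/client/commo", "r");
        if (f) {
            fscanf(f, "%lx", &start);
            fclose(f);
        }
        /* ... publish 'start' via Xenstore, then build a message in *commo ... */

        /* Tell the server that new data is ready. */
        int evt = open("/dev/linpicker_client_event_co", O_WRONLY);
        if (evt >= 0) {
            write(evt, "1", 1);
            close(evt);
        }
        return 0;
    }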
There are a few major types of operations:
* Initialize client VM. When started up, a client VM allocates a shared memory area, "commo", to be used for future messages between it and the Linpicker server VM (in either direction), and contacts the Linpicker server VM to establish this communication link. The userspace programs on the client and server mmap this area.
* Create new display buffer. Every time a client VM wants to create a "buffer" (which normally represents a display), it eventually creates, and tells the Linpicker server VM about, that display - which is again a shared memory area between the client and server VMs. Again, the userspace programs on client and server mmap this area.
* Send/receive message. The client and server user programs communicate with each other by (1) modifying shared memory areas, and then (2) sending events (using their kernel modules) to alert the "other" VM of changes. In particular, movements/resizes of windows and "damage" to screen display areas convert to messages from the client to the server.

Each of these operations is described in detail below, along with "initialize Linpicker server", which gets things ready.

Initialize server VM:
* (Possibly) Linpicker-server-module inserted into server kernel
* Linpicker-server-app started
  + Checks if linpicker-server-module is inserted
    - if not, tries to start it; the app fails if the kernel module can't be started
  + Queries Xenstore - do we have _existing_ commo and buffer areas set up (presumably by a previously-crashed linpicker-server-app)? In the long term these could be reset - in the short term, complain.
  + Sets up watchpoints on Xenstore, so we'll know when we have a new Linpicker client VM

Initialize client VM:
* Linpicker-client-module inserted into client kernel (note: this can happen at client boot time, or later)
  + Allocates the "commo" memory area for message-passing between client and server (sketched after this section). Rationale: In theory, the client could allocate memory via userspace, but this turns out to be complicated. You need to do grant table manipulation, and this requires knowledge of the _kernel_ view of memory addresses (which the kernel has and userspace doesn't). In addition, the area has to be allocated a page at a time; that is trivial in the kernel, but in userspace it requires valloc() (which is obsolete) or posix_memalign() (which requires extra effort to get the right page size). And it's better for the _client_ kernel to do this, not the server, so that if there's movement later, the memory addresses are stable on the client (Xen apps in general allocate on the client).
  + Sets up the grant table, exporting the area to VM0
  + Ensures that information about it is exposed to (root-user) userspace, particularly the "commo" memory area (so it can be mmapped) and information on its starting location. This is via the /sys filesystem.
* Linpicker-client-app starts
  + Checks if linpicker-client-module is inserted
    - if not, tries to start it; the app fails if the kernel module can't be started
    - if it _IS_ started, it will allocate the "commo" area; see above.
  + Opens /dev/linpicker_client_co, mmaps it, reads a character (this forces the underlying page to be mapped to a physical page)
  + From the /sys fs, gets info on commo's starting memory address. This is in: /sys/class/linpicker/client/commo
  + Sets Xenstore values to identify to VM0 where the "commo" area is. Note: This is done from userspace, not kernelspace. It's much easier to send information from userspace (even though it takes more CPU) because there are nice shell tools to do this. Besides, we're trying to maximize what's done in userspace. If Xenstore goes away, this can be easily changed at the user level to some other method.
  + Linpicker-client-app waits until Xenstore shows "ready for communication". Note: if waiting doesn't happen, some data from the client may not be sent to the server, but that is the client's fault, and causes no harm to the rest of the system.
* Linpicker-server-app notices the Xenstore change (because of watchpoints)
  + Reads the commo starting location from Xenstore
  + Sets /sys fs values; this tells linpicker-server-module to:
    - Examine the claimed shared memory area - require that it belong to that VM, in a race-free way (make sure the client can't remove it or itself). This check is critical; otherwise, a VM could claim it owned a memory area that it didn't, and mess up another VM. A client VM can mess itself up, of course, but a client VM can do that at any time WITHOUT using Xen.
    - Set the grant table of commo, so we share the commo memory
    - Set up a (bidirectional) event kobject for messages
    - Expose new mmap-able kobjects that represent commo to server userspace
  + Mmaps the newly exposed kobject into its userspace, so that it can read/write a shared memory area with the client VM's userspace.
  + Sets Xenstore, saying that the server is ready for communication
* Linpicker-client-app sees the Xenstore "ready for communication"
  + From here on, communication is primarily through the "commo" area using "send/receive message", as discussed below.
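The kernel-side allocation and grant step above might look roughly like the following. This is a hedged outline, not the actual linpicker-client-module code: header locations and the virt_to_mfn() helper vary between Xen-enabled kernel trees, and commo_page/commo_gref/linpicker_alloc_commo are made-up names.

    #include <linux/gfp.h>
    #include <linux/module.h>
    #include <xen/grant_table.h>
    #include <asm/xen/page.h>         /* virt_to_mfn(); exact location varies by tree */

    static unsigned long commo_page;  /* kernel virtual address of the commo page */
    static grant_ref_t commo_gref;    /* grant reference handed to the server (VM0) */

    static int linpicker_alloc_commo(void)
    {
        int ref;

        /* One whole, page-aligned page: trivial in the kernel (see rationale above). */
        commo_page = __get_free_page(GFP_KERNEL | __GFP_ZERO);
        if (!commo_page)
            return -ENOMEM;

        /* Grant read/write access to domain 0, where the Linpicker server runs. */
        ref = gnttab_grant_foreign_access(0, virt_to_mfn((void *)commo_page), 0);
        if (ref < 0) {
            free_page(commo_page);
            return ref;
        }
        commo_gref = ref;

        /* The start address is then exposed via /sys/class/linpicker/client/commo. */
        return 0;
    }

The userspace alternative rejected above would have to do the same page-aligned allocation with posix_memalign() and then somehow perform the grant-table manipulation without the kernel's view of addresses, which is why the allocation lives in the client kernel module.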
Create new display buffer:
* linpicker-client-app writes to linpicker-client-module's /sysfs area to create a new buffer
* linpicker-client-module allocates the new memory area in the kernel, and sets other /sysfs kobjects so that its location, etc., are revealed.
* linpicker-client-app:
  + waits for creation of the new display buffer area by linpicker-client-module
  + mmaps the new buffer into its userspace area
  + sets Xenstore info, identifying the buffer memory area. Rationale: This COULD be sent via message, but it seems useful to have this kind of status info put in Xenstore (e.g., for debugging)
  + sends the message "new display buffer" to linpicker-server, using the usual "send/receive message" below
* linpicker-server-app receives the "new display buffer" message per the usual "send/receive message" call. It processes it as usual, but unlike essentially all other calls, it will use Xenstore to get some info (namely, the client's buffer address)
* linpicker-server-app sets some values on linpicker-server-module's /sysfs to establish shared memory for the new buffer
* linpicker-server-module receives on /sysfs the new buffer's shared-memory address (from linpicker-server-app, which got it from Xenstore)
  + Checks to ensure that the buffer memory area really IS owned by that VM, in a race-free way (make sure the client can't remove it or itself)
  + If so, sets grant tables so that the server VM and client VM share it
  + Sets kobjects so linpicker-server-app can get info
* linpicker-server-app mmaps the new shared memory of the display buffer

Send/Receive message (using the Xen rings in the "commo" shared memory area; a sketch follows these steps):
* The (userspace) application:
  + sets the "commo" shared-memory area with the data of the new message, using the existing Xen macros. Note: The server MUST NOT BLOCK, and must be cautious, since the untrusted client could make arbitrary modifications to the shared memory.
  + writes to the "event" pseudofile exposed by its kernel module
* The sending linpicker-*-module receives the 'event' write
  + Sends an event to the "other" VM, using a Xen hypercall, alerting the other VM that a new message is available
* The receiving linpicker-*-module receives the event from the other VM
  + Translates it to its "event" pseudofile (which can be waited on)
* The receiving userspace application notices that info is available
  + (It was already waiting on the "event" pseudofile)
  + Examines the shared memory to get the needed data
  + The server must not trust the client. Thus, it MUST copy data before checking and using it (to prevent the client from changing data after it's been checked). It then must check the client's data with great care.
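The following sketch shows the userspace halves of this flow, using the hypothetical lp_commo layout sketched earlier. The __sync_synchronize() builtin stands in for the wmb()/rmb()-style barriers mentioned later in this document; the function names and the "full ring" policy are illustrative, not the real implementation.

    #include <stdint.h>
    #include <unistd.h>

    /* Client side: enqueue a message, then poke the event device. */
    static int client_send(struct lp_commo *commo, int event_fd,
                           const struct lp_msg *m)
    {
        struct lp_ring *r = &commo->c2s;
        uint32_t prod = r->prod;

        if (prod - r->cons >= LP_RING_SLOTS)
            return -1;                              /* ring full; try again later */

        r->msg[prod & (LP_RING_SLOTS - 1)] = *m;
        __sync_synchronize();                       /* message visible before index bump */
        r->prod = prod + 1;

        return write(event_fd, "1", 1) == 1 ? 0 : -1;
    }

    /* Server side: copy the message out BEFORE validating or using it, and
     * mask every index taken from shared memory so a hostile client cannot
     * force an out-of-bounds access. */
    static int server_receive(struct lp_ring *r, struct lp_msg *out)
    {
        uint32_t prod = r->prod;
        uint32_t cons = r->cons;

        if (prod == cons)
            return 0;                               /* nothing pending */
        __sync_synchronize();                       /* see the slot the producer filled */

        *out = r->msg[cons & (LP_RING_SLOTS - 1)];  /* private copy first */
        r->cons = cons + 1;

        if (out->len > sizeof(out->payload))        /* then validate the copy */
            return -1;                              /* malformed; drop it */
        return 1;
    }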
Note: Sending and receiving a message involves several context switches. E.g., from a client VM, a message goes from client userspace, to the client kernel (when the event is signalled), to the server kernel, to server userspace (again signalling the event). This does add effort to service a message, but it is a necessary consequence, because (1) we want to maximize the functions done in userspace and (2) userspace cannot signal events. There has been some discussion about removing limitation #2 in Xen. (Note: this has nothing to do with the Linux CONFIG_KOBJECT_UEVENT option, which enables _Linux_ _kernel_ events, which are different.) The alternative is to do much more in kernelspace, but for both security and reliability reasons this seems to be a worse alternative.

TODO: Describe what happens on VMx shutdown/crash.
TODO: Describe "remove buffer". Maybe we don't need that right now.

The "commo" area includes two ring buffers and their housekeeping information (indexes for producers and consumers):
* One ring buffer is for client-to-server messages; it also includes the reply messages from the server back to the client (when there are replies).
* One ring buffer is for server-to-client messages, which are currently mouse movements and keyboard actions. There is no reply.

There are indexes, but not pointers as such, in the commo area. The server must perform bitwise operations on the indexes it retrieves, to ensure that it will never attempt an out-of-bounds access (due to malicious data values set by a client). Indeed, the "apparent" addresses will be different between client and server.

The "commo" and display-frame memories are shared between the client and server, since this speeds communication, and they are shared at the user level of client and server (to minimize privilege). The userspace programs share this memory using mmap(), a standard POSIX call. This enables userspace programs in two different VMs to share a view of the same region of memory.

There are a number of IPC implementation approaches we considered. Our design requires that the user apps work with kernel code, which in turn works with the kernel code on the other side. Here are the alternatives that were considered:
* sysfs alone. The Linux kernel developers strongly encourage all kernel modules to expose their capabilities through the "sysfs" filesystem, which uses kobjects. Unfortunately, while reading and writing values can be intercepted, sysfs does NOT support controlling mmap() on such values. In addition, Linux kernel developers strongly recommend that only text values be read or written in sysfs; while this is not strictly enforced, exposing binary information this way would likely be rejected.
* /proc. The kernel modules _could_ expose the shared memories in /proc, because /proc _does_ allow the overriding of mmap, but the use of /proc in this way for new modules is STRONGLY discouraged by the Linux kernel maintainers. We want it to be POSSIBLE to submit this code to the Linux kernel maintainers, and this would almost certainly cause it to fail.
* /dev/kmem. Doing an mmap() on /dev/kmem is possible, but many systems do NOT support /dev/kmem. See: http://lwn.net/Articles/147901
* /dev/mem. Doing an mmap() on /dev/mem is very simple to do, enables quick implementation, and has been historically well supported. We have run test code to ensure that it works, and it does. We were squeamish about doing this, because it means that the user processes would have full access to ALL of the underlying memory.
Of course, root-level programs can normally mmap this file anyway, but SELinux (etc.) could normally prevent that, and such protection no longer applies once /dev/mem is mapped. It was believed that doing this would be easy to implement, and that it could later be switched to using special files if needed. Almost immediately after making this decision, it was announced (on 2008-04-30) that a new patch had been accepted into the Linux kernel's development line so that, by default, the Linux kernel will PREVENT memory access using /dev/mem, making this approach harder to use and unlikely to be acceptable to a broad range of users (see: http://lwn.net/Articles/279557/ ). Their rationale was hard to argue against: today, the only primary users appear to be rootkit authors, and backwards compatibility for them is not high on the kernel developers' priority list :-). At first we decided to go forward anyway, but it turned out that the code necessary for the "usual" approach could be rather different. So we switched to using device files.
* Device files. The kernel modules _could_ create special files (block or character type) and then implement mmap() on them. This is not trivial, but it turns out that using the "lower-level" kobject routines directly (instead of using, e.g., device files) is not easy either, and their direct use without larger constructs is discouraged (see "Linux Device Drivers", 3rd edition, page 365). Should "char" or "block" devices be used? Since these are blocks of memory, you'd think they should be block devices, but block devices' primary purpose is to support filesystems inside them, something that makes no sense for this driver. In addition, block devices have a number of complications to enable command queueing and reordering, which are useless for this kind of driver. In contrast, the "char" driver directly supports mmap(). So it turns out that a "char" driver is the best kind of driver for this capability. The default hotplug/udev mechanisms will automatically create the /dev special files once the appropriate driver information is created in the kernel.

Xenstore is used for communicating some information and status. We require that the areas of Xenstore specific to some untrusted VMx cannot be written to by another untrusted VMy. Trusted VMs, like VM0, _can_ write to specific VM-specific areas to signal that the server is ready, but this can be permitted by access controls (so a trusted VM doesn't actually need a broad privilege to "write anywhere in Xenstore"). Note that userspace CAN set and receive watchpoints. More information about Xenstore is here: http://wiki.xensource.com/xenwiki/XenStoreReference

In the Linux kernel drivers, you would think that mmap() should be implemented using remap_pfn_range(). However, remap_pfn_range() has a subtle (non-obvious) limit: it cannot remap ordinary system RAM (see "Linux Device Drivers", 3rd edition, page 430). It turns out that remap_pfn_range() is widely used in drivers, but only because drivers usually don't need to allocate normal "real" memory in a fixed location (as we must). Since we MUST map "real" memory (to share it between VMs), this is fatal. Instead, we must use the "nopage" approach. This is more complicated, especially when mapping multiple pages that correspond to a single memory block. Multiple pages are necessary for supporting display framebuffers, but they really aren't required for the communication ring buffers and their index information, so it is easier to make them separate devices (this also reduces the number of pages that must be allocated together, making allocation failure less likely).
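Here is a sketch of that "nopage"-style approach for one of the char devices. It assumes a kernel recent enough to provide the .fault callback in vm_operations_struct (the 2.6.18-xen tree still uses the older .nopage method, which has a different signature), and commo_area/COMMO_PAGES are hypothetical names for a page-aligned buffer the module allocated with __get_free_pages().

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    #define COMMO_PAGES 1
    static void *commo_area;          /* page-aligned kernel buffer */

    static int commo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
        struct page *page;

        if (vmf->pgoff >= COMMO_PAGES)
            return VM_FAULT_SIGBUS;   /* refuse mappings beyond the buffer */

        page = virt_to_page(commo_area + (vmf->pgoff << PAGE_SHIFT));
        get_page(page);               /* hold a reference while it is mapped */
        vmf->page = page;
        return 0;
    }

    static struct vm_operations_struct commo_vm_ops = {
        .fault = commo_vm_fault,
    };

    static int commo_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        if (vma->vm_end - vma->vm_start > COMMO_PAGES * PAGE_SIZE)
            return -EINVAL;
        vma->vm_ops = &commo_vm_ops;
        return 0;
    }

Unlike remap_pfn_range(), this hands the VM subsystem one struct page at a time, which is what lets it map ordinary kernel-allocated RAM such as the commo page and display buffers.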
TODO: Document the Xenstore keynames.

To implement the Xen rings, it's important to note that in /usr/include:
* asm/page.h defines PAGE_SHIFT and PAGE_SIZE
* asm/system.h defines wmb() {write memory barrier} and related calls

For more information on the specific messages that are sent between the client and server, see nitpicker.idl and fbif.h. There are many details involving kernel calls; see "Linux Device Drivers", 3rd edition, and the documentation that comes with the Linux kernel (in Fedora, install kernel-doc).

Here's more information about Xenbus:
* http://wiki.xensource.com/xenwiki/XenIntro - especially the "adding new devices" section
* http://wiki.xensource.com/xenwiki/XenStoreReference - more about the Xenstore virtual file system
* http://wiki.xensource.com/xenwiki/XenArchitecture - overall pointer to other good info
* "Architecture for Split Drivers Within Xen" http://lists.xensource.com/archives/html/xen-devel/2005-11/txt1HIkY5oGLD.txt - old doc; key points superseded by: http://wiki.xensource.com/xenwiki/XenSplitDrivers

ORIGINAL Xen framebuffer implementation: The original Xen framebuffer is poorly documented; here are some of its highlights.
* /root/xen-unstable/linux-2.6.18-xen.hg/drivers/xen/fbfront/xenfb.c implements the front end. This is a kernel module, running in an untrusted VM. The front end starts up and calls xenfb_connect_backend, which puts lots of info on the Xenbus. (Need to document the relation to Xenstore.)
* /root/xen-unstable/linux-2.6.18-xen.hg/drivers/xen/fbfront/xenkbd.c is a related front-end kernel module, which also runs in an untrusted VM. This implements keyboard AND mouse input (not just keyboard).
* /root/xen-unstable/tools/ioemu/hw/xenfb.c implements a middle-end - an intermediary between the framebuffer front-end and the QEMU back-end. It runs in userspace (note that it includes stdio.h) on a trusted VM (VM0). (QEMU also runs on the trusted VM.) It waits for the front end in xenfb_wait_for_frontend (note: there's also a xenfb_wait_for_backend, which waits for QEMU). In xenfb_map_fb(), it uses xc_map_foreign_pages(), which eventually wends down to xc_map_foreign_batch(), which uses ioctl(xc_handle, IOCTL_PRIVCMD_MMAPBATCH, &ioctlx) to do the inter-VM memory mapping. NOTE: There are two DIFFERENT files named "xenfb.c", which is confusing.