Linpicker-on-Xen Interprocess Communication (IPC)
David A. Wheeler
2008-05-09

This document describes, in great detail, how Linpicker communicates on top of Xen. The goal is to make exactly what happens clear, and to ensure that we have not missed anything regarding security.

Linpicker involves the interaction of two types of virtual machines (VMs):
* Server. This is trusted, runs the Linpicker server, and interacts directly with the user. For now it will run on VM0, though in theory it could run elsewhere.
* Client(s). Each client is responsible for setting up its graphical environment, and communicating with the server to express what it would LIKE to have displayed. The server does NOT trust the clients, and clients typically do not trust each other either. From here on, we will discuss one client at a time, but with the acknowledgement that there often are several.

Clients talk directly with the server: Linpicker does not provide any communication channels between clients, except for timing. The drag-and-drop feature of Nitpicker is not part of the current implementation, and in any case, that would be mediated by the server.

Inside each VM there is a kernel layer and a user layer:
* Kernel: Some operations can only be done (or done efficiently) at kernel level, so they are done here. The current implementation presumes that the Linux kernel is running on both ends.
* User layer: As much as practicable, work is done at the user-process layer.

The server and the clients are split into components; they currently execute with root privileges because they need privileged operations, but we plan to eventually sandbox them to a small set of operations. So there are 4 major components:
* Linpicker-server-module: Linux-kernel-level module on the server VM.
* Linpicker-server-app: Userspace application process on the server VM.
* Linpicker-client-module: Linux-kernel-level module on each client VM.
* Linpicker-client-app: Userspace application process on each client VM, typically fed by an X-windows driver and an X application monitoring user windows.

There may be many user applications and window managers, but these are external for purposes of this document. Generally, they would communicate with X, which communicates with the linpicker-client-app.

These 4 components communicate via shared memory areas and events:
* For each client VM there is a shared memory area called "commo" between that client VM and the server VM. It gets mmapped into user-level memory on both server and client, and is used for all message-passing between server and client. Most kernel drivers would know a lot about the message formats, but in our case that would involve a lot of code duplication between the kernel and the userspace application. So instead, the kernel driver provides a very minimal "shared memory" construct, and most communication is done via shared memory between userspace programs. The userspace programs DO need to implement the specific message formats, of course, but this means that ONLY the sender and receiver at the user level must understand the formats. The kernel driver merely makes shared memory areas available for communication.
* For each display buffer on each client, there is a shared memory (display) "buffer" area between the client and server VMs; these also get mmapped into shared memory at the user level. These store bitmaps of the screen(s) of that VM.
* Events are used to tell clients and servers that data is ready.
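To make the "commo" idea concrete, here is a minimal sketch of what such an area might contain. This is illustrative only: the real message formats live in the userspace programs (see nitpicker.idl, mentioned later), and every name below (lp_commo, lp_ring, lp_msg, LP_RING_SLOTS) is hypothetical.

    /*
     * Hypothetical layout of the "commo" shared page: two rings plus their
     * producer/consumer indexes.  Sized so the whole structure fits in one
     * 4 KiB page (16 slots of 64 bytes per ring).
     */
    #include <stdint.h>

    #define LP_RING_SLOTS 16              /* power of two, so indexes can be masked */

    struct lp_msg {
        uint32_t type;                    /* message type, e.g. "new display buffer" */
        uint32_t len;                     /* payload bytes actually used */
        uint8_t  payload[56];             /* fixed-size slot keeps indexing simple */
    };

    struct lp_ring {
        volatile uint32_t prod;           /* written only by the sender */
        volatile uint32_t cons;           /* written only by the receiver */
        struct lp_msg msg[LP_RING_SLOTS];
    };

    struct lp_commo {
        struct lp_ring c2s;               /* client-to-server messages and their replies */
        struct lp_ring s2c;               /* server-to-client: mouse and keyboard input */
    };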
Xenstore is used to share some information and status, particularly the locations of the shared memory areas. (This simplifies starting up the "commo" area used for message-passing.)

The implementation intentionally uses shared memory for communication, which means that it cannot reasonably be used DIRECTLY over a network. That is not really a limitation, though. For network access, use a local VM (to use the shared memory), and then run an application designed for network transport on that VM (e.g., VNC, X-windows commands). This allows high-speed trusted interaction, while still allowing the use of techniques that reduce network load using complicated compression techniques and protocols. What's more, network communication and data compression involve a lot of code; keeping that code out of the trusted computing base is a good idea.

The userspace and kernelspace components must communicate with each other. The client userspace communicates with the client kernelspace via the following (a usage sketch follows the server kobject list below):

/dev/linpicker_client_co: a character special device representing the "commo" shared memory area (used for message-passing between server and client). The client userspace opens, mmaps, and then reads/writes this file to communicate with the server. This area contains both input and output rings, as well as producer/consumer values. To send a message, write to this memory area and then write to "event". To read a message, read from this memory area.

/dev/linpicker_client_event_co: a character special device representing an event. Write a character to this device to send an event to the server (notifying the server that data is ready).

/dev/linpicker_client_bufferNUM_co: a character special device representing a display buffer memory area. Write to this to change the display contents. Note that you must then send a refresh message before the change will be displayed.

/sys/class/linpicker/client/commo: a sysfs pseudo-file; after the commo area has been mmapped and read, it contains (in hex text form) the starting address of the commo shared memory area, as this VM's physical address.

/sys/class/linpicker/client/numbuffers: a sysfs pseudo-file; read it to determine the number of display buffers, write it to change that number.

/sys/class/linpicker/client/NUM/: where NUM is 0, 1, etc.:
  start: start position of that display buffer's memory area.
  length: length of that display buffer's memory area (in bytes).

TODO: Rewrite this to be consistent with above.

Here are the /sysfs kobjects of linpicker-server-module, which live under /sysfs/linpicker-server/... and are visible to linpicker-server-app (in all cases, readable/writable only by root):

newvm: write a number to it to create the corresponding per-VM entry.

These CAN exist, but do not exist on startup, in /sysfs/linpicker-server/...:
  #/: 1, 2, etc., one for each VM. TODO: Should they be UUIDs?
    commo: memory area shared with the Linpicker client for communication. Same format as the "commo" description for the client.
    event: wait on a read to wait for an event from the client; write to send an event to the client.
    numbuffers: number of buffers.
    #/: where # is 0, 1, 2, etc.; represents buffer 0, buffer 1, etc.
      #/region: that display buffer's memory area; see above for the format.
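Here is a minimal sketch of how a client userspace program might use the device and sysfs files described above. It assumes the commo area is a single page and omits the Xenstore step and most error handling; the file names come from the descriptions above, everything else is illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open and mmap the "commo" message-passing area. */
        int fd = open("/dev/linpicker_client_co", O_RDWR);
        if (fd < 0) { perror("open commo"); return 1; }

        long pagesize = sysconf(_SC_PAGESIZE);
        char *commo = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (commo == MAP_FAILED) { perror("mmap commo"); return 1; }

        /* Touch the area so the underlying page is actually mapped. */
        (void)commo[0];

        /* Learn the kernel's view of the area's start address (hex text). */
        unsigned long start = 0;
        FILE *f = fopen("/sys/class/linpicker/client/commo", "r");
        if (f) {
            fscanf(f, "%lx", &start);
            fclose(f);
        }
        /* ... publish 'start' via Xenstore, then build a message in *commo ... */

        /* Tell the server that new data is ready. */
        int evt = open("/dev/linpicker_client_event_co", O_WRONLY);
        if (evt >= 0) {
            write(evt, "1", 1);
            close(evt);
        }
        return 0;
    }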
There are a few major types of operations:
* Initialize client VM. When started up, a client VM allocates a shared memory area, "commo", to be used for future messages between it and the Linpicker server VM (in either direction), and contacts the Linpicker server VM to establish this communication link. The userspace programs on the client and server mmap this area.
* Create new display buffer. Every time a client VM wants to create a "buffer" (which normally represents a display), it eventually creates, and tells the Linpicker server VM about, that display - which is again a shared memory area between the client and server VMs. Again, the userspace programs on client and server mmap this area.
* Send/receive message. The client and server user programs communicate with each other by (1) modifying shared memory areas, and then (2) sending events (using their kernel modules) to alert the "other" VM of changes. In particular, movements/resizes of windows and "damage" to screen display areas convert to messages from the client to the server.

Each of these operations is described in detail below, along with "initialize Linpicker server", which gets things ready.

Initialize server VM:
* (Possibly) Linpicker-server-module inserted into server kernel
* Linpicker-server-app started
  + Checks if linpicker-server-module is inserted
    - if not, tries to start it; the app fails if the kernel module can't be started
  + Queries Xenstore - do we have _existing_ commo and buffer areas set up (presumably by a previously-crashed linpicker-server-app)? In the long term these could be reset - in the short term, complain.
  + Sets up watchpoints on Xenstore, so we'll know when we have a new Linpicker client VM

Initialize client VM:
* Linpicker-client-module inserted into client kernel (note: this can happen at client boot time, or later)
  + Allocates the "commo" memory area for message-passing between client and server (sketched after this section). Rationale: In theory, the client could allocate memory via userspace, but this turns out to be complicated. You need to do grant table manipulation, and this requires knowledge of the _kernel_ view of memory addresses (which the kernel has and userspace doesn't). In addition, the area has to be allocated a page at a time; that is trivial in the kernel, but in userspace it requires valloc() (which is obsolete) or posix_memalign() (which requires extra effort to get the right page size). And it's better for the _client_ kernel to do this, not the server, so that if there's movement later, the memory addresses are stable on the client (Xen apps in general allocate on the client).
  + Sets up the grant table, exporting the area to VM0
  + Ensures that information about it is exposed to (root-user) userspace, particularly the "commo" memory area (so it can be mmapped) and information on its starting location. This is via the /sys filesystem.
* Linpicker-client-app starts
  + Checks if linpicker-client-module is inserted
    - if not, tries to start it; the app fails if the kernel module can't be started
    - if it _IS_ started, it will allocate the "commo" area; see above.
  + Opens /dev/linpicker_client_co, mmaps it, reads a character (this forces the underlying page to be mapped to a physical page)
  + From the /sys fs, gets info on commo's starting memory address. This is in: /sys/class/linpicker/client/commo
  + Sets Xenstore values to identify to VM0 where the "commo" area is. Note: This is done from userspace, not kernelspace. It's much easier to send information from userspace (even though it takes more CPU) because there are nice shell tools to do this. Besides, we're trying to maximize what's done in userspace. If Xenstore goes away, this can be easily changed at the user level to some other method.
  + Linpicker-client-app waits until Xenstore shows "ready for communication". Note: if waiting doesn't happen, some data from the client may not be sent to the server, but that is the client's fault, and causes no harm to the rest of the system.
* Linpicker-server-app notices the Xenstore change (because of watchpoints)
  + Reads the commo starting location from Xenstore
  + Sets /sys fs values; this tells linpicker-server-module to:
    - Examine the claimed shared memory area - require that it belong to that VM, in a race-free way (make sure the client can't remove it or itself). This check is critical; otherwise, a VM could claim it owned a memory area that it didn't, and mess up another VM. A client VM can mess itself up, of course, but a client VM can do that at any time WITHOUT using Xen.
    - Set the grant table of commo, so we share the commo memory
    - Set up a (bidirectional) event kobject for messages
    - Expose new mmap-able kobjects that represent commo to server userspace
  + Mmaps the newly exposed kobject into its userspace, so that it can read/write a shared memory area with the client VM's userspace.
  + Sets Xenstore, saying that the server is ready for communication
* Linpicker-client-app sees the Xenstore "ready for communication"
  + From here on, communication is primarily through the "commo" area using "send/receive message", as discussed below.
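The kernel-side allocation and grant step above might look roughly like the following. This is a hedged outline, not the actual linpicker-client-module code: header locations and the virt_to_mfn() helper vary between Xen-enabled kernel trees, and commo_page/commo_gref/linpicker_alloc_commo are made-up names.

    #include <linux/gfp.h>
    #include <linux/module.h>
    #include <xen/grant_table.h>
    #include <asm/xen/page.h>         /* virt_to_mfn(); exact location varies by tree */

    static unsigned long commo_page;  /* kernel virtual address of the commo page */
    static grant_ref_t commo_gref;    /* grant reference handed to the server (VM0) */

    static int linpicker_alloc_commo(void)
    {
        int ref;

        /* One whole, page-aligned page: trivial in the kernel (see rationale above). */
        commo_page = __get_free_page(GFP_KERNEL | __GFP_ZERO);
        if (!commo_page)
            return -ENOMEM;

        /* Grant read/write access to domain 0, where the Linpicker server runs. */
        ref = gnttab_grant_foreign_access(0, virt_to_mfn((void *)commo_page), 0);
        if (ref < 0) {
            free_page(commo_page);
            return ref;
        }
        commo_gref = ref;

        /* The start address is then exposed via /sys/class/linpicker/client/commo. */
        return 0;
    }

The userspace alternative rejected above would have to do the same page-aligned allocation with posix_memalign() and then somehow perform the grant-table manipulation without the kernel's view of addresses, which is why the allocation lives in the client kernel module.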
Create new display buffer:
* linpicker-client-app writes to linpicker-client-module's /sysfs area to create a new buffer
* linpicker-client-module allocates the new memory area in the kernel, and sets other /sysfs kobjects so that its location, etc., are revealed.
* linpicker-client-app:
  + waits for creation of the new display buffer area by linpicker-client-module
  + mmaps the new buffer into its userspace area
  + sets Xenstore info, identifying the buffer memory area. Rationale: This COULD be sent via message, but it seems useful to have this kind of status info put in Xenstore (e.g., for debugging)
  + sends the message "new display buffer" to linpicker-server, using the usual "send/receive message" below
* linpicker-server-app receives the "new display buffer" message per the usual "send/receive message" call. It processes it as usual, but unlike essentially all other calls, it will use Xenstore to get some info (namely, the client's buffer address)
* linpicker-server-app sets some values on linpicker-server-module's /sysfs to establish shared memory for the new buffer
* linpicker-server-module receives on /sysfs the new buffer's shared-memory address (from linpicker-server-app, which got it from Xenstore)
  + Checks to ensure that the buffer memory area really IS owned by that VM, in a race-free way (make sure the client can't remove it or itself)
  + If so, sets grant tables so that the server VM and client VM share it
  + Sets kobjects so linpicker-server-app can get info
* linpicker-server-app mmaps the new shared memory of the display buffer

Send/Receive message (using the Xen rings in the "commo" shared memory area; a sketch follows these steps):
* The (userspace) application:
  + sets the "commo" shared-memory area with the data of the new message, using the existing Xen macros. Note: The server MUST NOT BLOCK, and must be cautious, since the untrusted client could make arbitrary modifications to the shared memory.
  + writes to the "event" pseudofile exposed by its kernel module
* The sending linpicker-*-module receives the 'event' write
  + Sends an event to the "other" VM, using a Xen hypercall, alerting the other VM that a new message is available
* The receiving linpicker-*-module receives the event from the other VM
  + Translates it to its "event" pseudofile (which can be waited on)
* The receiving userspace application notices that info is available
  + (It was already waiting on the "event" pseudofile)
  + Examines the shared memory to get the needed data
  + The server must not trust the client. Thus, it MUST copy data before checking and using it (to prevent the client from changing data after it's been checked). It then must check the client's data with great care.
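The following sketch shows the userspace halves of this flow, using the hypothetical lp_commo layout sketched earlier. The __sync_synchronize() builtin stands in for the wmb()/rmb()-style barriers mentioned later in this document; the function names and the "full ring" policy are illustrative, not the real implementation.

    #include <stdint.h>
    #include <unistd.h>

    /* Client side: enqueue a message, then poke the event device. */
    static int client_send(struct lp_commo *commo, int event_fd,
                           const struct lp_msg *m)
    {
        struct lp_ring *r = &commo->c2s;
        uint32_t prod = r->prod;

        if (prod - r->cons >= LP_RING_SLOTS)
            return -1;                              /* ring full; try again later */

        r->msg[prod & (LP_RING_SLOTS - 1)] = *m;
        __sync_synchronize();                       /* message visible before index bump */
        r->prod = prod + 1;

        return write(event_fd, "1", 1) == 1 ? 0 : -1;
    }

    /* Server side: copy the message out BEFORE validating or using it, and
     * mask every index taken from shared memory so a hostile client cannot
     * force an out-of-bounds access. */
    static int server_receive(struct lp_ring *r, struct lp_msg *out)
    {
        uint32_t prod = r->prod;
        uint32_t cons = r->cons;

        if (prod == cons)
            return 0;                               /* nothing pending */
        __sync_synchronize();                       /* see the slot the producer filled */

        *out = r->msg[cons & (LP_RING_SLOTS - 1)];  /* private copy first */
        r->cons = cons + 1;

        if (out->len > sizeof(out->payload))        /* then validate the copy */
            return -1;                              /* malformed; drop it */
        return 1;
    }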
Note: Sending and receiving a message involves several context switches. E.g., from a client VM, a message goes from client userspace, to the client kernel (when the event is signalled), to the server kernel, to server userspace (again signalling the event). This does add effort to service a message, but it is a necessary consequence, because (1) we want to maximize the functions done in userspace and (2) userspace cannot signal events. There has been some discussion about removing limitation #2 in Xen. (Note: this has nothing to do with the Linux CONFIG_KOBJECT_UEVENT option, which enables _Linux_ _kernel_ events, which are different.) The alternative is to do much more in kernelspace, but for both security and reliability reasons this seems to be a worse alternative.

TODO: Describe what happens on VMx shutdown/crash.
TODO: Describe "remove buffer". Maybe we don't need that right now.

The "commo" area includes two ring buffers and their housekeeping information (indexes for producers and consumers):
* One ring buffer is for client-to-server messages; it also includes the reply messages from the server back to the client (when there are replies).
* One ring buffer is for server-to-client messages, which are currently mouse movements and keyboard actions. There is no reply.

There are indexes, but not pointers as such, in the commo area. The server must perform bitwise operations on the indexes it retrieves, to ensure that it will never attempt an out-of-bounds access (due to malicious data values set by a client). Indeed, the "apparent" addresses will be different between client and server.

The "commo" and display-frame memories are shared between the client and server, since this speeds communication, and they are shared at the user level of client and server (to minimize privilege). The userspace programs share this memory using mmap(), a standard POSIX call. This enables userspace programs in two different VMs to share a view of the same region of memory.

There are a number of IPC implementation approaches we considered. Our design requires that the user apps work with kernel code, which in turn works with the kernel code on the other side. Here are the alternatives that were considered:
* sysfs alone. The Linux kernel developers strongly encourage all kernel modules to expose their capabilities through the "sysfs" filesystem, which uses kobjects. Unfortunately, while reading and writing values can be intercepted, sysfs does NOT support controlling mmap() on such values. In addition, Linux kernel developers strongly recommend that only text values be read or written in sysfs; while this is not strictly enforced, exposing binary information this way would likely be rejected.
* /proc. The kernel modules _could_ expose the shared memories in /proc, because /proc _does_ allow the overriding of mmap, but the use of /proc in this way for new modules is STRONGLY discouraged by the Linux kernel maintainers. We want it to be POSSIBLE to submit this code to the Linux kernel maintainers, and this would almost certainly cause it to fail.
* /dev/kmem. Doing an mmap() on /dev/kmem is possible, but many systems do NOT support /dev/kmem. See: http://lwn.net/Articles/147901
* /dev/mem. Doing an mmap() on /dev/mem is very simple to do, enables quick implementation, and has been historically well supported. We have run test code to ensure that it works, and it does. We were squeamish about doing this, because it means that the user processes would have full access to ALL of the underlying memory.
Of course, root-level programs can normally mmap this file anyway, but SELinux (etc.) could normally prevent that, and such protection no longer applies once /dev/mem is mapped. It was believed that doing this would be easy to implement, and that it could later be switched to using special files if needed. Almost immediately after making this decision, it was announced (on 2008-04-30) that a new patch had been accepted into the Linux kernel's development line so that, by default, the Linux kernel will PREVENT memory access using /dev/mem, making this approach harder to use and unlikely to be acceptable to a broad range of users (see: http://lwn.net/Articles/279557/ ). Their rationale was hard to argue against: today, the only primary users appear to be rootkit authors, and backwards compatibility for them is not high on the kernel developers' priority list :-). At first we decided to go forward anyway, but it turned out that the code necessary for the "usual" approach could be rather different. So we switched to using device files.
* Device files. The kernel modules _could_ create special files (block or character type) and then implement mmap() on them. This is not trivial, but it turns out that using the "lower-level" kobject routines directly (instead of using, e.g., device files) is not easy either, and their direct use without larger constructs is discouraged (see "Linux Device Drivers", 3rd edition, page 365). Should "char" or "block" devices be used? Since these are blocks of memory, you'd think they should be block devices, but block devices' primary purpose is to support filesystems inside them, something that makes no sense for this driver. In addition, block devices have a number of complications to enable command queueing and reordering, which are useless for this kind of driver. In contrast, the "char" driver directly supports mmap(). So it turns out that a "char" driver is the best kind of driver for this capability. The default hotplug/udev mechanisms will automatically create the /dev special files once the appropriate driver information is created in the kernel.

Xenstore is used for communicating some information and status. We require that the areas of Xenstore specific to some untrusted VMx cannot be written to by another untrusted VMy. Trusted VMs, like VM0, _can_ write to specific VM-specific areas to signal that the server is ready, but this can be permitted by access controls (so a trusted VM doesn't actually need a broad privilege to "write anywhere in Xenstore"). Note that userspace CAN set and receive watchpoints. More information about Xenstore is here: http://wiki.xensource.com/xenwiki/XenStoreReference

In the Linux kernel drivers, you would think that mmap() should be implemented using remap_pfn_range(). However, remap_pfn_range() has a subtle (non-obvious) limit: it cannot remap ordinary system RAM (see "Linux Device Drivers", 3rd edition, page 430). It turns out that remap_pfn_range() is widely used in drivers, but only because drivers usually don't need to allocate normal "real" memory in a fixed location (as we must). Since we MUST map "real" memory (to share it between VMs), this is fatal. Instead, we must use the "nopage" approach. This is more complicated, especially when mapping multiple pages that correspond to a single memory block. Multiple pages are necessary for supporting display framebuffers, but they really aren't required for the communication ring buffers and their index information, so it is easier to make them separate devices (this also reduces the number of pages that must be allocated together, making allocation failure less likely).
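Here is a sketch of that "nopage"-style approach for one of the char devices. It assumes a kernel recent enough to provide the .fault callback in vm_operations_struct (the 2.6.18-xen tree still uses the older .nopage method, which has a different signature), and commo_area/COMMO_PAGES are hypothetical names for a page-aligned buffer the module allocated with __get_free_pages().

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    #define COMMO_PAGES 1
    static void *commo_area;          /* page-aligned kernel buffer */

    static int commo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
        struct page *page;

        if (vmf->pgoff >= COMMO_PAGES)
            return VM_FAULT_SIGBUS;   /* refuse mappings beyond the buffer */

        page = virt_to_page(commo_area + (vmf->pgoff << PAGE_SHIFT));
        get_page(page);               /* hold a reference while it is mapped */
        vmf->page = page;
        return 0;
    }

    static struct vm_operations_struct commo_vm_ops = {
        .fault = commo_vm_fault,
    };

    static int commo_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        if (vma->vm_end - vma->vm_start > COMMO_PAGES * PAGE_SIZE)
            return -EINVAL;
        vma->vm_ops = &commo_vm_ops;
        return 0;
    }

Unlike remap_pfn_range(), this hands the VM subsystem one struct page at a time, which is what lets it map ordinary kernel-allocated RAM such as the commo page and display buffers.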
TODO: Document the Xenstore keynames.

To implement the Xen rings, it's important to note that in /usr/include:
* asm/page.h defines PAGE_SHIFT and PAGE_SIZE
* asm/system.h defines wmb() {write memory barrier} and related calls

For more information on the specific messages that are sent between the client and server, see nitpicker.idl and fbif.h. There are many details involving kernel calls; see "Linux Device Drivers", 3rd edition, and the documentation that comes with the Linux kernel (in Fedora, install kernel-doc).

Here's more information about Xenbus:
* http://wiki.xensource.com/xenwiki/XenIntro - especially the "adding new devices" section
* http://wiki.xensource.com/xenwiki/XenStoreReference - more about the Xenstore virtual file system
* http://wiki.xensource.com/xenwiki/XenArchitecture - overall pointer to other good info
* "Architecture for Split Drivers Within Xen" http://lists.xensource.com/archives/html/xen-devel/2005-11/txt1HIkY5oGLD.txt - old doc; key points superseded by: http://wiki.xensource.com/xenwiki/XenSplitDrivers

ORIGINAL Xen framebuffer implementation: The original Xen framebuffer is poorly documented; here are some of its highlights.
* /root/xen-unstable/linux-2.6.18-xen.hg/drivers/xen/fbfront/xenfb.c implements the front end. This is a kernel module, running in an untrusted VM. The front end starts up and calls xenfb_connect_backend, which puts lots of info on the Xenbus. (Need to document the relation to Xenstore.)
* /root/xen-unstable/linux-2.6.18-xen.hg/drivers/xen/fbfront/xenkbd.c is a related front-end kernel module, which also runs in an untrusted VM. This implements keyboard AND mouse input (not just keyboard).
* /root/xen-unstable/tools/ioemu/hw/xenfb.c implements a middle-end - an intermediary between the framebuffer front-end and the QEMU back-end. It runs in userspace (note that it includes stdio.h) on a trusted VM (VM0). (QEMU also runs on the trusted VM.) It waits for the front end in xenfb_wait_for_frontend (note: there's also a xenfb_wait_for_backend, which waits for QEMU). In xenfb_map_fb(), it uses xc_map_foreign_pages(), which eventually wends down to xc_map_foreign_batch(), which uses ioctl(xc_handle, IOCTL_PRIVCMD_MMAPBATCH, &ioctlx) to do the inter-VM memory mapping. NOTE: There are two DIFFERENT files named "xenfb.c", which is confusing.