author    Denis Steckelmacher <steckdenis@yahoo.fr>  2011-08-18 17:19:47 +0200
committer Denis Steckelmacher <steckdenis@yahoo.fr>  2011-08-18 17:19:47 +0200
commit    ad1a15f9e70ce12539a807a0da96cafdb2f2e8de (patch)
tree      a885a3c6b52dec6d01f4daa9be5d474c9a53a87e /doc
parent    6ddd30c4540984a703d161c398b7ff6fa6f39165 (diff)
Begin the documentation
This commit adds a Doxyfile and some pages of documentation describing the main parts of Clover.
Diffstat (limited to 'doc')
-rw-r--r--  doc/barrier.dox    105
-rw-r--r--  doc/events.dox      61
-rw-r--r--  doc/llvm.dox       202
-rw-r--r--  doc/logo.png       bin 0 -> 9142 bytes
-rw-r--r--  doc/mainpage.dox    41
5 files changed, 409 insertions, 0 deletions
diff --git a/doc/barrier.dox b/doc/barrier.dox
new file mode 100644
index 0000000..59ab82d
--- /dev/null
+++ b/doc/barrier.dox
@@ -0,0 +1,105 @@
+/**
+ * \page barrier Implementing barriers
+ *
+ * barrier() is an OpenCL C built-in function allowing synchronization between the work-items of a work-group. When a work-item encounters a barrier(), it must wait for all the other work-items of its work-group to also encounter the same barrier() call.
+ *
+ * An example use-case is encrypting a buffer. A work-item would be the operation run on a single byte, and a work-group a block of data that must be processed as a chunk. Each work-item first reads its byte, then encrypts it using, say, the previous and next bytes in the work-group. After that, the result must be written back into the buffer, but if a work-item does so immediately, another one could read what it has just written (because each work-item reads its neighbouring bytes), which would be incorrect.
+ *
+ * So, the work-item first calls barrier(), to let the others read the untouched data before it is overwritten with encrypted data.
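As a sketch, this read-then-synchronize-then-write pattern could look like the following hypothetical OpenCL C kernel. The kernel name, the XOR "cipher" and the memory-fence flag are illustrative only, and note that barrier() only synchronizes the work-items of one work-group:

```c
/* Hypothetical OpenCL C kernel: every work-item reads its neighbours first,
 * then all work-items of the work-group synchronize, and only then does
 * anyone write back, so nobody reads an already-encrypted byte. */
__kernel void xor_neighbours(__global uchar *buf, uint n)
{
    size_t i = get_global_id(0);

    /* Read phase */
    uchar prev = (i > 0)     ? buf[i - 1] : 0;
    uchar next = (i < n - 1) ? buf[i + 1] : 0;
    uchar out  = buf[i] ^ prev ^ next;   /* stand-in for real encryption */

    /* Wait until every work-item of the work-group has read its data */
    barrier(CLK_GLOBAL_MEM_FENCE);

    /* Write phase: now it is safe to overwrite the plaintext */
    buf[i] = out;
}
```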
+ *
+ * \section workgroups How Clover handles work-groups and work-items
+ *
+ * Before explaining barrier() itself, a first note about how Clover handles work-groups and work-items.
+ *
+ * A work-group is a set of work-items. Per the spec, work-groups can be run in parallel, and their work-items can also be run in parallel. This allows massively parallel GPUs to launch kernels efficiently (each GPU core is slower than a CPU core, but there are thousands of them). A CPU has far fewer cores, so it makes no sense to have one thread per work-item: that would require up to hundreds of thousands of threads for kernels running on large amounts of data (for example, converting an image from RGB to sRGB is the same computation for every pixel, so all pixels can be processed in parallel).
+ *
+ * Clover uses another technique: the work-groups are run in parallel, but within each work-group the work-items are run sequentially, one after the other. This keeps Clover fast in most cases, as all the CPU cores are used and no time is lost in thread switches or synchronization primitives. An interesting function here is Coal::CPUKernel::guessWorkGroupSize(). It tries to divide the number of work-items by a number as close as possible to the number of CPU cores, falling back to a single work-group if the number of work-items is prime (or not easily divisible, to avoid ending up with 517583527 work-groups of only one work-item each).
+ *
+ * In short, the work-items are run sequentially in Clover.
+ *
+ * \section problem The problem
+ *
+ * The problem is that barrier() must force the current work-item to wait for the others. But as they are run sequentially, a work-item cannot wait for the others: it must finish before they can even start.
+ *
+ * The solution is to pause the current work-item and to jump into the next one. When the next one also encounters barrier(), it too jumps to the next, and so forth. When the last work-item encounters a barrier(), it jumps back to the first one, which can then continue its execution past its barrier() call.
+ *
+ * \dot
+ * digraph G {
+ * rankdir=LR;
+ * w1 [shape=record,label="<f0>work-item 1|<f1>barrier()|<f2>work-item 1"];
+ * w2 [shape=record,label="<f0>work-item 2|<f1>barrier()|<f2>work-item 2"];
+ * w3 [shape=record,label="<f0>work-item 3|<f1>barrier()|<f2>work-item 3"];
+ * w4 [shape=record,label="<f0>work-item 4|<f1>barrier()|<f2>work-item 4"];
+ *
+ * w1:f1 -> w2:f0 [color=red];
+ * w2:f1 -> w3:f0 [color=red];
+ * w3:f1 -> w4:f0 [color=red];
+ * w4:f1:e -> w1:f2:w [color=red];
+ *
+ * w1:f2 -> w2:f2 [color=blue];
+ * w2:f2 -> w3:f2 [color=blue];
+ * w3:f2 -> w4:f2 [color=blue];
+ * }
+ * \enddot
+ *
+ * This graph shows a special case: when a work-item finishes, the next one resumes where it left off, not at its beginning. This explains why there is a test for barriers in Coal::CPUKernelWorkGroup::run():
+ *
+ * \code
+ * do
+ * {
+ * p_kernel_func_addr(p_args);
+ * } while (!p_had_barrier && // <== here
+ * !incVec(p_work_dim, p_dummy_context.local_id, p_max_local_id));
+ * \endcode
+ *
+ * \section contexts Technical solution
+ *
+ * Now that the problem is solved on paper, a working implementation has to be found. What Clover wants to achieve is suspending a function in the middle of its execution, and then resuming it.
+ *
+ * Some people may know the setjmp() and longjmp() functions. They do nearly what is needed, but they are not safe for this purpose: one can longjmp() from one function to another, but cannot afterwards resume the function that called longjmp().
+ *
+ * Another solution is POSIX contexts, managed by the functions setcontext(), getcontext() and swapcontext(). These are what Clover uses. When a barrier() call is encountered, the current work-item is saved in its context, and then the next is executed (using swapcontext()). This is done in Coal::CPUKernelWorkGroup::barrier().
+ *
+ * \section stack The problem of stacks
+ *
+ * In fact, it's a bit more complex than that. The first reason is that Clover doesn't want to slow down kernels not using barrier() (the majority), so the overhead when this built-in is never called must be as low as possible. Creating a context for each work-item takes time, which is not acceptable.
+ *
+ * Another thing to keep in mind is that a function (and kernels are functions) stores parameters, local variables and temporaries on the stack. If a work-item is halted, its stack mustn't be clobbered by another work-item. So, each work-item must have a separate stack, and these stacks must be created.
+ *
+ * For that, Clover uses mmap(), a function that can allocate large chunks of memory much faster than malloc() (malloc() itself uses mmap() internally, together with a memory pool). Stacks are in fact "large", currently 8 KB (a future version of Clover may run an analysis pass to compute the maximum amount of data alloca()'ed by the kernel and its sub-functions), and each work-item must have its own.
+ *
+ * For kernels designed for barrier(), that is to say with a known (and usually low) number of work-items per work-group, there is no problem. Even 512 work-items take only 4 MB of stack, a single huge page on x86, and a negligible amount compared to the RAM found in modern computers.
+ *
+ * The problem is with kernels not designed for barrier(). These use larger work-groups, or even let Clover decide how to split the work-items into work-groups (using Coal::CPUKernel::guessWorkGroupSize()). For a 1024x1024 image, with one work-item per pixel, and a 4-core CPU, Clover will create work-groups of 262144 work-items! If each of them must have its own 8 KB stack, that means a memory usage of 2 GB!
+ *
+ * So, Clover cannot use independent contexts when barrier() is never called. This is achieved by using a tiny dummy context at the beginning of Coal::CPUKernelWorkGroup::run(), holding only the data needed by built-in functions like get_global_id(), that is to say the work-item index. Then, a loop executes the work-items sequentially, incrementing the work-item index of the dummy context at each iteration. This makes the whole thing work when no barrier() calls are issued, and without any slowdown (maybe about ten CPU cycles, but JITing the caller functions takes far more time than that).
+ *
+ * \section implementation Implementation
+ *
+ * How is barrier() implemented, then? Simple: when the first call to barrier() is made, Coal::CPUKernelWorkGroup::barrier() begins by allocating memory for the stacks (and stores its pointer in a TLS memory location). This function is also able to reuse the memory allocated by a previous work-group run on the same thread, if it is large enough to hold the stacks. This way, it's a bit faster.
+ *
+ * A note about this memory: it contains not only the stacks, but also a Coal::CPUKernelWorkGroup::Context structure per work-item. This memory is accessed through Coal::CPUKernelWorkGroup::getContextAddr(), which is given an index and returns a pointer to a Coal::CPUKernelWorkGroup::Context. The stack of this context is located just after the Context structure.
+ *
+ * Once the memory is allocated, Coal::CPUKernelWorkGroup::barrier() can proceed normally. It first takes the next context and checks whether it is initialized (that is to say, whether its work-item has already begun and is currently halted somewhere in its execution). This check is easy, as mmap() zero-initializes the memory. If the work-item isn't yet initialized, its context is created and set to point to the kernel function, with the correct \c args.
+ *
+ * After the context creation, all that is needed is to swap the contexts, that is to say to save the current context in its memory location, and to jump to the next one. If this is the first time barrier() is called, the current context is simply the "main context" of the thread, and it gets saved like any other context: barrier() works even though a dummy context was in use.
+ *
+ * With the contexts swapped, execution can begin or resume in the next work-item.
+ *
+ * \section finishing End of execution
+ *
+ * One thing remains to be done, as pointed out at the end of \ref problem: when a barrier() has been encountered and a work-item finishes, we cannot simply launch the next one, as it has already begun earlier. So, the case of a barrier() having been called must be handled separately. If a barrier() was called, Coal::CPUKernelWorkGroup::run() doesn't launch the next work-item when the first one finishes, but goes directly into another loop.
+ *
+ * This loop simply calls swapcontext() for each remaining work-item, letting each of them terminate. An interesting property of contexts is used here: when a context finishes its execution, control returns to the context linked to it, the one set here in Coal::CPUKernelWorkGroup::barrier():
+ *
+ * \code
+ * Context *main = getContextAddr(0); // The context not created with makecontext
+ *
+ * if (next->initialized == 0)
+ * {
+ * // [...]
+ * next->context.uc_link = &main->context; // <== here
+ * \endcode
+ *
+ * That means that when a work-item finishes, the execution flow returns to Coal::CPUKernelWorkGroup::run() where it left off, that is to say at the swapcontext() call. This allows Clover to simply terminate all the work-items.
+ */ \ No newline at end of file
diff --git a/doc/events.dox b/doc/events.dox
new file mode 100644
index 0000000..0e6c879
--- /dev/null
+++ b/doc/events.dox
@@ -0,0 +1,61 @@
+/**
+ * \page events Command Queues, Events and Worker Threads
+ *
+ * Command queues are the core of OpenCL. It's by using them that client applications can ask the OpenCL implementation to actually do things. For example, a function call can be used to create a memory object, but a command queue has to be used in order to write into this object.
+ *
+ * \section overview Overview
+ *
+ * A command queue is an object storing a list of events. In Clover, Coal::CommandQueue stores a list of Coal::Event objects.
+ *
+ * An event is an action that can be performed. The Coal::Event documentation page contains a big inheritance graph, so you can see that there are many events inheriting each other. An event is for instance a write to a memory object, a copy between an image and a buffer, or the execution of an OpenCL kernel.
+ *
+ * There are also events that are called "dummy" in Clover: they do nothing themselves and serve purely as information for the command queue. They are Coal::BarrierEvent, Coal::UserEvent, Coal::WaitForEventsEvent and Coal::MarkerEvent.
+ *
+ * \section queuing Queuing events
+ *
+ * Queuing an event means adding it to a command queue. A queued event will be executed by the command queue when certain conditions are met. A client application queues events by calling the \ref src/api/api_enqueue.cpp "clEnqueueXXX()" functions, for example clEnqueueCopyImageToBuffer(). These functions create an object inheriting Coal::Event and call Coal::CommandQueue::queueEvent().
+ *
+ * In Clover, a Coal::Event object doesn't do anything besides checking its arguments. The work is all done in a worker thread, discussed later on this page.
+ *
+ * \section ordering Ordering the events
+ *
+ * Events are meant to be executed on a device. Clover uses two event queues in order to do that, as there may be several Coal::CommandQueue objects for a single Coal::DeviceInterface. That means that Coal::CommandQueue has to keep a list of its events, and that Coal::DeviceInterface must do the same, with a separate list.
+ *
+ * In order to have events executed, the command queue "pushes" events on the device using Coal::DeviceInterface::pushEvent().
+ *
+ * \dot
+ * digraph G {
+ * client [label="Client application"];
+ * queue1 [label="CommandQueue 1", URL="\ref Coal::CommandQueue"];
+ * queue2 [label="CommandQueue 2", URL="\ref Coal::CommandQueue"];
+ * device [label="Device", URL="\ref Coal::DeviceInterface"];
+ * worker [label="Worker thread", URL="\ref worker()"];
+ *
+ * client -> queue1 [label="clEnqueueReadImage()", URL="\ref clEnqueueReadImage()"];
+ * client -> queue2 [label="clEnqueueNDRangeKernel()", URL="\ref clEnqueueNDRangeKernel()"];
+ * queue1 -> device [label="pushEvent()", URL="\ref Coal::DeviceInterface::pushEvent()"];
+ * queue2 -> device [label="pushEvent()", URL="\ref Coal::DeviceInterface::pushEvent()"];
+ * device -> worker [label="getEvent()", URL="\ref Coal::CPUDevice::getEvent()"];
+ * }
+ * \enddot
+ *
+ * The semantics are also different: on the device, the events are unordered. That means that there is no guarantee that an event stored after another in the device's event list will actually be executed after the previous one. This allows worker threads to pick up events without having to check their order: if an event is available, take it and run it.
+ *
+ * On a Coal::CommandQueue object, events are ordered. If the queue has the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property disabled, it's simple: the queue waits for an event to complete before pushing the next to the device. When this property is enabled, more complex heuristics are used. They are explained in the Coal::CommandQueue::pushEventsOnDevice() function documentation.
+ *
+ * Roughly, every time an event completes, the Coal::CommandQueue explores its list of events and pushes all those that meet certain conditions. For example, all the events they are waiting on must be completed (see Coal::Event::waitEvents()), they must not already be pushed, and they must come before any Coal::BarrierEvent.
+ *
+ * \section worker Worker threads
+ *
+ * This section is specific to Coal::CPUDevice. The only thing a device has to do is to re-implement Coal::DeviceInterface::pushEvent(). A hardware-accelerated device can then simply push the events to a ring buffer or a similar structure.
+ *
+ * Coal::CPUDevice has to do the ordering and dispatching between CPU cores itself. When a Coal::CPUDevice is first instantiated, it creates in Coal::CPUDevice::init() one "worker thread" per CPU core detected on the host system.
+ *
+ * These worker threads are a simple loop polling for events to execute. The loop is in the worker() function. At each iteration, Coal::CPUDevice::getEvent() is called. This function blocks until an event is available in the CPU event list, then returns the first one.
+ *
+ * For the vast majority of events, once they are returned, they are removed from the event list. This ensures that an event gets executed only once, by one worker thread. Coal::KernelEvent objects are different, as a kernel is executed in chunks of work-items called "work-groups". Each work-group can be executed in parallel with the others, so for these events, the event is not removed from the event list until all the work-groups are executed.
+ *
+ * In fact, Coal::CPUKernelEvent::reserve() is called. This function locks a mutex in the Coal::CPUKernelEvent object and returns whether the worker thread is about to run the last work-group of the kernel. If so, the event is removed from the event list. If not, it is kept, and other worker threads will be able to run the remaining work-groups. Once its work-group is reserved, the worker thread calls Coal::CPUKernelEvent::takeInstance(). This function unlocks the mutex, allowing the other worker threads to reserve other work-groups, and returns a Coal::CPUKernelWorkGroup object. These objects are described at the end of \ref llvm.
+ *
+ * As said above, the Coal::Event objects don't do anything by themselves. They are simple device-independent pieces of information, with an optional "device data" field (Coal::Event::deviceData()). The actual work is done in worker(), in a big switch statement.
+ */ \ No newline at end of file
diff --git a/doc/llvm.dox b/doc/llvm.dox
new file mode 100644
index 0000000..2525e6c
--- /dev/null
+++ b/doc/llvm.dox
@@ -0,0 +1,202 @@
+/**
+ * \page llvm Using Clang and LLVM to Launch Kernels
+ *
+ * Unlike OpenGL with its GLSL language, OpenCL uses a subset of C that can be implemented with relatively little new compiler work.
+ *
+ * It is known that at least Apple and nVidia use Clang and LLVM to compile OpenCL programs. Furthermore, Intel recently released an SDK advertised as using LLVM (with the possibility to also use an in-house JIT).
+ *
+ * This widespread usage of Clang is very good for Clover, as Clang already supports the vast majority of what is needed for a good OpenCL compiler. LLVM is very efficient at handling vectors (it is not without reason that Apple and Mesa use it for their software rendering paths), and its API is very well designed. On average, it takes only one or two days to implement or modify something using LLVM in Clover.
+ *
+ * This page explains how Clang and LLVM are used in Clover to compile, manage and launch OpenCL kernels.
+ *
+ * \section compilation Compiling OpenCL C to LLVM IR
+ *
+ * The first step when one wants to launch a program is to compile it. It is done API-wise by the clCreateProgramWithSource() and clBuildProgram() functions.
+ *
+ * The first function creates a Coal::Program object, using Coal::Program::loadSources(). You can see at the top of this function a line prepending the OpenCL C standard header to the source being built. The function mainly concatenates the strings it is given (they may be null-terminated or not).
+ *
+ * Once the Coal::Program object holds the source, clBuildProgram() can be used to compile it. It does so by invoking Coal::Program::build().
+ *
+ * This big function compiles and links the program, so it will be explained later. The actual compilation job is done by Coal::Compiler, in Coal::Compiler::compile(), which then keeps the compilation log and options at hand for future use.
+ *
+ * When a program is compiled, the client application can retrieve it by using clGetProgramInfo().
+ *
+ * \section linking Linking the program
+ *
+ * The compilation step produces an "unlinked" module, which needs to be linked with the OpenCL C standard library, but only if the device for which the program is being built requires that. It's also possible to load a previously-compiled binary into a Coal::Program using Coal::Program::loadBinaries(); doing so also yields an unlinked binary.
+ *
+ * The separation between the unlinked binary and the linked one is the reason for the existence of Coal::Program::DeviceDependent::unlinked_binary. The source is compiled to LLVM IR in a module (temporarily stored in linked_module, though it isn't linked yet), which is dumped to unlinked_binary and then linked to form a fully executable program.
+ *
+ * So, Coal::Program::build() runs its code for every device for which the program must be built. These devices are either given to Coal::Program::loadBinaries(), or passed as arguments to Coal::Program::build().
+ *
+ * The first step is to see if the program was loaded with sources. If it's the case, they have to be compiled (see \ref compilation).
+ *
+ * Then, if the device for which the program is being built asks for it (through Coal::DeviceProgram::linkStdLib(); Coal::CPUDevice does), the program is linked with Clover's OpenCL C standard library. A hardware-accelerated device will normally not want stdlib linked in, as it's easier to convert LLVM IR to hardware-specific instructions when OpenCL built-in functions are left in the form "call foo" instead of being inlined as inefficient CPU-centric code.
+ *
+ * After this linking pass, optimization passes are created. The first ones are created by Coal::Program itself. They remove all the functions that are not kernels and are not called by a kernel, which allows LLVM to discard all the unused stdlib functions.
+ *
+ * Then, the device is allowed to add more optimization or analysis passes. Coal::CPUProgram::createOptimizationPasses() adds standard link-time optimizations, but hardware-accelerated devices could add autovectorizing, lowering, or analysis passes.
+ *
+ * Finally, Coal::DeviceProgram::build() is called. It's a no-op for Coal::CPUDevice, as this device uses the module directly with the LLVM JIT, but hardware devices could use this function to actually compile the program for the target device (an LLVM-to-TGSI transformation, for example).
+ *
+ * The program is now built and ready to use!
+ *
+ * \section kernels Finding kernels
+ *
+ * Now that the program is built, it's time to get its kernels. The functions declared as kernels in OpenCL C (with the \c __kernel attribute) are registered by Clang in the <em>!opencl.kernels</em> metadata. They are read from the LLVM module by Coal::Program::kernelFunctions(). Note that this function is device-dependent, as it has to use the LLVM IR generated for the specified device.
+ *
+ * When the kernels are found, Coal::Kernel objects can be instantiated. These objects are again device-independent, as required by the OpenCL spec. A Coal::Kernel object is mainly a name plus a list of device-specific information, for instance the llvm::Function object that will be called.
+ *
+ * Once the Coal::Kernel object is created, Coal::Kernel::addFunction() is called for every device for which the Coal::Program is built. This function has the responsibility to explore the arguments of the function and to create a list of device-independent Coal::Kernel::Arg objects (kernel arguments). For instance, when it sees an argument of type <4 x i32>, it converts it to a Coal::Kernel::Arg of kind Coal::Kernel::Arg::Int32 and vector dimension 4.
+ *
+ * \section arguments Setting kernel arguments
+ *
+ * The Coal::Kernel::Arg objects are interesting. They are an abstraction layer between the host CPU and the device. They also enable Coal::Kernel to implement its Coal::Kernel::setArg() function, that performs checks on the value given as argument.
+ *
+ * This class also contains semantic information specific to OpenCL. For instance, in LLVM, the address space qualifiers like __global or __local are represented as address spaces (0 = private, etc). Coal::Kernel::addFunction() translates these address spaces into Coal::Kernel::Arg::File values.
+ *
+ * When the user calls clSetKernelArg(), the execution flow arrives at Coal::Kernel::setArg(). This function stores the given value in the corresponding Coal::Kernel::Arg object and performs some checks. It is also its responsibility to recognize Coal::Sampler objects.
+ *
+ * Samplers are a bit special: they are pointers to Coal::Sampler objects on the host CPU, but plain uint32_t values on the kernel side. This makes their translation from LLVM type to Coal::Kernel::Kind a bit difficult, as Clover sees only an LLVM \c i32 type, both for a sampler and for a normal \c uint32.
+ *
+ * The trick used in Clover is to keep in memory a list of the known samplers. When a Coal::Sampler object is created, it is registered in this list; when it is deleted, it is removed from the list. This is implemented in Coal::Object and shared between all the Coal classes. It allows the implementation of functions like Coal::Object::isa(), very useful to check that arguments given by the user are sane.
+ *
+ * So, Coal::Object::isa() is used to recognize when an argument passed to Coal::Kernel::setArg() is in fact a sampler. When it is the case, the pointer to Coal::Sampler is replaced by the sampler's "bitfield" representation, using Coal::Sampler::bitfield().
+ *
+ * \code
+ * // Special case for samplers (pointers in C++, uint32 in OpenCL).
+ * if (size == sizeof(cl_sampler) && arg_size == 4 &&
+ * (*(Object **)value)->isA(T_Sampler))
+ * {
+ * unsigned int bitfield = (*(Sampler **)value)->bitfield();
+ *
+ * arg.refineKind(Arg::Sampler);
+ * arg.alloc();
+ * arg.loadData(&bitfield);
+ *
+ * return CL_SUCCESS;
+ * }
+ * \endcode
+ *
+ * This trick is described in more detail at the end of this blog post: http://steckdenis.wordpress.com/2011/08/07/when-easy-is-difficult-and-vice-versa/
+ *
+ * \section event Queuing an event
+ *
+ * Once the Coal::Kernel object is created and its args set, the client application can call clEnqueueTask() or clEnqueueNDRangeKernel(). These functions create a Coal::KernelEvent object responsible for telling the device to execute the kernel.
+ *
+ * When the event arrives on the device (see \ref events), Coal::CPUDevice initializes the LLVM JIT (Coal::CPUProgram::initJIT()) and then runs the following in src/core/cpu/worker.cpp:
+ *
+ * \code
+ * KernelEvent *e = (KernelEvent *)event;
+ * CPUKernelEvent *ke = (CPUKernelEvent *)e->deviceData();
+ *
+ * // Take an instance
+ * CPUKernelWorkGroup *instance = ke->takeInstance();
+ * ke = 0; // Unlocked, don't use anymore
+ *
+ * if (!instance->run())
+ * errcode = CL_INVALID_PROGRAM_EXECUTABLE;
+ *
+ * delete instance;
+ * \endcode
+ *
+ * The first step is to use Coal::Event::deviceData() to get a Coal::CPUKernelEvent object. See Coal::Event::setDeviceData() and Coal::DeviceInterface::initEventDeviceData().
+ *
+ * This Coal::CPUKernelEvent holds information about the event needed by Coal::CPUDevice.
+ *
+ * \section workgroups Running the work groups
+ *
+ * The next line is interesting: Coal::CPUKernelEvent::takeInstance() is called. This function works in tandem with Coal::CPUKernelEvent::reserve(), called from Coal::CPUDevice::getEvent().
+ *
+ * A kernel is run as multiple "work-groups", that is to say batches of work-items. The worker threads (see \ref events) take work-groups one at a time, so multiple work-groups of a single kernel can run concurrently on a multicore CPU.
+ *
+ * \ref events gives more details about that, but the main principle is that there is a list of events a worker thread can execute. For Coal::KernelEvent, a worker thread calls Coal::CPUKernelEvent::reserve() to see if there is a work-group available for execution (that is to say if the work groups aren't yet all executed). If there is one available, a mutex is locked and the function returns. Then, the worker thread calls Coal::CPUKernelEvent::takeInstance() to actually get the work-group, and runs it through Coal::CPUKernelWorkGroup::run().
+ *
+ * \section args Passing arguments to the kernel
+ *
+ * Once the work-group is taken, it is run and must call the kernel function (using the JIT) for every work-item. This is done very simply by getting a function pointer to the kernel using llvm::ExecutionEngine::getPointerToFunction(). This function must now be called with the needed arguments.
+ *
+ * The difficult part is that C++ doesn't allow calling a function with an argument list built at run time. A function can receive a variable number of arguments, using <tt>void foo(int argc, ...)</tt>, but an arbitrary list of arguments cannot be passed at the call site, as in <tt>foo(va_build(std_vector));</tt>: the arguments must be known at compile time.
+ *
+ * The solution used by LLVM is a function like <tt>llvm::JIT::runFunction(function, vector of args)</tt>. This function internally creates a "stub" function taking zero arguments but calling the target function with its arguments passed as constants. That is to say, when we want to call <tt>bar(3, 4);</tt>, a stub is compiled like this:
+ *
+ * \code
+ * void stub() {
+ * bar(3, 4);
+ * }
+ * \endcode
+ *
+ * This stub is then JITed and run directly:
+ *
+ * \code
+ * void (*stub)() = getPointerToFunction(stub_function);
+ * stub();
+ * \endcode
+ *
+ * LLVM then destroys the stub. This is a waste of time as a stub is slow to generate and JIT compile.
+ *
+ * Clover also uses stubs, rather than libffi (a Foreign Function Interface library). Libffi allows calling a function with arguments known only at run time, but it is too slow (the argument list is rebuilt for every call) and doesn't fully support vectors (it supports XMM registers, but slowly).
+ *
+ * The solution retained was to mimic the way LLVM builds its stubs. As a kernel can be run multiple times with different arguments (when the application explicitly does so, or when there are \c __local pointers needing to be reallocated between work-groups), the stub function cannot simply use constants: rebuilding it for each set of arguments would be too slow.
+ *
+ * So, Clover's stub takes exactly one parameter: a \c void* pointer containing the actual arguments, carefully aligned by Coal::CPUKernel::typeOffset(). The stub itself is built by Coal::CPUKernel::callFunction() and looks like this:
+ *
+ * \code
+ * void stub(void *args) {
+ * // We know the args the kernel takes and their types.
+ * kernel(
+ * *(int *)args, // For an int argument
+ * *(float **)((char *)args + 8), // For a float* argument, after the int in args, and aligned to sizeof(void*)
+ * *(sampler_t *)((char *)args + 16));
+ * }
+ * \endcode
+ *
+ * Each argument is simply built in LLVM IR like this:
+ *
+ * \code
+ * param = load(bitcast(getelementptr(args, offset_in_args) to param_type*))
+ * \endcode
+ *
+ * \section tls Built-ins and Thread Local Storage
+ *
+ * The OpenCL C language provides built-ins that can be called from kernels. For most of them, there is no problem: they can either be implemented as LLVM instructions and then compiled for the CPU, or the standard library (src/core/runtime/stdlib.c) provides an implementation.
+ *
+ * But there are cases where information from outside the kernel is needed. For example, the get_work_dim() built-in takes no argument, but has to return a value dependent on the current Coal::KernelEvent being run.
+ *
+ * In order to handle that, a call is made from the kernel into the Clover library. This is made possible by a very handy LLVM function: llvm::ExecutionEngine::InstallLazyFunctionCreator(), called by Coal::CPUProgram::initJIT(). It allows Clover to register a function that resolves function names to function addresses. This way, a function called "get_work_dim" in the kernel is passed to this function creator, which returns a pointer to get_work_dim() in src/core/cpu/builtins.cpp.
+ *
+ * \code
+ * void *getBuiltin(const std::string &name)
+ * {
+ * if (name == "get_global_id")
+ * return (void *)&get_global_id;
+ * else if (name == "get_work_dim")
+ * return (void *)&get_work_dim;
+ *
+ * return (void *)&unimplemented_stub;
+ * }
+ * \endcode
+ *
+ * That works, but one problem remains: get_work_dim() doesn't take any arguments, yet has to return information about the currently-running kernel. Here, the internal structure of Coal::CPUDevice has to be taken into account: the device creates one worker thread per CPU core, and each of these worker threads runs only one work-group at a time, but multiple worker threads can run different kernels and work-groups concurrently.
+ *
+ * So, the chosen solution is a thread-local variable. Such a variable is like a global variable (shared among all the classes and functions of a project), but its value is private to the currently-running thread. As a thread always handles only one work-group at a time, a TLS variable is exactly what is needed, and it is what Clover uses. It's named \ref g_work_group.
+ *
+ * One of these built-ins is particularly interesting, see \ref barrier.
+ *
+ * \section call The call
+ *
+ * Finally, the work-items can be called in sequence. The stub and its kernel function are JITed only once, so this loop is fast:
+ *
+ * \code
+ * do
+ * {
+ * // Simply call the "call function", it and the builtins will do the rest
+ * p_kernel_func_addr(p_args);
+ * } while (!p_had_barrier &&
+ * !incVec(p_work_dim, p_dummy_context.local_id, p_max_local_id));
+ * \endcode
+ *
+ * This code can be found in Coal::CPUKernelWorkGroup::run(). The incVec() call is there to handle the 3D global and local IDs. It returns true when the vector we are incrementing reaches \c p_max_local_id.
+ *
+ * More explanation of this part can be found on the \ref barrier page.
+ */
diff --git a/doc/logo.png b/doc/logo.png
new file mode 100644
index 0000000..6ee46fd
--- /dev/null
+++ b/doc/logo.png
Binary files differ
diff --git a/doc/mainpage.dox b/doc/mainpage.dox
new file mode 100644
index 0000000..cafa603
--- /dev/null
+++ b/doc/mainpage.dox
@@ -0,0 +1,41 @@
+/**
+ * \mainpage OpenCL Software Implementation
+ *
+ * Clover is a project started in June 2011 as a Google Summer of Code project. Its goal is to provide an Open Source OpenCL implementation usable by everyone wanting to use or develop for OpenCL without having to use proprietary drivers or SDKs (like Intel's).
+ *
+ * Clover currently only supports running OpenCL programs in software, on the host CPU, but an interface is there to allow future hardware-based execution engines.
+ *
+ * This documentation is meant to be \b developer \b focused. The public OpenCL API is thoroughly documented by the Khronos Group at http://www.khronos.org/registry/cl/ . What is documented here is the internal architecture of Clover.
+ *
+ * \section discover Discovering Clover
+ *
+ * Clover is a fairly large project, and this section explains how to get started with the code.
+ *
+ * \subsection components Components
+ *
+ * Clover is split in several components:
+ *
+ * - \b core The core of OpenCL, implemented in C++ and containing all the logic used by an OpenCL implementation.
+ * - \b api The public OpenCL API. It's a binding exposing Clover's C++ classes to C applications.
+ * - \b runtime The standard library available to OpenCL programs. An interesting and advanced page speaks about \subpage barrier.
+ *
+ * The following components are also present but currently not documented:
+ *
+ * - \b tests Check-based test-cases
+ * - \b examples Currently a single small application exercising the clGetXXXInfo functions to display information about Clover (it can also be used with any compliant OpenCL implementation)
+ *
+ * \subsection flow Flow of execution
+ *
+ * A source tree is often best discovered in an order close to the "execution flow" of an application using it. For Clover, this means it's easier to explore the functions by reading them roughly in the order they get called by a client application.
+ *
+ * Here is the recommended order in which to read the functions:
+ *
+ * - A small self-contained API is used to get information about the "Platform", in src/api/api_platform.cpp
+ * - Then, the platform is used to query the devices, in src/api/api_devices.cpp . This is the first API file that uses C++ classes. The hardware abstraction layer used is Coal::DeviceInterface. Take a look at the Coal::CPUDevice class, implementing this interface, in src/core/cpu/device.cpp .
+ * - The device is then used to create a Coal::Context, in src/api/api_context.cpp then src/core/context.cpp .
+ * - When a context is built, one can use it to do many things. For example, it's possible to create buffer objects. Take a look at clCreateBuffer(), and the Coal::Buffer and Coal::CPUBuffer classes.
+ * - A big part of Clover is dedicated to command queues, implemented in the Coal::CommandQueue class. Its special architecture is documented in detail in the \subpage events page.
+ * - There is also the Coal::Program class, that compiles OpenCL C code into LLVM IR using Coal::Compiler (a simple wrapper around Clang).
+ * - A program contains one or more kernels. Take a look at Coal::Kernel for how they are handled.
+ * - Coal::Program and Coal::Kernel are device-independent classes. They use device-specific classes like Coal::CPUProgram and Coal::CPUKernel. The former translates LLVM IR instructions into machine code using LLVM's JIT; the latter is a bit more complex and is described in the \subpage llvm page.
+ */