From 4de220c56f5812f42cc936b805ba94fa67dcbfe7 Mon Sep 17 00:00:00 2001 From: Denis Steckelmacher Date: Thu, 18 Aug 2011 19:43:20 +0200 Subject: Add \c in front of functions. --- doc/barrier.dox | 42 +++++++++++++++++----------------- doc/events.dox | 30 ++++++++++++------------- doc/llvm.dox | 68 ++++++++++++++++++++++++++++---------------------------- doc/mainpage.dox | 14 ++++++------ 4 files changed, 77 insertions(+), 77 deletions(-) diff --git a/doc/barrier.dox b/doc/barrier.dox index 59ab82d..73f3ca7 100644 --- a/doc/barrier.dox +++ b/doc/barrier.dox @@ -1,27 +1,27 @@ /** * \page barrier Implementing barriers * - * barrier() is an OpenCL C built-in function allowing synchronization between the work-items of a work-group. When a work-item encounters a barrier(), it must wait for all the other work-items of its work-group to also encounter the same barrier() call. + * \c barrier() is an OpenCL C built-in function allowing synchronization between the work-items of a work-group. When a work-item encounters a \c barrier(), it must wait for all the other work-items of its work-group to also encounter the same \c barrier() call. * * An example of use-case could be encrypting a buffer. A work-item would be the operation run on a single byte, and work-groups blocks of data that must be processed in chunks. Each work-item will first read its byte, then encrypt it using maybe the next and previous bytes in the work-group. After that, the data must be written back into the buffer, but if the work-item does so, another one could try to read what it have just written (because the work-item reads the next and previous bytes), and that will not be correct. * - * So, the work-item first calls barrier(), to let the others read their sane data before overwriting it with encrypted data. + * So, the work-item first calls \c barrier(), to let the others read their sane data before overwriting it with encrypted data. * * \section workgroups How Clover handles work-groups and work-items * - * Before beginning with the explanation of barrier(), a first note about how Clover handles the work-groups and work-items. + * Before beginning with the explanation of \c barrier(), a first note about how Clover handles the work-groups and work-items. * * A work-group is a set of work-items. In the spec, work-groups can be run in parallel, and their work-items can also be run in parallel. This allows massively parallel GPUs to launch kernels efficiently (they are slower than a CPU but made of thousands of cores). A CPU isn't very parallel, so it makes no sense to have one thread per work-item, it would require up to hundreds of thousands of threads for kernels running on a huge amount of data (for example converting an image from RGB to sRGB, it's the same computation for every pixel, so each pixel can be run in parallel). * - * Clover uses another technique: each work-group is run in parallel, but the work-items are run sequentially, one after the other. This allows Clover to be pretty fast for most of the cases, as all the CPU cores are used and no time is lost in thread switch and synchronization primitives. An interesting function here is Coal::CPUKernel::guessWorkGroupSize(). It tries to divide the number of work-items by a number the closest possible to the amount of CPU cores, falling back to one work-group if the number of work-items is prime (or not easily divisible, to avoid having 517583527 work-groups of only one work-item). + * Clover uses another technique: each work-group is run in parallel, but the work-items are run sequentially, one after the other. This allows Clover to be pretty fast for most of the cases, as all the CPU cores are used and no time is lost in thread switch and synchronization primitives. An interesting function here is \c Coal::CPUKernel::guessWorkGroupSize(). It tries to divide the number of work-items by a number the closest possible to the amount of CPU cores, falling back to one work-group if the number of work-items is prime (or not easily divisible, to avoid having 517583527 work-groups of only one work-item). * * In short, the work-items are run sequentially in Clover. * * \section problem The problem * - * The problem is that barrier() must oblige the current work-item to wait for the others. But as they are run sequentially, a work-item cannot wait for the others as it must have finished for them to be run. + * The problem is that \c barrier() must oblige the current work-item to wait for the others. But as they are run sequentially, a work-item cannot wait for the others as it must have finished for them to be run. * - * The solution is to pause the current work-item and to jump into the next. When the next also encounters barrier(), it also jumps to the next, and so forth. When the last work-item encounters a barrier(), it jumps back to the first one, that can continue its execution past its barrier() call. + * The solution is to pause the current work-item and to jump into the next. When the next also encounters \c barrier(), it also jumps to the next, and so forth. When the last work-item encounters a \c barrier(), it jumps back to the first one, that can continue its execution past its \c barrier() call. * * \dot * digraph G { @@ -42,7 +42,7 @@ * } * \enddot * - * This graphs shows a special case: when an item finishes, the next resumes where it left, not at its beginning. This explain why there is a test for barriers in Coal::CPUKernelWorkGroup::run() : + * This graphs shows a special case: when an item finishes, the next resumes where it left, not at its beginning. This explain why there is a test for barriers in \c Coal::CPUKernelWorkGroup::run() : * * \code * do @@ -56,41 +56,41 @@ * * Now that the problem is solved on paper, a working solution has to be found. What Clover wants to achieve is stopping a function in the middle of it, and then resuming it. * - * Some people may know the setjmp() and longjmp() functions. They do nearly what is needed but are not considered secure to use for resuming a function (that means that we can longjmp() from a function to another, but we cannot then resume the function that called longjmp()). + * Some people may know the \c setjmp() and \c longjmp() functions. They do nearly what is needed but are not considered secure to use for resuming a function (that means that we can \c longjmp() from a function to another, but we cannot then resume the function that called \c longjmp()). * - * Another solution is POSIX contexts, managed by the functions setcontext(), getcontext() and swapcontext(). These are what Clover uses. When a barrier() call is encountered, the current work-item is saved in its context, and then the next is executed (using swapcontext()). This is done in Coal::CPUKernelWorkGroup::barrier(). + * Another solution is POSIX contexts, managed by the functions \c setcontext(), \c getcontext() and \c swapcontext(). These are what Clover uses. When a \c barrier() call is encountered, the current work-item is saved in its context, and then the next is executed (using \c swapcontext()). This is done in \c Coal::CPUKernelWorkGroup::barrier(). * * \section stack The problem of stacks * - * In fact, it's a bit more complex than that. The first reason is that Clover doesn't want to slow down kernels not using barrier() (the majority), so the barrier() overhead in the case this built-in is never called must be as low as possible. Creating a context for each work-item takes time, it's not good. + * In fact, it's a bit more complex than that. The first reason is that Clover doesn't want to slow down kernels not using \c barrier() (the majority), so the \c barrier() overhead in the case this built-in is never called must be as low as possible. Creating a context for each work-item takes time, it's not good. * * Another thing to keep in mind is that a function (and kernels are functions) stores parameters, local variables and temporaries on the stack. If a work-item is halted, its stack mustn't be clobbered by another work-item. So, each work-item must have a separate stack, and these stacks must be created. * - * Clovers uses for that mmap(), a function that can be used to allocate large chunks of data, way faster than malloc() (malloc() uses mmap() internally, with also a memory pool). Stacks are in fact "large", currently 8 KB (but future version of OpenCL will run an analysis pass to count the maximum amount of data alloca'ed() by the kernel and its sub-functions), and each work-item must have its one. + * Clovers uses for that \c mmap(), a function that can be used to allocate large chunks of data, way faster than \c malloc() \c (malloc() uses \c mmap() internally, with also a memory pool). Stacks are in fact "large", currently 8 KB (but future version of OpenCL will run an analysis pass to count the maximum amount of data \c alloca'ed() by the kernel and its sub-functions), and each work-item must have its one. * - * For kernels designed for barrier(), that is to say with a known number of work-items per work-group (usually low), there is no problem. Even 512 work-items take only 4 MB of stack, a single Huge Page on x86, and nothing with regard to the amount of RAM currently found on modern computers. + * For kernels designed for \c barrier(), that is to say with a known number of work-items per work-group (usually low), there is no problem. Even 512 work-items take only 4 MB of stack, a single Huge Page on x86, and nothing with regard to the amount of RAM currently found on modern computers. * - * But the problem is for kernels not designed for barrier(). These ones use higher work-groups, or even let Clover decide how to split the work-items into work-groups (using Coal::CPUKernel::guessWorkGroupSize()). For a 1024x1024 image, with one work-item per pixel, and a 4-core CPU, Clover will create work-groups of 262144 work-items ! If each of them must have its own 8KB stack, that means a memory usage of 2 GB ! + * But the problem is for kernels not designed for \c barrier(). These ones use higher work-groups, or even let Clover decide how to split the work-items into work-groups (using \c Coal::CPUKernel::guessWorkGroupSize()). For a 1024x1024 image, with one work-item per pixel, and a 4-core CPU, Clover will create work-groups of 262144 work-items ! If each of them must have its own 8KB stack, that means a memory usage of 2 GB ! * - * So, Clover cannot use independent work-items when barrier() is never called. This is achieved by using a tiny dummy context at the beginning of Coal::CPUKernelWorkGroup::run(), holding only the data needed by the built-in functions like get_global_id(), that is to say the work-item index. Then, a while() loop is used to execute the work-items sequentially, incrementing the work-item index of the dummy context at each loop iteration. This makes the whole thing working when no barrier() calls are issued, and without any slowdown (maybe about ten CPU cycles, but JITing the caller functions takes way more time than that). + * So, Clover cannot use independent work-items when \c barrier() is never called. This is achieved by using a tiny dummy context at the beginning of \c Coal::CPUKernelWorkGroup::run(), holding only the data needed by the built-in functions like \c get_global_id(), that is to say the work-item index. Then, a \c while() loop is used to execute the work-items sequentially, incrementing the work-item index of the dummy context at each loop iteration. This makes the whole thing working when no \c barrier() calls are issued, and without any slowdown (maybe about ten CPU cycles, but JITing the caller functions takes way more time than that). * * \section implementation Implementation * - * How is barrier() implemented then ? Simple, when the first call to barrier() is made, Coal::CPUKernelWorkGroup::barrier() begins by allocating memory for the stacks (and puts its pointer in a TLS memory location). This function is also able to reuse the memory allocated by a previous work-group run on the same thread if it is large enough to hold the stacks. This way, it's a bit faster. + * How is \c barrier() implemented then ? Simple, when the first call to \c barrier() is made, \c Coal::CPUKernelWorkGroup::barrier() begins by allocating memory for the stacks (and puts its pointer in a TLS memory location). This function is also able to reuse the memory allocated by a previous work-group run on the same thread if it is large enough to hold the stacks. This way, it's a bit faster. * - * A note about this memory : it doesn't contain only the stacks, but also a Coal::CPUKernelWorkGroup::Context structure. This memory is accessed using Coal::CPUKernelWorkGroup::getContextAddr() that is given an index and returns a pointer to a Coal::CPUKernelWorkGroup::Context. The stack of this context is located just after the Context structure. + * A note about this memory : it doesn't contain only the stacks, but also a \c Coal::CPUKernelWorkGroup::Context structure. This memory is accessed using \c Coal::CPUKernelWorkGroup::getContextAddr() that is given an index and returns a pointer to a \c Coal::CPUKernelWorkGroup::Context. The stack of this context is located just after the Context structure. * - * Once the memory is allocated, Coal::CPUKernelWorkGroup::barrier() can proceed as if all was normal. It first tries to take the next context, and checks if it is initialized (that is to say its work-item has already begun and is currently halted somewhere in its execution). It's easy to do as mmap() zero-initializes the memory. If the work-item isn't yet initialized, it is created and initialized to point to the kernel function, with the correct \c args. + * Once the memory is allocated, \c Coal::CPUKernelWorkGroup::barrier() can proceed as if all was normal. It first tries to take the next context, and checks if it is initialized (that is to say its work-item has already begun and is currently halted somewhere in its execution). It's easy to do as \c mmap() zero-initializes the memory. If the work-item isn't yet initialized, it is created and initialized to point to the kernel function, with the correct \c args. * - * After the context creation, all that is needed is to swap the contexts, that is to say to save the current context in the memory location, and to jump to the next. If this is the first time barrier() is called, the current context is simply the "main context" of the thread, and it gets saved like any other context: barrier() successfully achieved to work even when a dummy context is used. + * After the context creation, all that is needed is to swap the contexts, that is to say to save the current context in the memory location, and to jump to the next. If this is the first time \c barrier() is called, the current context is simply the "main context" of the thread, and it gets saved like any other context: \c barrier() successfully achieved to work even when a dummy context is used. * * The context being swapped, the execution can begin or continue in the next work-item. * * \section finishing End of execution * - * One thing remains to be done, as pointed out at the end of \ref problem : when a barrier() has been encountered and a work-item finishes, we cannot just launch the next, as it has already begun before. So, we need to separately handle the case of a barrier() having been called. Coal::CPUKernelWorkGroup::run(), if a barrier() was called and when the first work-item finishes, doesn't launch the next one but goes directly in another loop. + * One thing remains to be done, as pointed out at the end of \ref problem : when a \c barrier() has been encountered and a work-item finishes, we cannot just launch the next, as it has already begun before. So, we need to separately handle the case of a \c barrier() having been called. \c Coal::CPUKernelWorkGroup::run(), if a \c barrier() was called and when the first work-item finishes, doesn't launch the next one but goes directly in another loop. * - * This loop simply calls swapcontext() for each remaining work-item. The other work-items will each terminate. An interesting property of contexts is used here: when a context finishes its execution, the execution returns to the context that have created it, the context given here in Coal::CPUKernelWorkGroup::barrier() : + * This loop simply calls \c swapcontext() for each remaining work-item. The other work-items will each terminate. An interesting property of contexts is used here: when a context finishes its execution, the execution returns to the context that have created it, the context given here in \c Coal::CPUKernelWorkGroup::barrier() : * * \code * Context *main = getContextAddr(0); // The context not created with makecontext @@ -101,5 +101,5 @@ * next->context.uc_link = &main->context; // <== here * \endcode * - * That means for clover that when a work-item finishes, the execution flow will return to Coal::CPUKernelWorkGroup::run() where it has left, that is to say at the swapcontext() call. This allows Clover to simply terminate all the work-items. + * That means for clover that when a work-item finishes, the execution flow will return to \c Coal::CPUKernelWorkGroup::run() where it has left, that is to say at the \c swapcontext() call. This allows Clover to simply terminate all the work-items. */ \ No newline at end of file diff --git a/doc/events.dox b/doc/events.dox index 0e6c879..5521ce9 100644 --- a/doc/events.dox +++ b/doc/events.dox @@ -5,23 +5,23 @@ * * \section overview Overview * - * A command queue is an object storing a list of events. In Clover, Coal::CommandQueue stores a list of Coal::Event objects. + * A command queue is an object storing a list of events. In Clover, \c Coal::CommandQueue stores a list of \c Coal::Event objects. * - * An event is an action that can be performed. The Coal::Event documentation page contains a big inheritance graph, so you can see that there are many events inheriting each other. An event is for instance a write to a memory object, a copy between an image and a buffer, or the execution of an OpenCL kernel. + * An event is an action that can be performed. The \c Coal::Event documentation page contains a big inheritance graph, so you can see that there are many events inheriting each other. An event is for instance a write to a memory object, a copy between an image and a buffer, or the execution of an OpenCL kernel. * - * There are also events that are called "dummy" in Clover: they do nothing and are simple information for the command queue. They are Coal::BarrierEvent, Coal::UserEvent, Coal::WaitForEventsEvent and Coal::MarkerEvent. + * There are also events that are called "dummy" in Clover: they do nothing and are simple information for the command queue. They are \c Coal::BarrierEvent, \c Coal::UserEvent, \c Coal::WaitForEventsEvent and \c Coal::MarkerEvent. * * \section queuing Queuing events * - * Queuing an event is the action of adding it to a command queue. A queued event will be executed by the command queue when certain conditions are met. A client application queues events by calling \ref src/api/api_enqueue.cpp "clEnqueueXXX()", for example clEnqueueCopyImageToBuffer(). These function create an object inheriting Coal::Event and call Coal::CommandQueue::queueEvent(). + * Queuing an event is the action of adding it to a command queue. A queued event will be executed by the command queue when certain conditions are met. A client application queues events by calling \ref src/api/api_enqueue.cpp \c "clEnqueueXXX()", for example \c clEnqueueCopyImageToBuffer(). These function create an object inheriting \c Coal::Event and call \c Coal::CommandQueue::queueEvent(). * - * In Clover, a Coal::Event object doesn't do anything besides checking its arguments. The work is all done in a worker thread, discussed later on this page. + * In Clover, a \c Coal::Event object doesn't do anything besides checking its arguments. The work is all done in a worker thread, discussed later on this page. * * \section ordering Ordering the events * - * Events are meant to be executed on a device. Clovers uses two event queues in order to do that, as there may be several Coal::CommandQueue objects for one single Coal::DeviceInterface. That means that Coal::CommandQueue has to keep a list of its events, and that Coal::DeviceInterface must do the same, with a separate list. + * Events are meant to be executed on a device. Clovers uses two event queues in order to do that, as there may be several \c Coal::CommandQueue objects for one single \c Coal::DeviceInterface. That means that \c Coal::CommandQueue has to keep a list of its events, and that \c Coal::DeviceInterface must do the same, with a separate list. * - * In order to have events executed, the command queue "pushes" events on the device using Coal::DeviceInterface::pushEvent(). + * In order to have events executed, the command queue "pushes" events on the device using \c Coal::DeviceInterface::pushEvent(). * * \dot * digraph G { @@ -41,21 +41,21 @@ * * The semantics are also different: on the device, the events are unordered. That means that there is no guarantee that an event stored after another in the device's event list will actually be executed after the previous one. This allows worker threads to pick up events without having to check their order: if an event is available, take it and run it. * - * On a Coal::CommandQueue object, events are ordered. If the queue has the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property disabled, it's simple: the queue waits for an event to complete before pushing the next to the device. When this property is enabled, more complex heuristics are used. They are explained in the Coal::CommandQueue::pushEventsOnDevice() function documentation. + * On a \c Coal::CommandQueue object, events are ordered. If the queue has the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property disabled, it's simple: the queue waits for an event to complete before pushing the next to the device. When this property is enabled, more complex heuristics are used. They are explained in the \c Coal::CommandQueue::pushEventsOnDevice() function documentation. * - * Roughly, every time an event completes, the Coal::CommandQueue explores its list of events and pushes all of them that meet certain conditions. For example, all the evens they are waiting on must be completed (see Coal::Event::waitEvents()), they have not to be already pushed, and they must be before any Coal::BarrierEvent() event. + * Roughly, every time an event completes, the \c Coal::CommandQueue explores its list of events and pushes all of them that meet certain conditions. For example, all the evens they are waiting on must be completed (see \c Coal::Event::waitEvents()), they have not to be already pushed, and they must be before any \c Coal::BarrierEvent() event. * * \section worker Worker threads * - * This section is specific to Coal::CPUDevice. The only thing a device has to do is to re-implement Coal::DeviceInterface::pushEvent(). An hardware-accelerated device can then simply push them to a ring buffer or something like that. + * This section is specific to \c Coal::CPUDevice. The only thing a device has to do is to re-implement \c Coal::DeviceInterface::pushEvent(). An hardware-accelerated device can then simply push them to a ring buffer or something like that. * - * Coal::CPUDevice has to do the ordering and dispatching between CPU cores itself. When a Coal::CPUDevice is first instantiated, it creates in Coal::CPUDevice::init() one "worker thread" per CPU core detected on the host system. + * \c Coal::CPUDevice has to do the ordering and dispatching between CPU cores itself. When a \c Coal::CPUDevice is first instantiated, it creates in \c Coal::CPUDevice::init() one "worker thread" per CPU core detected on the host system. * - * These worker threads are a simple loop polling for events to execute. The loop is in the worker() function. At each loop, Coal::CPUDevice::getEvent() is called. This function blocks until an event is available in the CPU event list, and then returns the first. + * These worker threads are a simple loop polling for events to execute. The loop is in the \c worker() function. At each loop, \c Coal::CPUDevice::getEvent() is called. This function blocks until an event is available in the CPU event list, and then returns the first. * - * For the vast majority of the events, once they are returned, they are removed from the event list. Doing that ensures that an event gets executed only one time, by one worker thread. Coal::KernelEvent objects are different, as a kernel is executed in chunks of work-items called "work groups". Each work-group can be executed in parallel with the others. For these events, the event is not removed from the event list until all the work-groups are executed. + * For the vast majority of the events, once they are returned, they are removed from the event list. Doing that ensures that an event gets executed only one time, by one worker thread. \c Coal::KernelEvent objects are different, as a kernel is executed in chunks of work-items called "work groups". Each work-group can be executed in parallel with the others. For these events, the event is not removed from the event list until all the work-groups are executed. * - * In fact, Coal::CPUKernelEvent::reserve() is called. This function locks a mutex in the Coal::CPUKernelEvent object, and returns whether the worker thread is about to run the last work-group of the kernel. If this is the case, the event is removed from the event list. If not, it is kept, and other worker threads will be able to run other work-groups. When the worker thread has its work-group reserved, it calls Coal::CPUKernelEvent::takeInstance(). This function unlocks the mutex, allowing the other worker threads to get other work-groups, and returns a Coal::CPUKernelWorkGroup object. These objects are described at the end of \ref llvm. + * In fact, \c Coal::CPUKernelEvent::reserve() is called. This function locks a mutex in the \c Coal::CPUKernelEvent object, and returns whether the worker thread is about to run the last work-group of the kernel. If this is the case, the event is removed from the event list. If not, it is kept, and other worker threads will be able to run other work-groups. When the worker thread has its work-group reserved, it calls \c Coal::CPUKernelEvent::takeInstance(). This function unlocks the mutex, allowing the other worker threads to get other work-groups, and returns a \c Coal::CPUKernelWorkGroup object. These objects are described at the end of \ref llvm. * - * As said above in this document, the Coal::Event objects don't do anything. They are simple device-independent pieces of information, with an optional "device-data" field (Coal::Event::deviceData()). The actual work is done in worker(), in a big switch structure. + * As said above in this document, the \c Coal::Event objects don't do anything. They are simple device-independent pieces of information, with an optional "device-data" field \c (\c Coal::Event::deviceData()). The actual work is done in \c worker(), in a big switch structure. */ \ No newline at end of file diff --git a/doc/llvm.dox b/doc/llvm.dox index 2525e6c..fb662ca 100644 --- a/doc/llvm.dox +++ b/doc/llvm.dox @@ -11,57 +11,57 @@ * * \section compilation Compiling OpenCL C to LLVM IR * - * The first step when one wants to launch a program is to compile it. It is done API-wise by the clCreateProgramWithSource() and clBuildProgram() functions. + * The first step when one wants to launch a program is to compile it. It is done API-wise by the \c clCreateProgramWithSource() and \c clBuildProgram() functions. * - * The first function creates a Coal::Program object, using Coal::Program::loadSources(). You can see at the top of this function a line prepending the OpenCL C standard header to the source being built. This function consists mainly of a concatenation of the strings given (they may be zero-terminated or not). + * The first function creates a \c Coal::Program object, using \c Coal::Program::loadSources(). You can see at the top of this function a line prepending the OpenCL C standard header to the source being built. This function consists mainly of a concatenation of the strings given (they may be zero-terminated or not). * - * Once the Coal::Program objects holds the source, clBuildProgram() can be used to compile it. It does so by invoking Coal::Program::build(). + * Once the \c Coal::Program objects holds the source, \c clBuildProgram() can be used to compile it. It does so by invoking \c Coal::Program::build(). * - * This big function compiles and links the program, so it will be explained later. The actual compilation job is done by Coal::Compiler. It does that in Coal::Compiler::compile, and then keeps the compilation log and options at hand for future use. + * This big function compiles and links the program, so it will be explained later. The actual compilation job is done by \c Coal::Compiler. It does that in \c Coal::Compiler::compile, and then keeps the compilation log and options at hand for future use. * - * When a program is compiled, the client application can retrieve it by using clGetProgramInfo(). + * When a program is compiled, the client application can retrieve it by using \c clGetProgramInfo(). * * \section linking Linking the program * - * The compilation step produced an "unlinked" module, that needs to be linked with the OpenCL C standard library, but only if the device for which the program is being built needs to. It's also possible to load a previously-compiled binary in a Coal::Program using Coal::Program::loadBinaries(). Doing this also loads an unlinked binary. + * The compilation step produced an "unlinked" module, that needs to be linked with the OpenCL C standard library, but only if the device for which the program is being built needs to. It's also possible to load a previously-compiled binary in a \c Coal::Program using \c Coal::Program::loadBinaries(). Doing this also loads an unlinked binary. * - * The separation between the unlinked binary and the linked one is the reason for the existence of Coal::Program::DeviceDependent::unlinked_binary. The source is compiled to LLVM IR in a module (temporarily stored in linked_module, though it isn't linked yet), that is dumped to unlinked_binary and then linked to form a full executable program. + * The separation between the unlinked binary and the linked one is the reason for the existence of \c Coal::Program::DeviceDependent::unlinked_binary. The source is compiled to LLVM IR in a module (temporarily stored in linked_module, though it isn't linked yet), that is dumped to unlinked_binary and then linked to form a full executable program. * - * So, Coal::Program::build() runs its code for every device for which a program must be built. These devices are either given at Coal::Program::loadBinaries, or as arguments to Coal::Program::build(). + * So, \c Coal::Program::build() runs its code for every device for which a program must be built. These devices are either given at \c Coal::Program::loadBinaries, or as arguments to \c Coal::Program::build(). * * The first step is to see if the program was loaded with sources. If it's the case, they have to be compiled (see \ref compilation). * - * Then, if the device for which the program is being built asks for that (Coal::DeviceProgram::linkStdLib(), Coal::CPUDevice does so), the program is linked with the OpenCL C standard library of Clover. An hardware-accelerated device normally will not want to have stdlib linked, as it's easier to convert LLVM IR to hardware-specific instructions when OpenCL built-ins functions are left in the form "call foo" instead of being inlined with inefficient CPU-centric code. + * Then, if the device for which the program is being built asks for that \c (\c Coal::DeviceProgram::linkStdLib(), \c Coal::CPUDevice does so), the program is linked with the OpenCL C standard library of Clover. An hardware-accelerated device normally will not want to have stdlib linked, as it's easier to convert LLVM IR to hardware-specific instructions when OpenCL built-ins functions are left in the form "call foo" instead of being inlined with inefficient CPU-centric code. * - * After this linking pass, optimization passes are created. The first ones are created by Coal::Program itself. They remove all the functions that are not kernels and are not called by a kernel. It allows LLVM to remove all the unused stdlib functions. + * After this linking pass, optimization passes are created. The first ones are created by \c Coal::Program itself. They remove all the functions that are not kernels and are not called by a kernel. It allows LLVM to remove all the unused stdlib functions. * - * Then, the device is allowed to add more optimization or analysis passes. Coal::CPUProgram::createOptimizationPasses() adds standard link-time optimizations, but hardware-accelerated devices could add autovectorizing, lowering, or analysis passes. + * Then, the device is allowed to add more optimization or analysis passes. \c Coal::CPUProgram::createOptimizationPasses() adds standard link-time optimizations, but hardware-accelerated devices could add autovectorizing, lowering, or analysis passes. * - * Finally, Coal::DeviceProgram::build is called. It's a no-op function for Coal::CPUDevice as it uses directly the module with a LLVM JIT, but hardware devices could use this function to actually compile the program for the target device (LLVM to TGSI transformation for example). + * Finally, \c Coal::DeviceProgram::build is called. It's a no-op function for \c Coal::CPUDevice as it uses directly the module with a LLVM JIT, but hardware devices could use this function to actually compile the program for the target device (LLVM to TGSI transformation for example). * * The program is now built and ready to be usable ! * * \section kernels Finding kernels * - * Now that the program is built, it's time to get its kernels. The functions declared as kernel in OpenCL C (with the \c __kernel attribute) are registered by Clang using the !opencl.kernels metadata. They are read from the LLVM module by Coal::Program::kernelFunctions(). Note that this function is device dependent, as it have to use the LLVM IR generated for the specified device. + * Now that the program is built, it's time to get its kernels. The functions declared as kernel in OpenCL C (with the \c __kernel attribute) are registered by Clang using the !opencl.kernels metadata. They are read from the LLVM module by \c Coal::Program::kernelFunctions(). Note that this function is device dependent, as it have to use the LLVM IR generated for the specified device. * - * When the kernels are found, Coal::Kernel objects can be instantiated. These objects are again device-independent as requested by the OpenCL spec. A Coal::Kernel object is mainly a name and a list of device-specific information. There is for instance the llvm::Function object that will be called. + * When the kernels are found, \c Coal::Kernel objects can be instantiated. These objects are again device-independent as requested by the OpenCL spec. A \c Coal::Kernel object is mainly a name and a list of device-specific information. There is for instance the \c llvm::Function object that will be called. * - * Once the Coal::Kernel object is created, Coal::Kernel::addFunction() is called for every device for which the Coal::Program is built. This function has the responsibility to explore the arguments of the function and to create a list of device-independent Coal::Kernel::Arg objects (kernel arguments). For instance, when it sees an argument of type <4 x i32>, it converts it to a Coal::Kernel::Arg of kind Coal::Kernel::Arg::Int32 and vector dimension 4. + * Once the \c Coal::Kernel object is created, \c Coal::Kernel::addFunction() is called for every device for which the \c Coal::Program is built. This function has the responsibility to explore the arguments of the function and to create a list of device-independent \c Coal::Kernel::Arg objects (kernel arguments). For instance, when it sees an argument of type <4 x i32>, it converts it to a \c Coal::Kernel::Arg of kind \c Coal::Kernel::Arg::Int32 and vector dimension 4. * * \section arguments Setting kernel arguments * - * The Coal::Kernel::Arg objects are interesting. They are an abstraction layer between the host CPU and the device. They also enable Coal::Kernel to implement its Coal::Kernel::setArg() function, that performs checks on the value given as argument. + * The \c Coal::Kernel::Arg objects are interesting. They are an abstraction layer between the host CPU and the device. They also enable \c Coal::Kernel to implement its \c Coal::Kernel::setArg() function, that performs checks on the value given as argument. * - * This class also contains semantic information specific to OpenCL. For instance, in LLVM, the address space qualifiers like __global or __local are represented as address spaces (0 = private, etc). Coal::Kernel::addFunction() translates these address spaces into Coal::Kernel::Arg::File values. + * This class also contains semantic information specific to OpenCL. For instance, in LLVM, the address space qualifiers like __global or __local are represented as address spaces (0 = private, etc). \c Coal::Kernel::addFunction() translates these address spaces into \c Coal::Kernel::Arg::File values. * - * When the users call clSetKernelArg(), the execution flow arrives at Coal::Kernel::setArg(). This function puts the correct value in the Coal::Kernel::Arg object, and does some checks. It is also his responsibility to recognize Coal::Sampler objects. + * When the users call \c clSetKernelArg(), the execution flow arrives at \c Coal::Kernel::setArg(). This function puts the correct value in the \c Coal::Kernel::Arg object, and does some checks. It is also his responsibility to recognize \c Coal::Sampler objects. * - * Samplers are a bit special as they are pointers to Coal::Sampler objects on the host CPU, and plain uint32_t values on the kernel side. This makes their translation from LLVM type to Coal::Kernel::Kind a bit difficult, as Clover only sees an LLVM \c i32 type for a sampler and also for a normal \c uint32. + * Samplers are a bit special as they are pointers to \c Coal::Sampler objects on the host CPU, and plain uint32_t values on the kernel side. This makes their translation from LLVM type to \c Coal::Kernel::Kind a bit difficult, as Clover only sees an LLVM \c i32 type for a sampler and also for a normal \c uint32. * - * The trick used in Clover is to store in memory a list of the known samplers. When a Coal::Sampler object is created, it is registered in this list. When it is deleted, its index is removed from the list. This in implemented in Coal::Object and shared between all the Coal classes. It allows the implementation of functions like Coal::Object::isa(), very useful to check that arguments given by the user are sane. + * The trick used in Clover is to store in memory a list of the known samplers. When a \c Coal::Sampler object is created, it is registered in this list. When it is deleted, its index is removed from the list. This in implemented in \c Coal::Object and shared between all the Coal classes. It allows the implementation of functions like \c Coal::Object::isa(), very useful to check that arguments given by the user are sane. * - * So, Coal::Object::isa() is used to recognize when an argument passed to Coal::Kernel::setArg() is in fact a sampler. When it is the case, the pointer to Coal::Sampler is replaced by the sampler's "bitfield" representation, using Coal::Sampler::bitfield(). + * So, \c Coal::Object::isa() is used to recognize when an argument passed to \c Coal::Kernel::setArg() is in fact a sampler. When it is the case, the pointer to \c Coal::Sampler is replaced by the sampler's "bitfield" representation, using \c Coal::Sampler::bitfield(). * * \code * // Special case for samplers (pointers in C++, uint32 in OpenCL). @@ -82,9 +82,9 @@ * * \section event Queuing an event * - * Once the Coal::Kernel object is created and its args set, the client application can call clEnqueueTask() or clEnqueueNDRangeKernel(). These functions create a Coal::KernelEvent object responsible for telling the device to execute the kernel. + * Once the \c Coal::Kernel object is created and its args set, the client application can call \c clEnqueueTask() or \c clEnqueueNDRangeKernel(). These functions create a \c Coal::KernelEvent object responsible for telling the device to execute the kernel. * - * When the event arrives on the device (see \ref events), Coal::CPUDevice initializes the LLVM JIT (Coal::CPUProgram::initJIT()) and then does that in src/core/cpu/worker.cpp : + * When the event arrives on the device (see \ref events), \c Coal::CPUDevice initializes the LLVM JIT \c (\c Coal::CPUProgram::initJIT()) and then does that in src/core/cpu/worker.cpp : * * \code * KernelEvent *e = (KernelEvent *)event; @@ -100,25 +100,25 @@ * delete instance; * \endcode * - * The first step is to use Coal::Event::deviceData() to get a Coal::CPUKernelEvent object. See Coal::Event::setDeviceData() and Coal::DeviceInterface::initEventDeviceData(). + * The first step is to use \c Coal::Event::deviceData() to get a \c Coal::CPUKernelEvent object. See \c Coal::Event::setDeviceData() and \c Coal::DeviceInterface::initEventDeviceData(). * - * This Coal::CPUKernelEvent holds information about the event needed by Coal::CPUDevice. + * This \c Coal::CPUKernelEvent holds information about the event needed by \c Coal::CPUDevice. * * \section workgroups Running the work groups * - * The next line is interesting : Coal::CPUKernelEvent::takeInstance() is called. This function works in pair with Coal::CPUKernelEvent::reserve() called from Coal::CPUDevice::getEvent(). + * The next line is interesting : \c Coal::CPUKernelEvent::takeInstance() is called. This function works in pair with \c Coal::CPUKernelEvent::reserve() called from \c Coal::CPUDevice::getEvent(). * * A kernel is run in multiple "work groups", that is to say batches of work items. The worker threads (see \ref events) take work-groups one at a time, so there can be multiple work groups of a single kernel running concurrently on a multicore CPU. * - * \ref events gives more details about that, but the main principle is that there is a list of events a worker thread can execute. For Coal::KernelEvent, a worker thread calls Coal::CPUKernelEvent::reserve() to see if there is a work-group available for execution (that is to say if the work groups aren't yet all executed). If there is one available, a mutex is locked and the function returns. Then, the worker thread calls Coal::CPUKernelEvent::takeInstance() to actually get the work-group, and runs it through Coal::CPUKernelWorkGroup::run(). + * \ref events gives more details about that, but the main principle is that there is a list of events a worker thread can execute. For \c Coal::KernelEvent, a worker thread calls \c Coal::CPUKernelEvent::reserve() to see if there is a work-group available for execution (that is to say if the work groups aren't yet all executed). If there is one available, a mutex is locked and the function returns. Then, the worker thread calls \c Coal::CPUKernelEvent::takeInstance() to actually get the work-group, and runs it through \c Coal::CPUKernelWorkGroup::run(). * * \section args Passing arguments to the kernel * - * Once the work-group is taken, it is run and must call the kernel function (using the JIT) for every work-item. This is done very simply by getting a function pointer to the kernel using llvm::ExecutionEngine::getPointerToFunction(). This function must now be called with the needed arguments. + * Once the work-group is taken, it is run and must call the kernel function (using the JIT) for every work-item. This is done very simply by getting a function pointer to the kernel using \c llvm::ExecutionEngine::getPointerToFunction(). This function must now be called with the needed arguments. * * The difficult thing is that C++ doesn't allow to give arbitrary arguments to a function. A function can receive arbitrary arguments, using void foo(int argc, ...), but an arbitrary list of arguments cannot be passed like in foo(va_build(std_vector));. They must be known at compilation-time. * - * The solution used by LLVM is to use a function like llvm::JIT::runFunction(function, vector of args). This function internally creates a "stub" function taking zero arguments but calling the target function itself with its arguments passed as constants. That is to say, when we want to call bar(3, 4);, a stub is compiled like this : + * The solution used by LLVM is to use a function like \c llvm::JIT::runFunction(function, vector of args). This function internally creates a "stub" function taking zero arguments but calling the target function itself with its arguments passed as constants. That is to say, when we want to call bar(3, 4);, a stub is compiled like this : * * \code * void stub() { @@ -139,7 +139,7 @@ * * The solution retained was to mimic the way LLVM does its stub. As a kernel can be run multiple times with different arguments (when the application explicitly does so, or when there are \c __local pointers needing to be reallocated between each work-group), the stub function cannot simply use constants, because rebuilding it for each set of arguments would be too slow. * - * So, the Clover's stub takes exactly one parameter: a \c void* pointer. This pointer contains the actual parameters, carefully aligned by Coal::CPUKernel::typeOffset(). The stub itself is built by Coal::CPUKernel::callFunction() and is like that : + * So, the Clover's stub takes exactly one parameter: a \c void* pointer. This pointer contains the actual parameters, carefully aligned by \c Coal::CPUKernel::typeOffset(). The stub itself is built by \c Coal::CPUKernel::callFunction() and is like that : * * \code * void stub(void *args) { @@ -161,9 +161,9 @@ * * The OpenCL C language provides built-ins that can be called from the kernels. For the most of them, there is no problem: they can either be implemented as LLVM instructions and then compiled for the CPU, or the standard library (src/core/runtime/stdlib.c) provides an implementation. * - * But there are cases where information outside the kernel is needed. For example, the get_work_dim() builtin takes no argument, but has to return a value dependent of the current Coal::KernelEvent being run. + * But there are cases where information outside the kernel is needed. For example, the \c get_work_dim() builtin takes no argument, but has to return a value dependent of the current \c Coal::KernelEvent being run. * - * In order to handle that, a call is made from the kernel to the Clover library. It's made possible by a very handy LLVM function: llvm::ExecutionEngine::InstallLazyFunctionCreator() called by Coal::CPUProgram::initJIT(). This function allows Clover to register a function that will resolve function names to function addresses. This way, a function called "get_work_dim" in the kernel will be passed to this function creator, that will return a pointer to get_work_dim() in src/core/cpu/builtins.cpp. + * In order to handle that, a call is made from the kernel to the Clover library. It's made possible by a very handy LLVM function: \c llvm::ExecutionEngine::InstallLazyFunctionCreator() called by \c Coal::CPUProgram::initJIT(). This function allows Clover to register a function that will resolve function names to function addresses. This way, a function called "get_work_dim" in the kernel will be passed to this function creator, that will return a pointer to \c get_work_dim() in src/core/cpu/builtins.cpp. * * \code * void *getBuiltin(const std::string &name) @@ -177,7 +177,7 @@ * } * \endcode * - * It's good, but one problem remains: get_work_dim() doesn't take any argument, but has to return an information about the currently-running kernel. Here, the internal structure of Coal::CPUDevice has to be taken into account. The device creates one worker thread per CPU core, and each of these worker threads can run only one work-group at a time, but multiple worker threads can run different kernels and work groups concurrently. + * It's good, but one problem remains: \c get_work_dim() doesn't take any argument, but has to return an information about the currently-running kernel. Here, the internal structure of \c Coal::CPUDevice has to be taken into account. The device creates one worker thread per CPU core, and each of these worker threads can run only one work-group at a time, but multiple worker threads can run different kernels and work groups concurrently. * * So, the solution retained is a Thread-Local variable. Such a variable is like a global variable (shared among all the classes and functions of a project), but its value is private to the currently-running thread. As a thread always handles only one work-group, a TLS variable is what is needed, and what Clover uses. It's named \ref g_work_group. * @@ -196,7 +196,7 @@ * !incVec(p_work_dim, p_dummy_context.local_id, p_max_local_id)); * \endcode * - * This code can be found in Coal::CPUKernelWorkGroup::run(). The incVec() call is there to handle the 3D global and local IDs. It returns true when the vector we are incrementing reaches \c p_max_local_id. + * This code can be found in \c Coal::CPUKernelWorkGroup::run(). The \c incVec() call is there to handle the 3D global and local IDs. It returns true when the vector we are incrementing reaches \c p_max_local_id. * * More explanation of this part can be found on the \ref barrier page. */ diff --git a/doc/mainpage.dox b/doc/mainpage.dox index b8229b3..0542983 100644 --- a/doc/mainpage.dox +++ b/doc/mainpage.dox @@ -31,11 +31,11 @@ * Here is the recommended order in which to read the functions: * * - A small self-contained API is used to get information about the "Platform", in src/api/api_platform.cpp - * - Then, the platform is used to query the devices, in src/api/api_devices.cpp . This is the first API file that uses C++ classes. The hardware abstraction layer used is Coal::DeviceInterface. Take a look at the Coal::CPUDevice class, implementing this interface, in src/core/cpu/device.cpp . - * - The device is then used to create a Coal::Context, in src/api/api_context.cpp then src/core/context.cpp . - * - When a context is built, one can use it to do many thing. For example, it's possible to create buffer objects. Take a look at clCreateBuffer(), and the Coal::Buffer and Coal::CPUBuffer classes. - * - A big part of Clover is dedicated to command queues, implemented in the Coal::CommandQueue class. Its special architecture is documented in detail in the \subpage events page. - * - There is also the Coal::Program class, that compiles OpenCL C code into LLVM IR using Coal::Compiler (a simple wrapper around Clang). - * - A program contains one or more kernels. Take a look at Coal::Kernel for how they are handled. - * - Coal::Program and Coal::Kernel are device-independent classes. They use device-specific classes like Coal::CPUProgram and Coal::CPUKernel. The former translates LLVM IR instructions into machine code using the LLVM's JIT, the latter is a bit more complex and described in the \subpage llvm page. + * - Then, the platform is used to query the devices, in src/api/api_devices.cpp . This is the first API file that uses C++ classes. The hardware abstraction layer used is \c Coal::DeviceInterface. Take a look at the \c Coal::CPUDevice class, implementing this interface, in src/core/cpu/device.cpp . + * - The device is then used to create a \c Coal::Context, in src/api/api_context.cpp then src/core/context.cpp . + * - When a context is built, one can use it to do many thing. For example, it's possible to create buffer objects. Take a look at \c clCreateBuffer(), and the \c Coal::Buffer and \c Coal::CPUBuffer classes. + * - A big part of Clover is dedicated to command queues, implemented in the \c Coal::CommandQueue class. Its special architecture is documented in detail in the \subpage events page. + * - There is also the \c Coal::Program class, that compiles OpenCL C code into LLVM IR using \c Coal::Compiler (a simple wrapper around Clang). + * - A program contains one or more kernels. Take a look at \c Coal::Kernel for how they are handled. + * - \c Coal::Program and \c Coal::Kernel are device-independent classes. They use device-specific classes like \c Coal::CPUProgram and \c Coal::CPUKernel. The former translates LLVM IR instructions into machine code using the LLVM's JIT, the latter is a bit more complex and described in the \subpage llvm page. */ -- cgit v1.2.3