An explicit parallel composition operator (Ateji PX’s parallel bar) is pretty useful for hybrid CPU-GPU programming.
All frameworks I know embed GPU code into some sort of method call, sometimes indirectly. In any case, you end up writing something like
The bottom line is that input and output data, either implicit or explicit, are passed from CPU to GPU before running the code, and then from GPU to CPU when computation terminates. In other words, the GPU spends a lot of time idle waiting for I/O.
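To make the cost of this pattern concrete, here is a minimal plain-Java sketch of a blocking offload call. It does not use any real GPU framework: `copyToDevice`, `runKernel` and `copyToHost` are hypothetical stand-ins that only simulate the three phases on the CPU, and the "kernel" simply squares each element.

```java
import java.util.Arrays;

final class BlockingOffload {
    // Hypothetical stand-in for a host-to-device copy (simulated with an array copy).
    static float[] copyToDevice(float[] host) {
        return Arrays.copyOf(host, host.length);
    }

    // Hypothetical stand-in for a GPU kernel: squares every element.
    static float[] runKernel(float[] device) {
        float[] out = new float[device.length];
        for (int i = 0; i < device.length; i++) {
            out[i] = device[i] * device[i];
        }
        return out;
    }

    // Hypothetical stand-in for a device-to-host copy.
    static float[] copyToHost(float[] device) {
        return Arrays.copyOf(device, device.length);
    }

    // The three phases run strictly one after another:
    // the "device" does nothing useful while either copy is in progress.
    static float[] offload(float[] input) {
        float[] onDevice = copyToDevice(input);   // compute unit idle
        float[] result = runKernel(onDevice);     // bus idle
        return copyToHost(result);                // compute unit idle again
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(offload(new float[] {1f, 2f, 3f})));
    }
}
```

Calling `offload` in a loop repeats the same copy, compute, copy sequence for every batch, which is where the idle time comes from.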
The key to GPU performance is to overlap data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls:
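As a language-neutral illustration of the overlap itself (this is not OpenCL or CUDA code), the following Java sketch prefetches the next batch on a separate thread while the current batch is being processed. The `transfer` and `compute` methods are hypothetical stand-ins for a device copy and a kernel launch, and the single-threaded executor plays the role of a copy engine.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class OverlapSketch {
    // Hypothetical stand-in for an asynchronous host-to-device copy.
    static float[] transfer(float[] host) {
        return host.clone();
    }

    // Hypothetical stand-in for a kernel: sums the batch.
    static float compute(float[] device) {
        float sum = 0f;
        for (float v : device) {
            sum += v;
        }
        return sum;
    }

    static float[] process(float[][] batches) {
        float[] results = new float[batches.length];
        if (batches.length == 0) {
            return results;
        }
        ExecutorService copyEngine = Executors.newSingleThreadExecutor();
        try {
            // Start copying the first batch before any computation begins.
            CompletableFuture<float[]> next =
                    CompletableFuture.supplyAsync(() -> transfer(batches[0]), copyEngine);
            for (int i = 0; i < batches.length; i++) {
                float[] current = next.join();        // wait until batch i has arrived
                if (i + 1 < batches.length) {
                    final int n = i + 1;
                    // Kick off the copy of batch i+1; it runs while batch i is computed.
                    next = CompletableFuture.supplyAsync(() -> transfer(batches[n]), copyEngine);
                }
                results[i] = compute(current);
            }
        } finally {
            copyEngine.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        float[] r = process(new float[][] {{1f, 2f}, {3f, 4f}, {5f}});
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

Even in this toy form, the bookkeeping (futures, an executor, an index captured for the lambda) is what obscures the simple idea of "copy and compute at the same time".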
With asynchronous calls like these, the intent is really to perform computation and communication in parallel. But since there’s no notion of parallel composition in OpenCL (or C, or any mainstream language for that matter), it is not possible to express that intent directly. Instead, you have to resort to a complex and not very intuitive mechanism of asynchronous calls. This is pretty low-level, and you’d be much happier if the compiler transformed your intent into these low-level calls for you. That’s precisely what Ateji PX does.
So what’s the intent? Expressing that CPU and GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing the #GPU annotation. A channel is visible from both branches and will be used for communication.
Note that running code on the CPU or on the GPU is only a matter of modifying the annotations (the choice can also be made at run time). No other change in the source code is required.
The GPU repeatedly waits for input data on channel c, performs computation and sends a result:
The CPU repeatedly sends input data and waits for results:
Computation and communication overlap because the communication channels have been declared as asynchronous:
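The structure described above, two parallel branches that exchange data over asynchronous channels, can be approximated in plain Java. This is not Ateji PX syntax and nothing here runs on a GPU: the worker thread merely plays the role of the #GPU branch, buffered `ArrayBlockingQueue`s stand in for asynchronous channels, and the names `results`, `WINDOW` and the squaring "kernel" are assumptions made for the sketch.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ChannelSketch {
    // How many inputs the CPU branch may send ahead of the results it has received.
    static final int WINDOW = 4;

    static int[] run(int[] inputs) {
        // Buffered queues: a send returns without waiting for the receiver,
        // as long as fewer than WINDOW items are in flight.
        BlockingQueue<Integer> c = new ArrayBlockingQueue<>(WINDOW);
        BlockingQueue<Integer> results = new ArrayBlockingQueue<>(WINDOW);
        int[] out = new int[inputs.length];

        // "GPU" branch: repeatedly wait for input on c, compute, send a result.
        Thread gpuBranch = new Thread(() -> {
            try {
                for (int i = 0; i < inputs.length; i++) {
                    int x = c.take();
                    results.put(x * x);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        gpuBranch.start();

        // "CPU" branch: keep up to WINDOW inputs in flight, then collect results.
        try {
            int sent = 0;
            for (int received = 0; received < inputs.length; received++) {
                while (sent < inputs.length && sent - received < WINDOW) {
                    c.put(inputs[sent++]);
                }
                out[received] = results.take();
            }
            gpuBranch.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while exchanging data", e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(new int[] {1, 2, 3, 4, 5, 6})));
    }
}
```

Because sends do not wait for the receiver, the worker always has its next input queued by the time it finishes the current one, which is the overlap the asynchronous channel declaration is meant to provide. The explicit window, thread and exception handling are exactly the plumbing that a parallel composition operator lets you leave to the compiler.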
The intent is clearly expressed in the source code: we have two parallel branches that communicate using channels, plus a single annotation stating that a branch should run on the GPU. There is no need to manage asynchronous calls and no need to use two different languages for coding an application: the Ateji PX compiler does all this for you.
The code is understandable and maintainable, and can run on multicore CPUs by simply changing the #GPU annotation. You can, for instance, debug the application on the CPU before deploying it on the GPU.
We prototyped a parallel version of the Mandelbrot set demo based on this idea, and achieved a speedup of 60x on a standard desktop PC. A pretty impressive speedup for just adding a #GPU annotation in the source code, isn’t it?