Ateji PX for Java - Java on GPU
Ateji PX supports GPU programming via a simple mechanism of annotated parallel branches. Contact us at firstname.lastname@example.org for early access to the technology preview program.
Ateji PX introduces the notion of a parallel branch at the language level, using the parallel bar operator "||". In the example below (the standard adder example), four branches run in parallel:
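As a rough analogue in plain Java (not Ateji PX syntax), the sketch below runs four independent branches with `java.util.concurrent`. It assumes that each branch of the adder example simply adds one to its own value; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

// Plain-Java stand-in for four parallel branches; not Ateji PX syntax.
class ParallelBranchesSketch {

    // Assumption: each branch adds one to its own element (hypothetical adder).
    static int[] incrementAll(int[] values) {
        int[] out = values.clone();
        CompletableFuture<?>[] branches = new CompletableFuture<?>[out.length];
        for (int i = 0; i < out.length; i++) {
            final int idx = i;
            // Each task touches a distinct element, so branches do not interfere.
            branches[i] = CompletableFuture.runAsync(() -> out[idx]++);
        }
        // Wait for every branch to finish before reading the results.
        CompletableFuture.allOf(branches).join();
        return out;
    }

    public static void main(String[] args) {
        int[] result = incrementAll(new int[] {1, 2, 3, 4});
        System.out.println(Arrays.toString(result)); // prints [2, 3, 4, 5]
    }
}
```

The explicit task array and the final `join` are the bookkeeping that a language-level branch construct is meant to hide.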
Branches communicate with each other via explicit message passing:
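In plain Java, message passing between two branches can be approximated with a `BlockingQueue` acting as the channel. This is only a sketch of the idea, not the Ateji PX channel API; the names `MessagePassingSketch` and `sendAndReceive` are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Two branches exchanging one message over a queue; not Ateji PX syntax.
class MessagePassingSketch {

    static int sendAndReceive(int value) {
        // The queue plays the role of a channel shared by both branches.
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(1);

        // Sending branch: puts one message on the channel.
        CompletableFuture<Void> sender = CompletableFuture.runAsync(() -> {
            try {
                channel.put(value);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Receiving branch: blocks until a message arrives, then adds one.
        CompletableFuture<Integer> receiver = CompletableFuture.supplyAsync(() -> {
            try {
                return channel.take() + 1;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        });

        sender.join();
        return receiver.join();
    }

    public static void main(String[] args) {
        System.out.println(sendAndReceive(41)); // prints 42
    }
}
```

Because the branches share nothing but the channel, either side could in principle be moved to another device without changing the other.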
On a multicore machine, Ateji PX will automatically use all available cores.
The key to GPU programming using Ateji PX is to attach location annotations to parallel branches. For example:
Because the CPU and GPU code are similar, Ateji PX annotations can assign branches at run time to either a GPU or a CPU. For instance, a problem can be split across multiple CPUs and multiple GPUs, making the best use of the available hardware.
Mapping the GPU memory hierarchy to the lexical scope
Managing the memory hierarchy is essential to achieving good performance.
A GPU typically has three levels of memory:
Global memory is accessible from all kernels but has the highest latency. At the other extreme, private memory has the lowest latency but cannot be accessed from two different kernels.
The following example shows how Ateji PX takes memory hierarchy into account:
Ateji PX expresses the mapping of variables to the memory hierarchy using lexical scope. In this example:
The use of nested branches and lexical scope makes it easy to understand the relationship between variables in the source code, their localization in memory and their lifetime. In contrast, other approaches tend to express memory hierarchy using annotations, with strange results such as a variable keeping its value while disappearing from lexical context.
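The idea of tying memory level to lexical scope can be illustrated in plain Java with nested loops standing in for nested branches. The mapping in the comments (outermost scope to global memory, per-group scope to an intermediate level, innermost scope to private memory) is an assumption made for illustration, and `LexicalScopeSketch` is a hypothetical name; the loops run sequentially here.

```java
import java.util.Arrays;

// Sequential loops standing in for nested parallel branches; not Ateji PX syntax.
class LexicalScopeSketch {

    static int[] groupSums(int[] data, int groupSize) {
        // Outermost scope: visible to every group and item.
        // Assumed to correspond to global memory.
        int groups = (data.length + groupSize - 1) / groupSize;
        int[] sums = new int[groups];

        for (int g = 0; g < groups; g++) {
            // Declared once per group and shared by its items.
            // Assumed to correspond to the intermediate memory level.
            int groupTotal = 0;
            int start = g * groupSize;
            int end = Math.min(start + groupSize, data.length);

            for (int i = start; i < end; i++) {
                // Declared per item and invisible elsewhere.
                // Assumed to correspond to private memory.
                int squared = data[i] * data[i];
                groupTotal += squared;
            }
            sums[g] = groupTotal;
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] sums = groupSums(new int[] {1, 2, 3, 4, 5}, 2);
        System.out.println(Arrays.toString(sums)); // prints [5, 25, 25]
    }
}
```

Each variable lives exactly as long as the block that declares it, which is the property the lexical-scope mapping relies on.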
Ateji PX for GPU implements a source-to-source translation from extended Java to OpenCL code.
This translation handles only the constructs that can be expressed in OpenCL, so only a subset of Java is translated. The translated constructs are those that perform well on a GPU, typically:
Java constructs that are very dynamic in nature or require intensive JVM support are not translated. They include:
Such constructs would not perform well on GPU architectures and are better kept on the CPU anyway.
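As an illustration of code that maps naturally onto OpenCL, the sketch below uses only primitive arithmetic, arrays, and a simple loop, with no object allocation or dynamic dispatch inside the loop. Whether Ateji PX would translate exactly this method is an assumption; `KernelStyleSketch` and `saxpy` are hypothetical names.

```java
import java.util.Arrays;

// Kernel-style Java: primitives, arrays and a flat loop only.
class KernelStyleSketch {

    // Computes out[i] = a * x[i] + y[i] for every index.
    // Each iteration is independent, so it could become one GPU work-item.
    static void saxpy(float a, float[] x, float[] y, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] out = new float[3];
        saxpy(2f, new float[] {1f, 2f, 3f}, new float[] {10f, 20f, 30f}, out);
        System.out.println(Arrays.toString(out)); // prints [12.0, 24.0, 36.0]
    }
}
```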
Although simple, the Ateji PX programming model performs well on GPUs. For the standard Mandelbrot example, we measured a 50x speedup over a standard desktop CPU.
The Mandelbrot example is structured as three parallel branches:
The structure of the code is as follows:
The benefit of structuring code with parallel branches and message passing is that the same code runs on a wide range of hardware architectures and scales well. For example, the GPU branch could also run on the CPU when no GPU is available, or run remotely on another server or in the cloud.
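A plain-Java sketch of this three-branch structure is shown below, with queues standing in for message passing. The roles chosen for the branches (one issuing row requests, one computing, one collecting results) are an assumption, as are all names; the compute branch runs on a CPU thread here, which is the fallback case described above.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Three threads connected by queues; not Ateji PX syntax.
class MandelbrotPipelineSketch {
    static final int MAX_ITER = 100;

    // Number of iterations before z = z*z + c escapes, capped at MAX_ITER.
    static int escapeTime(double cr, double ci) {
        double zr = 0, zi = 0;
        int n = 0;
        while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2 * zr * zi + ci;
            zr = t;
            n++;
        }
        return n;
    }

    static int[][] render(int width, int height) {
        BlockingQueue<Integer> rows = new LinkedBlockingQueue<>();
        BlockingQueue<int[]> results = new LinkedBlockingQueue<>();
        int[][] image = new int[height][];

        // Branch 1: sends one request per row, then -1 as an end marker.
        Thread requester = new Thread(() -> {
            for (int y = 0; y < height; y++) rows.add(y);
            rows.add(-1);
        });

        // Branch 2: the compute branch (the one that would carry a GPU annotation).
        Thread compute = new Thread(() -> {
            try {
                for (int y = rows.take(); y >= 0; y = rows.take()) {
                    int[] msg = new int[width + 1]; // row index, then pixels
                    msg[0] = y;
                    for (int x = 0; x < width; x++) {
                        double cr = -2.0 + 3.0 * x / width;
                        double ci = -1.0 + 2.0 * y / height;
                        msg[x + 1] = escapeTime(cr, ci);
                    }
                    results.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Branch 3: collects finished rows into the image.
        Thread collector = new Thread(() -> {
            try {
                for (int i = 0; i < height; i++) {
                    int[] msg = results.take();
                    image[msg[0]] = Arrays.copyOfRange(msg, 1, msg.length);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        requester.start();
        compute.start();
        collector.start();
        try {
            requester.join();
            compute.join();
            collector.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return image;
    }

    public static void main(String[] args) {
        int[][] image = render(8, 4);
        System.out.println(image.length + " rows of " + image[0].length + " pixels");
    }
}
```

Since the compute branch only sees messages, replacing its body with a call to a GPU or a remote service would leave the other two branches unchanged.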
We just completed an evaluation of Ateji's product, and it does everything it promises… this is a very smart idea
Ateji PX allows quicker and easier Java parallel programming without several of the pain-points of multithreading coming in the way
Ateji PX is a dream for Java™ developers, enabling all kinds of applications to take better advantage of NVIDIA’s multicore processors.
Thank you for this brilliant piece of engineering