Ateji PX for Java - Java on GPU

Ateji PX supports GPU programming via a simple mechanism of annotated parallel branches. Contact us at info@ateji.com for early access to the technology preview program.

Outline

Ateji PX introduces the notion of a parallel branch at the language level, written with the parallel bar "||" operator. In the example below (the standard adder example), four branches run in parallel:

[
  || generate(c1);       // generate 1st stream of values
  || generate(c2);       // generate 2nd stream of values
  || adder(c1, c2, c3);  // add them into 3rd stream
  || display(c3);        // display the results
]


Branches communicate with each other via explicit message passing:

  • c1 ! value sends a value over channel c1
  • c1 ? value reads a value from channel c1

On a multicore machine, Ateji PX will automatically use all available cores.
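
For readers who do not know the "||", "!" and "?" notation, the adder example can be approximated in plain Java. The sketch below is not Ateji PX syntax and does not show how Ateji PX is implemented: it assumes threads stand in for branches and a BlockingQueue stands in for a channel. The END marker, the start values and the names spawn, collect and run are hypothetical, introduced only so the demo terminates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Plain-Java analogue of the four-branch adder; NOT Ateji PX syntax.
class AdderPipeline {
    // Hypothetical end-of-stream marker so that the demo can terminate.
    static final int END = Integer.MIN_VALUE;

    // Assumed helper type: a branch body that may block on a queue.
    interface Branch { void run() throws InterruptedException; }

    // Branch: send 'count' consecutive values (like "c ! value"), then END.
    static void generate(BlockingQueue<Integer> c, int start, int count) throws InterruptedException {
        for (int i = 0; i < count; i++) c.put(start + i);
        c.put(END);
    }

    // Branch: read one value from each input (like "c ? value") and send their sum.
    static void adder(BlockingQueue<Integer> c1, BlockingQueue<Integer> c2,
                      BlockingQueue<Integer> c3) throws InterruptedException {
        while (true) {
            int a = c1.take();
            int b = c2.take();
            if (a == END || b == END) { c3.put(END); return; }
            c3.put(a + b);
        }
    }

    // Branch: stands in for display(c3); collects the sums into a list.
    static void collect(BlockingQueue<Integer> c3, List<Integer> out) throws InterruptedException {
        for (int v = c3.take(); v != END; v = c3.take()) out.add(v);
    }

    // Start one branch on its own thread.
    static Thread spawn(Branch b) {
        Thread t = new Thread(() -> {
            try { b.run(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        t.start();
        return t;
    }

    // Run the four branches concurrently and return the collected sums.
    static List<Integer> run(int count) {
        BlockingQueue<Integer> c1 = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> c2 = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> c3 = new ArrayBlockingQueue<>(1);
        List<Integer> results = new ArrayList<>();
        Thread[] branches = {
            spawn(() -> generate(c1, 0, count)),
            spawn(() -> generate(c2, 100, count)),
            spawn(() -> adder(c1, c2, c3)),
            spawn(() -> collect(c3, results)),
        };
        try {
            for (Thread t : branches) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for branches", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [100, 102, 104]
    }
}
```

The amount of plumbing needed here (thread creation, joins, interruption handling) is what the parallel bar and channel operators express in a few lines.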

 

The key to GPU programming using Ateji PX is to attach location annotations to parallel branches. For example:

[
  || generate(c1);
  || generate(c2);
  || #OpenCL("GPU1") adder(c1, c2, c3);
  || display(c3);
]


The #OpenCL("GPU1") annotation states that the adder branch will run on the first GPU. There are no other changes to the code.

Heterogeneous computing

Since the CPU and GPU code is similar, Ateji PX annotations can distribute branches at runtime on either a GPU or a CPU. It is possible for instance to split a problem over both multiple CPUs and multiple GPUs, making the best use of available hardware.
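
As a rough illustration of splitting one problem across several compute resources, here is a minimal plain-Java sketch. It is not Ateji PX code: worker threads stand in for devices (CPU cores or GPUs), and the class SplitWork, the method squareAll and the even chunking policy are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java illustration of partitioning work; worker threads stand in for devices.
class SplitWork {
    // Squares every element of 'in', giving each worker one contiguous chunk.
    static long[] squareAll(long[] in, int workers) {
        long[] out = new long[in.length];
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (in.length + workers - 1) / workers; // ceiling division
            List<Future<?>> pending = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(in.length, w * chunk);
                final int to = Math.min(in.length, from + chunk);
                // Each task writes only to its own slice, so no locking is needed.
                pending.add(pool.submit(() -> {
                    for (int i = from; i < to; i++) out[i] = in[i] * in[i];
                }));
            }
            for (Future<?> f : pending) f.get(); // wait for every chunk to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return out;
    }

    public static void main(String[] args) {
        long[] result = squareAll(new long[] {1, 2, 3, 4, 5}, 2);
        System.out.println(java.util.Arrays.toString(result)); // prints [1, 4, 9, 16, 25]
    }
}
```

In a heterogeneous setting the chunks would normally be sized according to the relative speed of each device rather than evenly, which this sketch does not attempt.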

Mapping the GPU memory hierarchy to the lexical scope

An essential issue for achieving good performance is the management of memory hierarchies.

A GPU typically has three levels of memory:

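  • private memory, local to a single kernel (a work-item in OpenCL terms)
  • shared memory, common to a group of kernels (a work-group in OpenCL terms, where this level is called local memory)
  • global memory, visible to the whole GPU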

While global memory is accessible from all kernels, it has the highest latency. At the other end, private memory has the lowest latency but cannot be accessed from two different kernels.

 

The following example shows how Ateji PX takes memory hierarchy into account:

[Figure: Mandelbrot set example]

Ateji PX expresses the mapping of variables to the memory hierarchy using lexical scope. In this example:

  • there will be one single copy of x in global memory, accessible from all kernels
  • each workgroup will have its own copy of y, shared among the kernels in this workgroup
  • each kernel will have its own copy of z, private to the kernel

 

The use of nested branches and lexical scope makes it easy to understand the relationship between variables in the source code, their localization in memory and their lifetime. In contrast, other approaches tend to express memory hierarchy using annotations, with strange results such as a variable keeping its value while disappearing from lexical context.
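
The scoping rule can be pictured with a minimal plain-Java sketch. This is not Ateji PX syntax and is not the original Mandelbrot listing: sequential nested loops stand in for nested parallel branches, and the class name, the arithmetic and the group sizes are hypothetical. Only the placement of the declarations of x, y and z is the point.

```java
// Plain-Java illustration of scope nesting; NOT Ateji PX syntax.
// Sequential loops stand in for nested parallel branches.
class ScopeSketch {
    static int[] run(int groups, int itemsPerGroup) {
        int[] out = new int[groups * itemsPerGroup];
        int x = 10;                           // outermost scope: one copy for everything ("global")
        for (int g = 0; g < groups; g++) {
            int y = x + g;                    // group scope: one copy per workgroup ("shared")
            for (int i = 0; i < itemsPerGroup; i++) {
                int z = y * 100 + i;          // innermost scope: one copy per kernel ("private")
                out[g * itemsPerGroup + i] = z;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(2, 2))); // prints [1000, 1001, 1100, 1101]
    }
}
```

Each variable lives exactly as long as the block that declares it, which is the property the lexical-scope mapping relies on.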

Implementation notes

Ateji PX for GPU implements a source-to-source translation from extended Java to OpenCL code.

 

This translation handles only the constructs that can be expressed in OpenCL, so it covers a subset of Java. The constructs that are translated are those that perform well on a GPU, typically:

  • arithmetic and logic expressions over primitive types
  • multi-dimensional arrays
  • for loops
  • static method invocations

Java constructs that are very dynamic in nature or require intensive JVM support are not translated. They include:

  • heap allocation
  • dynamic dispatch
  • disruptions of control flow such as exceptions

Such constructs would not perform well on GPU architectures and are better kept on the CPU anyway.
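
As an illustration of code that stays within the kind of subset described above, here is a sketch of a Mandelbrot escape-time computation in plain Java. It uses only primitive arithmetic, a two-dimensional array, for loops and static methods, and the caller allocates the image so that no heap allocation happens inside the computation. The class and method names are hypothetical, and whether Ateji PX would translate this exact code is an assumption, not something stated by the product documentation.

```java
// Sketch of GPU-friendly Java: primitives, arrays, for loops, static methods only.
class MandelbrotKernel {
    // Number of iterations before the point (cx, cy) escapes, capped at maxIter.
    static int iterations(double cx, double cy, int maxIter) {
        double zx = 0.0, zy = 0.0;
        int n = 0;
        for (; n < maxIter && zx * zx + zy * zy <= 4.0; n++) {
            double t = zx * zx - zy * zy + cx; // real part of z*z + c
            zy = 2.0 * zx * zy + cy;           // imaginary part of z*z + c
            zx = t;
        }
        return n;
    }

    // Fills a caller-allocated image; pixel (row, col) maps to (x0 + col*step, y0 + row*step).
    static void compute(int[][] image, double x0, double y0, double step, int maxIter) {
        for (int row = 0; row < image.length; row++) {
            for (int col = 0; col < image[row].length; col++) {
                image[row][col] = iterations(x0 + col * step, y0 + row * step, maxIter);
            }
        }
    }

    public static void main(String[] args) {
        int[][] image = new int[1][2];       // allocation stays outside the computation
        compute(image, 0.0, 0.0, 2.0, 50);
        System.out.println(image[0][0] + " " + image[0][1]); // prints 50 2
    }
}
```

Every pixel is computed independently of the others, which is why this kind of loop nest maps well onto a large number of GPU threads.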

Performance

Although simple, the Ateji PX programming model exhibits good performance on GPU. For the standard Mandelbrot example, we have benchmarked a 50x speedup over a standard desktop CPU.

 

The Mandelbrot example is structured as three parallel branches:

  • the GPU branch waits for position and zoom factor on a first channel, computes the image using the high parallelism available on the GPU, then sends the results back over a second channel
  • the first CPU branch interacts with the user and sends position and zoom factor
  • the second CPU branch displays the image

 

The structure of the code is as follows:

Chan in = ...;
Chan out = ...;
[
  || for(;;) { zoom = gui.getZoom(); in ! zoom; }
  || #OpenCL("GPU1") for(;;) {
       in ? zoom;
       image = compute(zoom);
       out ! image;
     }
  || for(;;) { out ? image; display(image); }
]


 

The interesting thing about structuring code using parallel branches and message-passing is that the same code can be used on a wide range of hardware architectures and scales well. For example, the GPU branch could also run on the same CPU when a GPU is not available, or run remotely on another server or in the cloud.

 

