Java for HPC (High-Performance Computing)

When presenting Ateji PX to an audience with an HPC or simulation background, I often hear definitive opinions of the kind “We’ll never consider Java, it’s too slow”.

This appears to be more a cultural bias than an objective statement referring to actual benchmarks. To better understand the state of Java for high-performance computing, let us browse through three recent publications:

Java is used happily for huge HPC projects

At the European Space Agency, Java has been chosen for large CPU and data handling needs on the order of 10^21 flops and 10^15 bytes. That’s a lot of zeroes: there is no doubt that we’re talking about high-performance computing here.

“We are happy with the decision made and haven’t (yet) faced any major drawback due to the choice of language” [1].

“HPC developers and users usually want to use Java in their projects” [2]. Indeed, Java has many advantages over traditional HPC languages:

  • faster development
  • higher code reliability
  • portability
  • adaptive run-time optimization

There is also a lesser known but very interesting side-effect. Since the language is cleaner, it is easier for developers to concentrate on performance optimization (rather than, say, chasing memory-allocation bugs):

“In Mare Nostrum the Java version runs about four times faster than the C version [...]. Obviously, an optimisation of the C code should make it much more efficient, to at least the level of the Java code. However, this shows how the same developer did a quicker and better job in Java (a language that, unlike C, he was unfamiliar with)” [1].

Java code is no longer slow [2].

Benchmarks now show that Java is not any slower than C++. It was slower ten years ago, and the disappointing results of the JavaGrande initiative for promoting Java for HPC gave it a bad reputation. But recent JVMs (Java Virtual Machines) do an impressive job of aggressive runtime optimization, adapting to the specific hardware they are running on and dynamically optimizing critical code fragments.

However, performance varies greatly, depending on the JVM used (version and vendor) and the kind of computation performed [1]. One lesson to be remembered is that you should always test and benchmark your application with several recent JVMs from different vendors, as well as different command line arguments (compare -client and -server).

Obviously, there are some caveats.

There are still performance penalties in Java communications: pure Java libraries are not well suited for HPC (“the speed and scalability of the Java ProActive implementation are, as of today still lower than MPI” [3]), and wrapping native libraries with JNI is slow.

The situation has been improving recently with projects such as Java Fast Sockets, Fast-MPJ and MPJ Express, which aim to provide fast message-passing without JNI overhead.

The libraries and tools available to Java HPC users also lack the quality and sophistication of those available in C or Fortran (but this seems to be improving). Obviously, this is a matter of having a large enough community, and the critical mass seems to have been reached.

The decision process

I cannot resist citing the whole part about the decision process that took place at ESA:

FORTRAN was somewhat favoured by the scientific community but was quickly discarded; the type of system to develop would have been unmaintainable, and even not feasible in some cases. For this purpose the choice of an object-oriented approach was deemed advisable. The choice was narrowed to C++ and Java.

The C++ versus Java debate lasted longer. “Orthodox” thinking stated that C++ should be used for High Performance Computing for performance reasons. “Heterodox” thinking suggested that the disadvantage of Java in performance was outweighed by faster development and higher code reliability.

However, when JIT Java VMs were released we did some benchmarks to compare C++ vs Java performances (linear algebra, FFTs, etc.). The results showed that the Java performance had become quite reasonable, even comparable to C++ code (and likely to improve!). Additionally, Java offered 100% portability and I/O was likely to be the main limiting factor rather than raw computation performance.

Java was finally chosen as the development language for DPAC. Since then hundreds of thousands of code lines have been written for the reduction system. We are happy with the decision made and haven’t (yet) faced any major drawback due to the choice of language.

And now comes Ateji PX

So there’s a growing consensus about using Java for HPC applications, in industries such as space, particle physics, bioinformatics, finance.

However, Java does not make parallel programming easy. It relies on libraries that either provide explicit threads (a concept very ill-suited as a programming abstraction, see “The Problem with Threads”) or rely on Runnable-like interfaces for distributed programming, leaving all the splitting and interfacing work to the programmer.

What we’ve tried to provide with Ateji PX is a simple and easy way to do parallel high-performance computing in Java. Basically, the only thing you need to learn is the parallel bar ‘||’. Here is the parallel ‘Hello World’ in Ateji PX:

[

  || System.out.println("Hello");

  || System.out.println("World");

]

This code contains two parallel branches, introduced by the ‘||’ symbol. It will print either
Hello
World
or
World
Hello
depending on which branch gets scheduled first.
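
For comparison, here is roughly what the same two-branch program costs in plain Java with explicit threads. This is only a sketch of the behaviour, not a description of what the Ateji PX compiler generates; the class name ParallelHello and the run() helper are made up for the example, and the branches record their line in a list (instead of printing directly) so that the order can be inspected afterwards.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class ParallelHello {
    // Runs the two branches concurrently, waits for both to finish,
    // and returns the lines in the order in which the branches emitted them.
    static List<String> run() throws InterruptedException {
        List<String> out = new CopyOnWriteArrayList<>();
        Thread hello = new Thread(() -> out.add("Hello"));
        Thread world = new Thread(() -> out.add("World"));
        hello.start();
        world.start();
        hello.join(); // like the closing ']', we wait for every branch
        world.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        // Prints the two words, in whichever order the threads were scheduled.
        run().forEach(System.out::println);
    }
}
```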

Data parallelism is obtained by quantifying parallel branches. For instance, the code below increments all n elements of array in parallel:

[

  || (int i: n) array[i]++;

]
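
If you want to try the same idea without Ateji PX, a rough plain-Java counterpart can be written with a parallel stream. This is a sketch under my own naming (ParallelIncrement, incrementAll), and it says nothing about how Ateji PX schedules the branches internally. It is safe only because each index is touched by exactly one task.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ParallelIncrement {
    // Increments every element; each index is handled by exactly one task,
    // so no two tasks ever write to the same array slot.
    static void incrementAll(int[] array) {
        IntStream.range(0, array.length)
                 .parallel()
                 .forEach(i -> array[i]++);
    }

    public static void main(String[] args) {
        int[] array = {1, 2, 3, 4};
        incrementAll(array);
        System.out.println(Arrays.toString(array));
    }
}
```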

Communication is also part of the language, and is mapped by the compiler to any available communication library, such as sockets or MPI, or via shared memory when possible. Here are two communicating branches: the first one sends a value on a channel, the second one reads the value and prints it:

Chan c = new Chan();

[

  || c ! 123;

  || c ? int value; System.out.println(value);

]
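
To give a feeling for what the two branches above do, here is a plain-Java sketch of the shared-memory case using a SynchronousQueue from java.util.concurrent as the channel. I am assuming rendezvous-style semantics here (the sender waits until the receiver takes the value); the names ChannelSketch and exchange are invented for the example, and the real Ateji PX channel implementation may well differ.

```java
import java.util.concurrent.SynchronousQueue;

class ChannelSketch {
    // Sends 'message' from one thread to another over a hand-off queue
    // and returns the value seen by the receiving branch.
    static int exchange(int message) throws InterruptedException {
        SynchronousQueue<Integer> c = new SynchronousQueue<>();
        int[] received = new int[1];

        Thread sender = new Thread(() -> {
            try {
                c.put(message);            // plays the role of: c ! 123
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread receiver = new Thread(() -> {
            try {
                received[0] = c.take();    // plays the role of: c ? int value
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        sender.start();
        receiver.start();
        sender.join();
        receiver.join();
        return received[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exchange(123));
    }
}
```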

There is much more: have a look at the Ateji PX whitepaper for a detailed survey.

Ateji PX has proven to be efficient, demonstrating a 12.5x speedup on a 16-core server with a single ‘||’.

It has also proven to be easy to learn and use. We ran a 12-month beta testing program that demonstrated that most Java developers, given the Ateji PX documentation, can write, compile and run their first parallel program within a couple of hours.

An interesting aspect is that the source code does not depend on the actual hardware architecture being used, be it shared-memory, cluster, grid, cloud or GPU accelerator. It is the compiler that performs the boring task of mapping syntactic constructs to specific architectures and libraries.

A free evaluation version is available for download here. Have fun!

Disruptive Technology at SC’10

Ateji PX has been selected for presentation at the Disruptive Technologies exhibit, part of the SuperComputing 2010 conference.


“Each year, the SC Conference seeks out new technologies with the potential to disrupt the HPC landscape as we know it. Generally speaking, “disruptive technology” refers to drastic innovations in current practices such that they have the potential to completely transform the high-performance computing field as it currently exists — ultimately overtaking the incumbent technologies or software tools in the marketplace. For SC10, Disruptive Technologies examines new computing architectures and interfaces that will significantly impact the high-performance computing field throughout the next five to 15 years, but have not yet emerged in current systems. The Disruptive Technologies exhibits, located in the SC10 exhibit hall, will showcase technologies ranging from storage, programming, cooling and productivity software through presentations, demonstrations and an exhibit showcase.

Selected technologies for SC10 will be on display during regular exhibit hall hours. Please stop by the booth for more information on the presentations and demonstrations schedule.”

See you in New Orleans, November 13-19.

Explaining parallelism to my mother with the Mandelbrot demo

We have put online an Ateji PX version of the Mandelbrot set demo.

You specify the number of processor cores to be used for the computation using a slider (lower right), ranging from 1 to the number of available cores on your computer. My mom’s PC has two cores:


Look ‘ma, when the slider is on 1, you see only one guy painting. When the slider is on 2, you see two guys painting at the same time:


Now ‘ma is finally able to proudly explain what her son is busy with. We’re actually using the same demo with CTOs and high-ranking managers in large corporations.

For us developers, here’s the code. Since dots are independent of each other, we use a simple for loop:

for (int x:nx, int y:ny){
   compute( x , y );
}

You can see the code used in the demo under the “Source Code” tab. Parallelizing the for loop is simply a matter of inserting parallel bars right after the for keyword:

for ||(int x:nx, int y:ny){
   compute( x , y );
}

The demo also shows some advanced features of loop splitting. By default, the work is split into blocks so as to get one block per processor. However, you can see on the Mandelbrot demo that some blocks take more time to compute than others (black dots take more time). The result is that although the work has been split in parallel across a number of workers, we end up waiting for the one worker who has the most black dots. This is not the most efficient way of leveraging parallel hardware.

In such cases, the solution consists in splitting the work into smaller blocks. You can play with block splitting in the “Advanced Settings” area (lower left). The effect on the source code is to insert the corresponding #BlockSize annotation:

for ||(#BlockSize(30), int x:nx, int y:ny){
   compute( x , y );
}
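
To see why smaller blocks help, here is a plain-Java sketch of the same idea: the columns are cut into blocks of blockSize, and a fixed pool of workers picks up the next block as soon as it is free, so a worker stuck on a "black" block no longer holds up everybody else. Everything here is an assumption made for illustration (the BlockSplitting class, the render method, the escape-time compute function and the coordinate mapping); it is not the demo's source code nor the way Ateji PX implements #BlockSize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BlockSplitting {
    // Hypothetical stand-in for compute(x, y): Mandelbrot escape-time count.
    // Points inside the set ("black dots") run all 1000 iterations.
    static int compute(int x, int y, int nx, int ny) {
        double cr = -2.0 + 3.0 * x / nx, ci = -1.0 + 2.0 * y / ny;
        double zr = 0.0, zi = 0.0;
        int iter = 0;
        while (iter < 1000 && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            iter++;
        }
        return iter;
    }

    // Splits the x range into blocks of 'blockSize' columns and lets
    // 'workers' threads pull blocks from the pool's queue one at a time.
    static int[][] render(int nx, int ny, int blockSize, int workers) throws Exception {
        int[][] image = new int[nx][ny];
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> blocks = new ArrayList<>();
            for (int start = 0; start < nx; start += blockSize) {
                final int from = start, to = Math.min(start + blockSize, nx);
                blocks.add(pool.submit(() -> {
                    for (int x = from; x < to; x++)
                        for (int y = 0; y < ny; y++)
                            image[x][y] = compute(x, y, nx, ny);
                }));
            }
            for (Future<?> block : blocks) block.get(); // wait for every block
        } finally {
            pool.shutdown();
        }
        return image;
    }

    public static void main(String[] args) throws Exception {
        int[][] image = render(300, 200, 30, Runtime.getRuntime().availableProcessors());
        System.out.println(image[200][100]); // a point inside the set
    }
}
```

With one block per worker the slowest block dictates the total time; with many small blocks the expensive regions get spread over all workers, at the price of a little more scheduling overhead per block.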

You can learn more about Ateji PX loop splitting in the language manual, downloadable from Ateji’s web site.

Integrating π (pi) in parallel

A simple way of computing the constant π (pi) consists in measuring the area under a curve. In more algebraic terms, this amounts to integrating y = 4/(1+x*x) between 0 and 1, and in programming terms this means incrementing x in small steps and summing the corresponding y (the smaller the steps, the more accurate the result).
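
For readers who like to see the math spelled out, here is the standard identity behind this, followed by the midpoint rule, which is what sampling x at (i+0.5)*step amounts to. The symbols N (number of steps) and h (step size) are my own notation:

```latex
\int_0^1 \frac{4}{1+x^2}\,dx
  = 4\,\bigl[\arctan x\bigr]_0^1
  = 4\cdot\frac{\pi}{4}
  = \pi,
\qquad
\pi \approx h \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\quad x_i = \bigl(i+\tfrac{1}{2}\bigr)\,h,
\quad h = \frac{1}{N}.
```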

Tim Mattson in his blog entry “Writing Parallel Programs: a multi-language tutorial introduction” explores available tools for coding this algorithm in parallel, namely OpenMP, MPI and Java threads.

Here we will stick to the Java universe, and compare Java sequential and multi-threaded code with their Ateji PX equivalent. Impatient readers may readily jump to the Ateji PX version at the end of the article.

The sequential Java code, inspired by Tim’s sequential C version, is as follows:

  static final int numSteps = 100000;
  static final double step = 1.0/numSteps;

  public static void main(String[] args) {
    double sum = 0.0;
    // Midpoint rule: sample the curve at the middle of each of the numSteps intervals
    for(int i=0; i < numSteps; i++) {
      double x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
    }
    double pi = step * sum;
    System.out.println(pi);
    System.out.println(Math.PI);  // reference value for comparison
  }

Try playing with the value of numSteps and see the effect on precision.

Tim parallelizes this code using threads as follows (slightly edited to make it look more Java-ish):

static int nProcs = Runtime.getRuntime().availableProcessors(); 

static class PIThread extends Thread 
{ 
  final int partNumber; 
  double sum = 0.0; 

  public PIThread(int partNumber) { 
    this.partNumber = partNumber; 
  } 

  public void run() { 
    for (int i = partNumber; i < numSteps; i += nProcs) { 
      double x = (i + 0.5) * step; 
      sum += 4.0 / (1.0 + x * x); 
    } 
  } 
} 

public static void main(String[] args) { 
  PIThread[] part_sums = new PIThread[nProcs]; 
  for(int i = 0; i < nProcs; i++) { 
    (part_sums[i] = new PIThread(i)).start(); 
  } 
  double sum = 0.0; 
  for(int i = 0; i < nProcs; i++) { 
    try { 
      part_sums[i].join(); 
    } catch (InterruptedException e) {
    } 
    sum += part_sums[i].sum; 
  } 
  double pi = step * sum; 
  System.out.println(pi); 
} 

Pretty verbose, isn’t it? The core of the algorithm becomes hidden behind a lot of irrelevant details.

Being verbose also means that it becomes just too easy to overlook potential problems. In this code, the handling of InterruptedException is wrong and may lead to very nasty bugs when put in the context of a larger application. Not to blame Tim: honestly, who understands the precise meaning and usage rules of InterruptedException?
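
For the record, one commonly recommended way out, when you cannot simply let InterruptedException propagate, is to restore the thread's interrupt flag instead of silently swallowing the exception, so that the code further up the call stack can still notice that it was asked to stop. Here is a minimal sketch of that idiom; the class and method names are hypothetical, and whether this is the right policy depends on the surrounding application.

```java
class InterruptHandling {
    // Waits for t to finish. Returns false if the wait was cut short by an
    // interrupt, in which case the interrupt flag of the calling thread is
    // set again so that callers can still observe the request to stop.
    static boolean joinOrRestoreInterrupt(Thread t) {
        try {
            t.join();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // do not swallow the interrupt
            return false;
        }
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> { /* some work */ });
        worker.start();
        if (!joinOrRestoreInterrupt(worker)) {
            System.out.println("interrupted while waiting, partial result must be discarded");
        }
    }
}
```

In the threaded π code above, this would mean not adding part_sums[i].sum to the total when the join did not complete, since that partial sum may still be changing.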

In contrast, let us code the integration of π using Ateji PX, an extension of Java. First of all, the mathematical expression used in the integration is a typical example of a comprehension, for which Ateji PX provides an intuitive syntax. Here is the sequential code:

public static void main(String[] args) {
  double sum = `+ for{ 4.0/(1.0+x*x) | int i : numSteps, double x = (i+0.5)*step };
  double pi = step * sum;
  System.out.println(pi);
}

The second line, computing sum, is very close to the standard big-sigma notation in mathematics. Having this notation available as an extension of Java makes the expression of many mathematical formulas concise and intuitive, almost like what you’ve learned in high school.

It also makes the code closer to the programmer’s intent. In the first sequential version, using a for loop, it takes some thinking before realizing that the code is actually computing a sum. This has a strong impact on code readability and maintenance.

But what’s really interesting is how this code can be parallelized. Simply add a parallel bar (“||”) right after the for keyword, and Ateji PX will perform the computation in parallel using all available cores.

public static void main(String[] args) {
  double sum = `+ for||{ 4.0/(1.0+x*x) | int i : numSteps, double x = (i+0.5)*step };
  double pi = step * sum;
  System.out.println(pi);
}

In the OpenMP community, this is called a parallel reduction. Compare this code to the OpenMP version and the multi-threaded version.
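
If you want to experiment with a parallel reduction on a stock JDK, the closest thing I know of in the standard library is a parallel stream. The sketch below computes the same sum that way; the class name PiReduction and the pi(numSteps) helper are mine, and this is meant as a point of comparison, not as what Ateji PX does under the hood.

```java
import java.util.stream.IntStream;

class PiReduction {
    // Midpoint-rule integration of 4/(1+x*x) over [0,1], with the partial
    // sums computed in parallel and combined by the stream's sum() reduction.
    static double pi(int numSteps) {
        double step = 1.0 / numSteps;
        double sum = IntStream.range(0, numSteps)
                .parallel()
                .mapToDouble(i -> {
                    double x = (i + 0.5) * step;
                    return 4.0 / (1.0 + x * x);
                })
                .sum();
        return step * sum;
    }

    public static void main(String[] args) {
        System.out.println(pi(100000));
        System.out.println(Math.PI);
    }
}
```

Note that a parallel sum may add the terms in a different order from one run to the next, so the last few digits can vary slightly.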

Comprehension expressions in Ateji PX are not limited to summation. They can express aggregate operations such as product, logical or, count and average, but also bulk data manipulation such as SQL-like queries and list or set comprehensions (the set of all … such that …), and can even use user-defined operators.

Ateji PX gems : Non-local exits in parallel branches

This article is the first of a series that will explore all the lesser known gems of Ateji PX.

Non-local exits are all the statements that take the flow of control “away” from the current local scope. In Java, they are

  • return,
  • throw,
  • break,
  • continue

Other than Ateji PX, I do not know of any parallel programming framework that properly handles the combination of parallelism and non-local exits.

Which is a pity, because this combination proves very useful. For one, it makes it possible to parallelize existing code without having to rewrite all the control flow, a long and error-prone operation making code difficult to read.

It also makes it possible to design interesting algorithms specifically making use of this combination.

A good example is speculative parallelism. “Speculative” in this context means starting work before being totally sure that it is needed. This is a way to make the best use of idle cores.

You can use speculative parallelism to put different algorithms in competition, and take the result from the first one that terminates. Here is a try at speculatively sorting an array in Ateji PX:

    [
      || return bubbleSortAlgorithm(array);
      || return insertionSortAlgorithm(array); 
    ]

This code runs the two algorithms in parallel, takes the result from the first algorithm that terminates and returns it as the global result, stopping all remaining branches.

Here the interesting work is done by the return statements enclosed within a parallel block. As expected, a return statement returns from the enclosing method, stopping all existing branches as necessary. Without the return statements, the program would wait until both branches have terminated.

Properly stopping other threads/tasks/processes is one of the trickiest parts of parallel programming. If you’ve ever tried it, you know what I’m talking about. With Ateji PX, you simply add a return (or break, or continue) statement inside a branch.
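
To appreciate how much plumbing this saves, here is a sketch of the same race written against java.util.concurrent. ExecutorService.invokeAny returns the result of a task that completed successfully and cancels the others, but the losing branch only stops if it cooperates by checking its interrupt status. The class name, the two sort helpers and the placement of the interrupt checks are all my own assumptions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeSort {
    static int[] bubbleSort(int[] input) throws InterruptedException {
        int[] a = input.clone(); // each branch works on its own copy
        for (int end = a.length - 1; end > 0; end--) {
            if (Thread.interrupted()) throw new InterruptedException(); // we lost the race
            for (int i = 0; i < end; i++) {
                if (a[i] > a[i + 1]) { int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t; }
            }
        }
        return a;
    }

    static int[] insertionSort(int[] input) throws InterruptedException {
        int[] a = input.clone();
        for (int i = 1; i < a.length; i++) {
            if (Thread.interrupted()) throw new InterruptedException(); // we lost the race
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
        return a;
    }

    // Runs both algorithms concurrently and returns the result of one
    // that finished successfully; the other branch gets interrupted.
    static int[] sortSpeculatively(int[] array) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<int[]>> branches = new ArrayList<>();
            branches.add(() -> bubbleSort(array));
            branches.add(() -> insertionSort(array));
            return pool.invokeAny(branches);
        } finally {
            pool.shutdownNow(); // interrupt whatever is still running
        }
    }

    public static void main(String[] args) throws Exception {
        int[] sorted = sortSpeculatively(new int[] {5, 3, 9, 1, 4});
        System.out.println(Arrays.toString(sorted));
    }
}
```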

Dr Dobbs

For years I have been an enthusiastic reader of Dr. Dobbs Journal, a software magazine where programmers talk to programmers. DDJ has always been for me a reference, providing accurate and timely information with a practitioner’s point of view.

I learned a lot from DDJ about technology itself and the way to apply it. I used to subscribe to the paper edition (probably sent via sail-mail, since it always took about 3 months to reach Paris…), before it became web-only.

This is why I take special pride in having made the headline of Dr. Dobbs Update with Ateji PX in an article by Jon Erickson, DDJ’s editor in chief, titled “Think Parallel, Think Java”.

Easy multi-core programming for all: Ateji PX is launched

I am proud to announce the first public release of Ateji PX, our new language extension and programming environment for multi-core and parallel programming in Java.

Ateji PX went through a one-year technology preview program with selected customers, which confirmed our initial claims about ease of use, ease of learning, and compatibility with existing code and tools.

You can now download the public release and discover the powerful language constructs offered by Ateji PX. Read the documentation and play with the samples for a quick introduction.

Here is the official press release.

Enjoy!

Everything you always wanted to know about Ateji PX

A gentle introduction to Ateji PX is available as a white-paper. It covers all new language constructs, and will definitely give you food for thought if you’ve ever had a look at the existing multi-core programming techniques.

Download it here (pdf, 408kb).

A *language* in the top-10 technologies

In its estimation of the 10 most important emerging technologies for the year 2010, Technology Review mentions a programming language.

Yes, a language.

When we started Ateji, the common wisdom was that language was an irrelevant artifact of the programming process. I’m glad to see that we were on the right track, and that the importance of language is now acknowledged even in the most prestigious magazines.

We’re preparing an offering for cloud programming that will let the same source code run unchanged on desktop PCs, multi-core servers, clusters, or in the cloud, while remaining compatible with existing tools and practice. This takes the form of a language extension, and wouldn’t have been possible without language technology and clever language design.

Robin Milner

Professor Robin Milner, one of the pioneers of computer science, passed away on March 20: http://www.timesonline.co.uk/tol/comment/obituaries/article7081867.ece.

He will be remembered, among many other achievements, as the inventor of pi-calculus, the theoretical foundation behind Ateji Parallel Extensions.

Though I never had a chance to actually meet him, he inspired my career all along, as a researcher and as an entrepreneur.
