Ateji is closed

 

I have the sad task of announcing that Ateji has gone out of business.

What happened? We designed and delivered great products (OptimJ and Ateji PX) that received enthusiastic feedback from users and from the press (see our wall of fame at the end of this message). But we were not able to secure a steady stream of revenue to support the company. The verdict is merciless: no cash, no business.

The Ateji project

The idea behind the founding of Ateji was to create an ecosystem of language extensions targeted at different application domains. Just like what you have today with libraries, APIs or software components, but with much greater expressivity, made possible by powerful language support.

As a simple example: consider writing a constraint using an API:

solver.add(eq(var("x"), add(var("y"), 1)));

or writing the same constraint using a language extension (OptimJ):

constraint { x == y+1; }

Or imagine how you would express the parallel composition

a(); || b();

using a thread library (the answer is in the Ateji PX whitepaper; it requires twenty lines of deeply technical code).
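
For the flavor of it, here is a minimal sketch of what the thread-library version starts from, and it already glosses over results, exceptions and non-local exits (assuming a() and b() are existing void methods):

void parallelAB() throws InterruptedException {
    // run b() in a second thread while the current thread runs a()
    Thread t = new Thread(new Runnable() {
        public void run() { b(); }
    });
    t.start();
    a();
    t.join(); // wait for b() before moving on
}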

APIs have a very, very limited expressivity. Having the right concepts, specific to the problem at hand, directly in the language has a major impact on developer productivity and code quality. This is not only about making the code less verbose: it helps humans maintain the code, and it helps computers analyze, verify or optimize the code at a higher level. Better, faster, cheaper.

This is the reason why so many domain-specific languages (DSLs) have been developed. But stand-alone DSLs have their own set of problems, such as isolation and poor tooling support. By designing language extensions, we have the benefit of DSLs while preserving compatibility with existing code, tools and training.

At Ateji we solved many of the technical challenges, and some were really tough. But the business challenge was even tougher. Just mentioning the term “language” frightens prospects. It is comparatively much easier to get developers to adopt a new library than a new language extension.

Adoption issues

We did anticipate that introducing a new language-based technology and getting it adopted would take a long time. The truth is, it takes much longer than anticipated. Many “new” languages that are coming into fashion now were first introduced 10 or 20 years ago.

There are however examples of companies that successfully built a business on a language. Think of Oracle and SQL. They had a strong business case: suddenly, all software developers became able to access databases in a simple and intuitive way. We thought we had a similar strong business case with Ateji PX and parallel programming: suddenly, all Java developers became able to write parallel code for multicore processors in a simple and intuitive way.

Our biggest mistake has probably been to surf on the popularity of the Java language and on the hype of the “multicore crisis” (all software has to be adapted for the coming generation of hardware) without developing a fine enough understanding of the market. We discovered the hard way that Java developers just don’t want to hear about parallel programming, regardless of how easy we made it. They’d rather use only one core of their expensive 16-core server than write parallel code (we have seen it happen). They long for solutions that hide parallelism. This may change when thousands of cores become the norm; the future will tell.

At the other end of the spectrum, developers that do parallel programming just don’t want to hear about Java, because 10 years ago it used to be slower than C++.

What this means for our users

The Ateji team is going to be dismantled, but we’ll keep answering requests on support@ateji.com as much as possible. We will do this as individuals in our free time, so please be lenient.

All current software downloads are free

We are committed to keeping OptimJ and Ateji PX alive and running. All current versions of Ateji software can now be downloaded for free, no strings attached. There are no restrictions on time or usage; you can even ship them as part of your application. Please be aware that the status of future releases may be different.

The future of the technology

We are actively investigating ways to keep the technology alive, such as a partnership, acquisition or sponsorship. A number of business discussions are under way; I’ll keep you informed on this blog.

We are open to all suggestions, so please do not hesitate to contact me if you can think of other ideas or opportunities to ensure the future of OptimJ and Ateji PX. Supporting the technology requires people, and we’re open to considering job offers that go in this direction.

A tribute to all Ateji stakeholders

Many individuals contributed to the Ateji story: founders, employees, investors, advisors, consultants, interns and early adopters. To all of you, I want to express my thanks for supporting this project.

It turned out not to be a financial success, but I am convinced that the technologies we’ve developed have contributed to the progress of software development. As an example, the clean design of Ateji PX makes it a language of choice for teaching parallel programming, and it has already been adopted by several universities as part of their curriculum.

I hope that the Ateji experience has been an important contribution to the personal development of all team members, and I wish you all farewell and the best of luck for your careers.

Wall of fame

To end on a positive note, I would like to recall here some recent testimonials we have received about our products. If you haven’t done so already, we’d be happy to hear about your experience with OptimJ or Ateji PX.

“Within a day, we got x5 boost on a major back office application”
Pascal Bourcier, Director Risk Department, Natixis Investment Bank France.

“Let me say what a wonderful product you have. This is exactly what I’ve been looking for many years”
Ahmed Riza, Senior Developer, J.P.Morgan

“Ateji PX is a dream for Java developers”
Stephen Jones, Product Line Manager for Developer Tools, NVIDIA

“Ateji PX is a revolutionary technology for parallelization”
Dr. Gourab Nath, Sr. Research Scientist, Amadeus France.

“Thank you for this brilliant piece of engineering”
Ala Shiban, Haifa University, Cancer Research Group, Israel

Press coverage

HPCwire – “Top Feature of the Week”
French Firm Brews Parallel Java Offering

Dr. Dobb’s Update
Think Parallel, Think Java

CRN Magazine – “Intel Labs Director interview”: “We just completed an evaluation of Ateji’s product, and it does everything it promises,” said Intel’s European research director Martin Curley. He calls Ateji’s tool for the parallelization of Java code a “very smart idea”.
Parijs bedrijf biedt uitweg uit multi-core crisis (original, nl) / “Paris company offers a way out of the multi-core crisis” (en)

IT Business Edge “[...] although multicore programming remains a challenge, Java developers at least may be on the verge of a breakthrough.”
Employing Software in the Fight for Energy Efficiency

 


Mapping the GPU memory hierarchy

An interesting side-effect of the Ateji PX approach to GPU programming is that it makes it easy to map the GPU memory hierarchy. This is essential for achieving good performance on GPUs, and more generally on any hardware with non-uniform memory accesses (NUMA).

Remember, Ateji PX expresses parallelism with parallel branches, introduced by the || operator. When nesting parallel branches, each nesting level can be interpreted as being mapped to a different level in the memory hierarchy.

What is important here is that memory hierarchy is expressed in terms of lexical scope. A variable from a different level in the memory hierarchy is accessible if and only if it is visible in the lexical scope.
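
As a sketch, nested branches might look like this, with each lexical level corresponding to one memory level (the quantifiers, sizes and the exact placement of the #OpenCL annotation are illustrative, not taken from the manual):

float[] global = new float[1024];          // outermost scope: global memory
[
  || (#OpenCL, int g : 16) {               // one branch per work-group
       float[] shared = new float[64];     // group scope: shared/local memory
       [
         || (int i : 64) {                 // one branch per work-item
              float priv = global[g*64+i]; // innermost scope: private memory
              shared[i] = priv;
            }
       ]
     }
]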

In contrast, languages such as OpenCL use specific declaration modifiers to place variables in different memory areas. With this approach, you can for instance have a variable declared in the global lexical scope but labeled with the __private modifier. It looks like there is only one variable, while actually each kernel has its own copy. Such modifiers make it very hard to understand the logic of the code.


“Nuclear energy for a bright future”

This picture is now famous all over Japan, but I haven’t seen it yet in western media:

Road sign at the entrance of Futaba [双葉町], a ghost town in the evacuated zone around the Fukushima power plant. It says “Nuclear energy for a bright future” [原子力明るい未来のエネルギー / genshiryoku akarui mirai no enerugī].

More slogans and pictures in this blog post (in Japanese).


Blooming cherry trees [桜 / sakura] in Futaba (from this blog post). Picture taken in spring 2010. Nobody watches them anymore; access is forbidden.

If you wonder what this post has to do with the business of programming tools, well, being an entrepreneur is precisely about shaping the world and deciding which future you want to leave as a legacy to your children – this is not limited to the four walls of your office.


Clarke’s second law

I just discovered Clarke’s second law and couldn’t resist citing it:

The only way of discovering the limits of the possible is to venture a little way past them into the impossible.

Looking back a few years, to when we embarked on designing Ateji PX: was that within the limits of the possible, or did we venture a little way into the impossible?

(from http://en.wikipedia.org/wiki/Clarke's_three_laws)

 


Parallel programming gotchas

When you practice sequential programming, you learn rules of thumb to identify and avoid patterns that lead to disastrous performance. The same is true for parallel programming.

Here are some lessons inspired by the feedback we received from first-time users of Ateji PX. Please add a comment if you think of other common patterns that fit into this category.

Avoid branch counts in the millions

A parallel branch is a very lightweight object, especially when compared to threads. Creating and scheduling a parallel branch incurs very little overhead.

However, when creating a large number of small branches, the overhead may become noticeable, slowing down the whole computation. In practice, “large” starts at somewhere between 1 and 10 million branches, depending on your machine.

Here is an example:

Integer[] data = new Integer[1000000];
[
    || (int i : data.length) data[i] = i;
]

This code creates one million branches, each of them just setting one array element. This would be fine on a machine with millions of cores, but we still need to wait a couple of years before such hardware becomes available. On my 4-core desktop, the parallel code is slower than its sequential equivalent:

Integer[] data = new Integer[1000000];
for(int i : data.length) data[i] = i;

Now that we’ve identified the problem, how do we correct it? For the experts, the answer is to use a #BlockCount indication:

Integer[] data = new Integer[1000000];
[
    || (#BlockCount(4), int i : data.length) data[i] = i;
]

This will create only 4 parallel branches, each of them iterating sequentially over a block of 250000 elements. You can learn more about indications in the Ateji PX language manual.

For the non-experts, the simple answer is to use the parallel-for notation, which takes care of block size for you. The parallel for is a convenient shorthand that creates as many blocks as there are available cores, so that you just don’t have to think about blocks:

Integer[] data = new Integer[1000000];
for|| (int i : data.length) data[i] = i;

Now the overhead has become unnoticeable: this code is roughly 4 times faster on my 4 cores than the sequential code.

Parallelize outer loops first

Many data-parallel algorithms use nested for-loops:

for(int i : 1000000) {
  for(int j : 1000000) {
    doSomething(i,j);
  }
}

Let us parallelize the outer loop:

for||(int i : 1000000) {
  for(int j : 1000000) {
    doSomething(i,j);
  }
}

The outer parallel for creates as many branches as there are available cores. Each of these branches iterates sequentially over 250000 values of i, each time executing the inner loop. No overhead is noticeable.

Let us parallelize the inner loop:

for(int i : 1000000) {
  for||(int j : 1000000) {
    doSomething(i,j);
  }
}

Here each of the 1000000 iterations of the outer loop executes the inner parallel for, which creates parallel branches 1000000 times. Branch creation time becomes noticeable. If the amount of computation in doSomething() is small, this parallel program may be slower than its sequential version.

When the number of outer-loop iterations is commensurate with the number of cores, it makes sense to parallelize both the outer and the inner loop, as in the sketch below.
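
For instance (a sketch reusing doSomething from above): on an 8-core machine, an outer loop with only 2 iterations cannot keep all cores busy by itself, so both levels are worth parallelizing:

for||(int i : 2) {          // outer: only 2 iterations, fewer than cores
  for||(int j : 1000000) {  // inner: parallelized too; the inner parallel
    doSomething(i,j);       // for is only created twice, so its overhead
  }                         // stays negligible
}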


Why multithreading is difficult (2)

If you’ve ever tried multithreaded programming with any of the Java concurrency libraries, you’ve noticed that you still need to take care about a lot of boring technical details.

They seem to serve no other purpose than making code incredibly verbose and unmaintainable.

Whether you use raw threads or let a thread pool schedule tasks, the basic interface to your job (the computation you’re actually interested in) is a method without parameters, without results and without exceptions. Here is the (in)famous Runnable interface:

public interface Runnable {
  void run();
}

The Callable interface introduced in Java 5 is a slight improvement: code can now return a result (and throw a generic Exception):

public interface Callable<V> {
  V call() throws Exception;
}

Now what about:

  • passing parameters
  • returning results (in the case of a Runnable)
  • throwing checked exceptions
  • handling possibly multiple exceptions
  • performing non-local exits: return, break, continue
  • doing recursion

All these features form the basis of any modern language, Java obviously included. Put differently, without these features you’re really coding at the assembly-language level, with a screwdriver and a soldering iron as your only development tools.

But astoundingly, as soon as you try parallel programming in Java, using the standard concurrency library or any other library for that matter, you simply lose all these features at once.

Just think about it: how much of your existing code base takes no parameters, returns no results, throws no exceptions, never breaks out of a loop and contains no recursion?

You clearly want the compiler to take care of all this. This is one of the first differences you’ll notice between Java-style and Ateji-style parallel programming. Below are bare-bones examples of all these features (remember that || introduces a parallel branch):

Passing parameters and returning results:

[ || x = X(x1, ..., xn); || y = Y(y1, ..., yn); ]
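
For comparison, here is a sketch of what roughly the same composition costs with java.util.concurrent alone (X and Y are hypothetical computations returning an int):

import java.util.concurrent.*;

int parallelPair() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<Integer> fx = pool.submit(new Callable<Integer>() {
        public Integer call() { return X(); }
    });
    Future<Integer> fy = pool.submit(new Callable<Integer>() {
        public Integer call() { return Y(); }
    });
    try {
        // get() blocks; exceptions from X() or Y() come back
        // wrapped in an ExecutionException
        return fx.get() + fy.get();
    } finally {
        pool.shutdown();
    }
}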

Throwing checked exceptions (the compiler will complain if the throws clause is incorrect):

void m() throws IOException, URISyntaxException
{
  [
    ||  System.in.read(); // throws IOException
    ||  uri = new URI(string); // throws URISyntaxException
  ]
}

Performing non-local exits:

void m()
{
  boolean cond = true;
  for(;;) {
    [ 
      || if(cond) break; // break out of the for loop enclosing this parallel block
      || if(!cond) return; // return from the method
    ]
  }
}

Recurse:

int fib(int n) {
  if(n <= 1) return 1;
  int fib1, fib2;
  // recursively create parallel branches
  [
    || fib1 = fib(n-1);
    || fib2 = fib(n-2);
  ]
  return fib1 + fib2;
}

How would you handle all these cases using threads or thread pools? And we haven’t even mentioned partitioning the work, or tuning for performance.

If you love semantics, I’ll explain in a future post the precise behavior of parallel blocks when multiple exceptions are thrown or non-local exits are encountered.


Ateji PX Free Edition available

Here at Ateji, we aim to offer innovative programming languages and tools. Running such a business is a delicate balancing act: developing a good product takes time, and when the product is finally out you need to implement the right business model. An important choice is pricing.

Offering free software helps build a community behind the product, but does not help finance the company. This is why innovative solutions are typically launched in two steps. We have now reached the first step, where an initial stream of corporate customers guarantees recurring revenue. Now that we’ve made sure we can feed our families, we’re ready to offer a free personal edition of the software.

I am proud to announce the availability of the Free Edition of Ateji PX, the easiest Java-based parallel programming tool on the market. The Free Edition handles up to 4 cores, typically the maximum found in personal computers. Use it for your existing applications and future developments, and enjoy multicore programming!

Corporate servers now commonly have 16 cores or more for computationally intensive applications. Corporate users can try the core-unlimited evaluation version and purchase a license when the tests are promising. Don’t leave your powerful servers idle! Depending on the application, 12x speedups on 16-core servers are not uncommon.


Java on GPU with Ateji PX

An explicit parallel composition operator (Ateji PX’s parallel bar) is pretty useful for hybrid CPU-GPU programming.

All the frameworks I know of embed GPU code into some sort of method call, sometimes indirectly. In any case, you end up writing something like:

blablabla(); // runs on CPU

// this code must run on the GPU
output = doSomethingOnGPU(input);

blablabla(); // runs on CPU

The bottom line is that input and output data, either implicit or explicit, are passed from CPU to GPU before running the code, and then from GPU to CPU when computation terminates. In other words, the GPU spends a lot of time idle waiting for I/O.

The key to GPU performance is to overlap data transfer and computation, so that input data is already present when a new computation starts. When coding in OpenCL or CUDA, this is done via asynchronous calls:

// start asynchronous transfer
// the method call returns immediately
cudaMemcpyAsync(handle, …);

// perform computations that do not depend
// on the result of the transfer

// wait until the transfer is finished,
// using 'handle' as a reference to the async transfer
kernel<<<grid, block>>>(handle);

// now we’re sure the transfer is complete,
// we can perform computations that do depend
// on its result

In this example, the intent is really to perform computation and communication in parallel. But since there is no notion of parallel composition in OpenCL (or C, or any mainstream language for that matter), it is not possible to express this intent directly. Instead, you have to resort to this complex and not very intuitive mechanism of asynchronous calls. This is pretty low-level, and you’d be happy if the compiler transformed your intent into these low-level calls. That’s precisely what Ateji PX does.

So what’s the intent? Expressing that the CPU and the GPU must execute concurrently. In Ateji PX, this is done with two parallel branches, one of them bearing the #GPU annotation. A channel is visible from both branches and will be used for communication.

Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // code to be run on CPU
     ...
  || (#OpenCL) // code to be run on GPU
     ...
]

Note that running code on the CPU or on the GPU is only a matter of modifying the annotations (this can also be determined at run-time). No other change to the source code is required.

The GPU repeatedly waits for data on the input channel, performs the computation and sends back a result:

    || // code to be run on GPU
       for(;;) {
         input ? data;                // receive data from the CPU on the input channel
         result = computation(data);
         output ! result;             // send the result on the output channel
       }

The CPU repeatedly sends input data and waits for results:

    || // code to be run on CPU
       for(;;) {
         input ! data;                // send data to the GPU
         output ? result;             // wait for its result
         … do something with the result …
       }

Computation and communication overlap because the communication channels have been declared as asynchronous:

Chan input = new AsyncChan();
Chan output = new AsyncChan();

That’s all!
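
Put together, the whole pattern reads as follows (a sketch; produce(), consume(), computation() and the termination conditions are placeholders):

Chan input = new AsyncChan();
Chan output = new AsyncChan();
[
  || // CPU branch: feed the GPU and consume its results
     for(;;) {
       input ! produce();
       output ? result;
       consume(result);
     }
  || (#OpenCL) // GPU branch: receive, compute, send back
     for(;;) {
       input ? data;
       output ! computation(data);
     }
]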

The intent is clearly expressed in the source code: we have two parallel branches that communicate using channels, and a single annotation states that a branch should run on the GPU. No need to manage asynchronous calls, no need to use two different languages to code one application; the Ateji PX compiler does all this for you.

Code is understandable and maintainable, and can run on multicore CPUs by simply changing the #GPU annotation. You can for instance debug the application on the CPU before deploying it on the GPU.

We prototyped a parallel version of the Mandelbrot set demo based on this idea, and achieved a 60x speedup on a standard desktop PC. A pretty impressive speedup for just adding a #GPU annotation in the source code, isn’t it?


Why multithreading is difficult

It is common wisdom that programming with Java threads is difficult (and even more so in other languages). But have you ever tried to figure out precisely why this is so?

The Problem with Threads (PDF) is a nicely written research report that tries to answer this question. In short, a thread is a notion that makes sense at the hardware level, when you think in terms of registers and instruction pointers. But this hardware-level notion should never have made its way up to the source-code level, just as you wouldn’t mix Java source code and assembly-language instructions.

Let us try to go deeper: why exactly is the notion of thread ill-suited as a programming abstraction? The core of the problem can be expressed in a single word: composability.

Consider for example arithmetic expressions. You can take two numbers and compose them to form an addition:

1+2

The result of the addition is itself a number, so you can compose it again with yet another number:

(1+2)+3 or 3+(1+2)

When there’s no ambiguity, you can remove the parentheses altogether:

1+2+3

In other words, arithmetic expressions can be composed using + as a composition operator (there are many others).

The exact same process works for Java statements using sequential composition:

{ a; b; }
{ a; b; }; c;
c; { a; b; }
a; b; c;


What about threads? Can you take two threads and combine them to obtain a thread, so that it can itself be combined with yet another thread? No: threads do not compose.

So why is composability important?

Composability provides a method for building larger pieces out of smaller ones: just compose. This is how you can build large programs by building up on smaller ones.

You want to make sure that the program behaves as expected? First make sure that each of its pieces behaves as expected. This is what unit testing is about.

Composability also provides a method for understanding what is going on: just decompose. If a divide-by-zero exception is thrown by { a; b; }, then it must have been thrown either from a or from b. Decomposition is our standard and intuitive way to debug a program.

Now what about a multithreaded program? Imagine a program with two threads that throws an exception. Can you split the program into two pieces such that the exception comes from either one or the other? No way!

This is precisely why multithreading is difficult. Multithreaded programs are impossible to test and debug, and making sure that they work properly requires a thorough analysis of the whole code, including pieces you didn’t even know existed.

Parallel programming itself does not need to be difficult, the problem is that mainstream programming languages do not provide a parallel composition operator as they do for sequential composition. Ateji’s technological offering consists precisely in retrofitting languages with such an operator.
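
In Ateji PX notation, the parallel bar composes the same way + does (a sketch):

[ a(); || b(); ]               // parallel composition of a and b
[ [ a(); || b(); ] || c(); ]   // the result is a statement, so it composes again
[ a(); || b(); || c(); ]       // and, as with 1+2+3, the nesting can be flattened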

Let me stress this once again: multithreaded programming is difficult because of the lack of composability, not because of parallelism, non-determinism, or the other usual suspects.

This is true of Java threads, but also of all structures built upon threads without providing their own composition mechanism, such as tasks (the Java Executor framework) or concurrent collections (the Java concurrency framework).


Back from SuperComputing 2010


The New Orleans skyline from Garden District

This view from the hotel at 6am was about the only chance we had to see the sun. SC10 was a very busy week for the Ateji team, with a lot of business meetings (all the big guys were there) and a wealth of visitors to our booth.

From left to right: Claude, Maxence and Patrick at the Ateji booth. Most visitors were curious about this new approach to parallel programming and spent a long time chatting and asking questions. We even had a handful of teachers interested in using Ateji PX as a tool for teaching parallel programming, as it provides a general and intuitive model of parallelism on top of Java.

Ateji was part of the Disruptive Technologies exhibit. You get there by submitting your technology and vision and being selected by a panel of experts from the program committee. And yes, we got a free booth! Many thanks to John and Elizabeth for making this exhibit a success.
Parallel lines named Desire

Being recognized as a disruptive technology means that Ateji PX has the potential to deeply change the landscape of HPC and parallel programming in the coming years. We all hope for its success as an industry standard. The focus on disruptive technologies was emphasized by having Clayton Christensen, the author of “The Innovator’s Dilemma” and “The Innovator’s Solution”, as keynote speaker. Clayton, if you happen to read this, I’d be happy to have a chat with you.

We handed out these cute USB keys, labeled “Ateji – unlock Java performance”, that contain whitepapers, documentation and an evaluation version of the software. They were a big hit at the SC10 booth, but also with the TSA: I carried a few hundred of them in my suitcase, and it was opened and inspected every single time we boarded a plane! I now have quite a collection of their inspection flyers.
