When you practice sequential programming, you learn rules of thumb to identify and avoid patterns that lead to disastrous performance. The same is true for parallel programming.
Here are some lessons inspired by the feedback we received from first-time users of Ateji PX. Please add a comment if you think of other common patterns that fit into this category.
Avoid branch count in the millions
A parallel branch is a very lightweight object, especially when compared to threads. Creating and scheduling a parallel branch incurs very little overhead.
However, when creating a large number of small branches, the overhead may become noticeable, slowing down the whole computation. In practice, “large” starts at somewhere between 1 and 10 million branches, depending on your machine.
Here is an example:
Integer[] data = new Integer[1000000]; [ || (int i : data.length) data[i] = i; ]
This code creates one million branches, each one of them just incrementing an array element. This would be fine on a machine with millions of cores, but we still need to wait a couple of years before such hardware becomes available. On my 4-core desktop, the parallel code is slower than its sequential equivalent:
Integer[] data = new Integer[1000000]; for(int i : data.length) data[i] = i;
Now that we’ve identified the problem, how do we correct it? For the experts, the answer is to use a #BlockCount indication:
Integer[] data = new Integer[1000000]; [ || (#BlockCount(4), int i : data.length) data[i] = i; ]
This will create only 4 parallel branches, each of them iterating sequentially over a block of 250000 elements. You can learn more about indications in the Ateji PX language manual.
For the non-experts, the simple answer is to use the parallel for notation that takes care of block size for you. The parallel for is a convenient shorthand notation that will create as many blocks as there are available cores, so that you just don’t have to think about blocks:
Integer[] data = new Integer[1000000]; for|| (int i : data.length) data[i] = i;
Now the overhead has become unnoticeable: this code is roughly 4 times faster on my 4 cores than the sequential code.
Parallelize outer loops first
Many data-parallel algorithms use embedded for-loops:
for(int i : 1000000) { for(int j : 1000000) { doSomething(i,j); } }
Let us parallelize the outer loop:
for||(int i : 1000000) { for(int j : 1000000) { doSomething(i,j); } }
The outer parallel for creates as many branches as there are available cores. Each of these branches iterates sequentially over 250000 values of i, each time executing the inner loop. No overhead is noticeable.
Let us parallelize the inner loop:
for(int i : 1000000) { for||(int j : 1000000) { doSomething(i,j); } }
Here each of the 1000000 iterations of the outer loop executes the inner parallel for, which creates parallel branches 1000000 times. Branch creation time becomes noticeable. If the amount of computation in doSomething() is small, this parallel program may be slower than its sequential version.
In the case where the number of loop iterations is commensurate with the number of cores, it makes sense to parallelize both the outer and the inner loop.
