Parallel programming may not be so daunting

Mar 24, 2014 by Larry Hardesty

Computer chips have stopped getting faster: The regular performance improvements we've come to expect are now the result of chipmakers' adding more cores, or processing units, to their chips, rather than increasing their clock speed.

In theory, doubling the number of cores doubles the chip's computational capacity, but splitting up computations so that they run efficiently in parallel isn't easy. On the other hand, say a trio of computer scientists from MIT, Israel's Technion, and Microsoft Research, neither is it as hard as had been feared.

Commercial software developers writing programs for multicore chips frequently use so-called "lock-free" parallel algorithms, which are relatively easy to generate from standard sequential code. In fact, in many cases the conversion can be done automatically.

Yet lock-free algorithms don't come with very satisfying theoretical guarantees: All they promise is that at least one core will make progress on its computational task in a fixed span of time. But if they don't exceed that standard, they squander all the additional computational power that multiple cores provide.
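
To make the distinction concrete, here is a minimal sketch (ours, not the researchers') of the classic lock-free pattern: a counter incremented through a compare-and-swap (CAS) retry loop. The class and method names are illustrative.

```cpp
#include <atomic>

// A lock-free counter built on a compare-and-swap (CAS) retry loop.
// If several threads race, at least one CAS succeeds per round, so the
// system as a whole always makes progress. But any individual thread
// can, in principle, lose the race indefinitely. That is exactly the
// lock-free (but not wait-free) guarantee described above.
class LockFreeCounter {
    std::atomic<long> value{0};
public:
    long increment() {
        long seen = value.load();
        // A failed CAS means another thread's increment just landed;
        // 'seen' is reloaded with the current value and we retry.
        while (!value.compare_exchange_weak(seen, seen + 1)) {
        }
        return seen + 1;
    }
};
```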

In recent years, theoretical computer scientists have demonstrated ingenious alternatives called "wait-free" algorithms, which guarantee that all cores will make progress in a fixed span of time. But deriving them from sequential code is extremely complicated, and commercial developers have largely neglected them.
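
For contrast, here is a sketch of a wait-free counter built on an atomic fetch-and-add; the names are again illustrative. A counter is the easy case, which is the point: wait-free versions of richer data structures are far harder to derive from sequential code.

```cpp
#include <atomic>

// A wait-free counter: on hardware with a native fetch-and-add, every
// call completes in a bounded number of steps, so no thread can starve.
// Wait-free designs for richer structures (queues, sets) are much harder
// to construct, which is why developers mostly stick with lock-free ones.
class WaitFreeCounter {
    std::atomic<long> value{0};
public:
    long increment() {
        return value.fetch_add(1) + 1;  // one atomic step, never retries
    }
};
```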

In a paper to be presented at the Association for Computing Machinery's Annual Symposium on the Theory of Computing in May, Nir Shavit, a professor in MIT's Department of Electrical Engineering and Computer Science; his former student Dan Alistarh, who's now at Microsoft Research; and Keren Censor-Hillel of the Technion demonstrate a new analytic technique suggesting that, in a wide range of real-world cases, lock-free algorithms actually give wait-free performance.

"In practice, programmers program as if everything is wait-free," Shavit says. "This is a kind of mystery. What we are exposing in the paper is this little-talked-about intuition that programmers have about how [chip] schedulers work, that they are actually benevolent."

The researchers' key insight was that the chip's performance as a whole could be characterized more simply than the performance of the individual cores. That's because the allocation of different "threads," or chunks of code executed in parallel, is symmetric. "It doesn't matter whether thread 1 is in state A and thread 2 is in state B or if you just swap the states around," says Alistarh, who contributed to the work while at MIT. "What we noticed is that by coalescing symmetric states, you can simplify this a lot."

In a real chip, the allocation of threads to cores is "a complex interplay of latencies and scheduling policies," Alistarh says. In practice, however, the decisions arrived at through that complex interplay end up looking a lot like randomness. So the researchers modeled the scheduling of threads as a process that has at least a little randomness in it: At any time, there's some probability that a new thread will be initiated on any given core.
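
As a toy illustration of that model (our own simulation sketch, not the paper's analysis), one can drive a CAS retry loop with a uniformly random scheduler and count how often each thread completes an operation. All parameters below are made up for demonstration.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Toy model: a uniformly random scheduler drives n threads, each running
// a CAS retry loop (read, then attempt CAS) on one shared counter.
int main() {
    const int n = 8;               // number of threads
    const long ticks = 1'000'000;  // scheduler steps to simulate

    long counter = 0;                     // the shared memory location
    std::vector<long> seen(n, 0);         // value each thread last read
    std::vector<bool> mid(n, false);      // true = read done, CAS pending
    std::vector<long> completed(n, 0);    // finished increments per thread

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, n - 1);

    for (long t = 0; t < ticks; ++t) {
        int i = pick(rng);                // scheduler chooses a thread
        if (!mid[i]) {                    // step 1: read the counter
            seen[i] = counter;
            mid[i] = true;
        } else {                          // step 2: attempt the CAS
            if (seen[i] == counter) {     // unchanged since our read: success
                ++counter;
                ++completed[i];
            }                             // else: lost the race, start over
            mid[i] = false;
        }
    }
    for (int i = 0; i < n; ++i)
        std::printf("thread %d completed %ld ops\n", i, completed[i]);
}
```

In this toy model every thread ends up with a comparable share of completed operations; none starves. That is the wait-free flavor of behavior the paper's analysis formalizes.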

The researchers found that even with a random scheduler, a wide range of lock-free algorithms offered performance guarantees that were as good as those offered by wait-free algorithms.

That analysis held, no matter how the probability of thread assignment varied from core to core. But the researchers also performed a more specific analysis, asking what would happen when multiple cores were trying to write data to the same location in memory and one of them kept getting there ahead of the others. That's the situation that results in a lock-free algorithm's worst performance, when only one core is making progress.

For that case, they considered a particular set of probabilities, in which every core had the same chance of being assigned a thread at any given time. "This is kind of a worst-case random scheduler," Alistarh says. Even then, however, the number of cores that made progress never dipped below the square root of the number of cores assigned threads, which is still better than the minimum guarantee of lock-free algorithms.
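
(To put a number on it: with 64 cores contending for the same memory location, that bound guarantees at least √64 = 8 of them are making progress at any time, versus the single core promised by the classical lock-free guarantee.)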

More information: "Are Lock-Free Concurrent Algorithms Practically Wait-Free?" arxiv.org/pdf/1311.3200v2

User comments

Eikka
5 / 5 (1) Mar 24, 2014
Parallel computing has some serious limitations.

See for example Amdahl's law, which assumes there is no cost in communication between the processing cores, and Gunther's Universal Scalability Law, which accounts for that cost.

http://en.wikiped...lability

http://blogs.msdn...law.aspx

For many practical computation tasks, the optimum number of processing units is surprisingly small; only a handful of cores can work on the same problem without getting in each other's way (see the numeric sketch below).
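
To make that concrete, here is a small illustrative calculation comparing the two laws. The parameter values are invented for demonstration, not measurements of any real system.

```cpp
#include <cstdio>

// Amdahl's law: speedup on n cores when a fraction p of the work
// can be parallelized (communication assumed free).
double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

// Gunther's Universal Scalability Law: sigma models contention,
// kappa models coherency delay; kappa > 0 makes throughput peak
// and then fall as cores are added.
double usl(double sigma, double kappa, int n) {
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1.0));
}

int main() {
    // Illustrative parameters only.
    const double p = 0.95, sigma = 0.05, kappa = 0.002;
    for (int n : {1, 2, 4, 8, 16, 32, 64})
        std::printf("n=%2d  Amdahl=%5.2f  USL=%5.2f\n",
                    n, amdahl(p, n), usl(sigma, kappa, n));
}
```

With these invented parameters the USL curve peaks around 22 cores and then declines, while Amdahl's curve merely flattens; that peak is the "optimum number of processing units" in question.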

taka
3 / 5 (2) Mar 24, 2014
Making a big number of processors access one shared memory is not scalable; eventually the computer collapses under the weight of its memory access lines and memory access overhead. What is scalable is to make each core a full computer with its own memory and have only neighbors communicate with each other. That creates a completely different computation model from the von Neumann one: data flows through a network of computers and gets transformed in the process. This is the dataflow model. It is actually easy to handle, but it seems alien because people are not used to it.
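
For what it's worth, here is a tiny sketch in that style: stages that own their data and talk only to their neighbors through channels. The Channel class and the stage layout are illustrative, not any standard framework.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// A minimal blocking channel: the only link between neighboring stages.
template <typename T>
class Channel {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(v); }
        cv.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        T v = q.front(); q.pop();
        return v;
    }
};

int main() {
    Channel<int> a, b;                 // links between neighboring stages
    const int kDone = -1;              // end-of-stream marker

    std::thread stage1([&] {           // produce: 1..5, then done
        for (int i = 1; i <= 5; ++i) a.send(i);
        a.send(kDone);
    });
    std::thread stage2([&] {           // transform: square each value
        for (int v; (v = a.receive()) != kDone; ) b.send(v * v);
        b.send(kDone);
    });
    std::thread stage3([&] {           // consume: print results
        for (int v; (v = b.receive()) != kDone; ) std::printf("%d\n", v);
    });

    stage1.join(); stage2.join(); stage3.join();
}
```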
antialias_physorg
5 / 5 (2) Mar 24, 2014
Programmers are always a bit hesitant to use automatic parallelization unless the benefits are clear. You never know if the speedup will be worth it (in some cases the context switching and overhead of inter-process communication actually slow your program down). And it's a bitch to debug if something goes wrong, which can eat up days and days.

When programming you should be very aware whether parallelization is warranted or not. Don't just throw it in there as a matter of course. You'll be sorry if you do.
Eikka
3.7 / 5 (3) Mar 24, 2014
"This is just the consequence of the fact that the cores in multicore processors are still communicating in the good old von Neumann/Turing way, i.e. sequentially."

Did you actually read Gunther's scalability law?

It's architecture-independent. Changing the architecture simply lessens the effect of communication overhead, but does not and cannot completely eliminate it to make anything "perfectly scalable".

That means there's always an optimum number of processing units per task, and adding more will actually decrease the performance.

"What is scalable is to make each core a full computer with its own memory and have only neighbors communicate with each other."

That's the GRID model, tried and prototyped in the '80s. It didn't work as well as hoped.
Eikka
3 / 5 (4) Mar 24, 2014
"Each raster core has its piece of graphic raster memory (buffer) assigned and allocated. Therefore raster GPUs are scalable to a high degree."

There is no such thing as a "raster GPU". Video cards are an example of SIMD architectures, developed in the '70s with the early supercomputers. They're called vector CPUs, or in the case of GPUs, shader units, and they only work efficiently when the data transformations they perform are completely independent. In other words, the piece of data they compute doesn't depend on anything but itself, so there's no need to communicate between the computing units.

Are you absolutely sure you know what you're talking about?
scenage
not rated yet Mar 24, 2014
"What is scalable is to make each core a full computer with its own memory and have only neighbors communicate with each other. ... This is the dataflow model."

When I read this I instantly thought of distributed computing, but after a bit more thought I realised this was more at the network level. With parallel computing, are we saying this all happens on the same network? Or do the processors all need to be physically in the same box?

I had a bit of a read on Wikipedia, and all it says is that there is no clear distinction between them.

http://en.wikiped...omputing
taka
not rated yet Mar 25, 2014
It does not matter how you pack your cores. If you want to be scalable, you must keep communication local, that's all. Memory access is also communication. Of course that does not work in all cases, but it allows using pipeline parallelism and genuinely parallel computation, and I believe it is not physically possible to do better than that. The reason this did not work out in the '80s was a programming problem: there is no way to automatically and efficiently convert sequential, memory-access-oriented programs for that architecture. But it is actually simple to design algorithms for that architecture from the start. It is similar to a flow diagram, except that where sequential programming has only if-type branches and joins, here parallel forks and joins also exist. The parallel join is where errors become problematic, but the solution is also simple: you are allowed to join only what you initially sent out for parallel processing. Computation must be like waves traveling through a medium.
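
A minimal sketch of that structured fork/join discipline, using std::async (the function name and threshold are illustrative): each task joins exactly the work it forked, nothing else.

```cpp
#include <cstdio>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Structured fork/join: a task may only join work it forked itself,
// so the parallelism nests cleanly, like the "waves" described above.
long sum(const std::vector<long>& v, size_t lo, size_t hi) {
    if (hi - lo < 25000)                      // small enough: go sequential
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, sum, std::cref(v), lo, mid);
    long right = sum(v, mid, hi);             // recurse on the other half
    return left.get() + right;                // join exactly what we forked
}

int main() {
    std::vector<long> v(100000, 1);
    std::printf("%ld\n", sum(v, 0, v.size()));  // prints 100000
}
```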
antialias_physorg
not rated yet Mar 29, 2014
"It does not matter how you pack your cores. If you want to be scalable you must keep communication local."

It's not quite that simple. If you have a lot of long-running processes that can be run in parallel (e.g. sophisticated signal/image analysis algorithms that return only small bits of data like "found/not found"), then having all your communication local doesn't matter so much.
If you're dependent on frequent context switching or a lot of short methods, then local communication is paramount.

So it's always important to have a good idea where your program will actually be spending its time and how much data you'll be sending over the lines/between cores before you decide on a layout.
Anda
not rated yet Mar 29, 2014
Anyway, I just want to say that it's not really complicated, just more costly in hours of programming, so commercial businesses avoid reaching for "state-of-the-art" software in order to increase net revenue. As simple as that...
Ducklet
not rated yet Mar 29, 2014
The main problem is that programming is about patterns, not algorithms. Commercial software has very little concurrent raw computation that's highly performance-sensitive. Implementing SS7 over IP, for instance, to scale effectively and efficiently across modern NUMA architectures isn't an algorithmic problem; it's about design patterns.
orti
not rated yet Mar 30, 2014
I don't pretend to understand it, but it sounds like real research with a real payback. Hope it proves itself.