Language barrier: To take advantage of multicore chips, programmers will need fundamentally new software

March 2, 2011 By Larry Hardesty
A snippet of code written in the Cilk language, showing the 'spawn' and 'sync' commands. Credit: Christine Daniloff

For decades, computer scientists tried to develop software that could automatically turn a conventional computer program -- a long sequence of instructions intended to be executed in order -- into a parallel program -- multiple sets of instructions that can be executed at the same time. Now, most agree that that was a forlorn hope: Code that can be parallelized is too hard to recognize, and the means for parallelizing it are too diverse and context-dependent. "If you want to get parallel performance, you have to start writing parallel code," says MIT computer-science professor Saman Amarasinghe. And MIT researchers are investigating a host of techniques to make writing parallel code easier.

One of the most prominent is a software development system created by computer-science professor Charles Leiserson and his Supertech Research Group. Initially, the system used the programming language C — hence its name, Cilk. Cilk, Leiserson says, adds three commands to C: “spawn," “sync,” and a variation of the standard command “for.” If a programmer has identified a section of a program that can be executed in parallel — if, say, the same operation has to be performed on a lot of different data — he or she simply inserts the command “spawn” before it. When the program is running, the Cilk system automatically allocates the spawned computation as many cores as are free to handle it. If the results of the spawned computations need to be aggregated before the program moves on to the next instruction, the programmer simply inserts the command “sync.”

The reason Leiserson’s group could get away with such minimal alteration of C is the “runtime” system that underlies programs written in Cilk. A runtime is an extra layer of software between a program and a computer’s operating system, which allows the same program to run on much different machines; the most familiar example is probably the runtime that interprets programs written in Java. “All the smartness is underneath, in the runtime system,” Leiserson explains.

One unusual feature of Cilk’s runtime is the way it allocates tasks to different cores. Many parallel systems, Leiserson explains, use a technique called “work sharing,” in which a core with a parallel task queries the other cores on the chip to see which are free to take on some additional work. But passing messages between cores is much more time-consuming than executing computations on a given core, and it ends up eating into the gains afforded by parallel execution. The Cilk runtime instead uses a technique called “work stealing.” A core that generates a host of tasks that could, in principle, be executed in parallel just queues them up in its own memory, as it would if there were no other cores on the chip. A core that finds itself without work, on the other hand, simply selects one other core at random and pulls tasks out of its queue. As long as the program has enough parallelism in it, this drastically reduces the communication overhead.

One of the advantages of Cilk, Leiserson explains, is that the programmer writes the same program whether it’s going to run on a multicore computer or a single-core computer. Execution on a single-core computer is no different than execution on a computer with multiple cores, all but one of which is busy. Indeed, Cilk’s advantages are evident enough that Intel now designs its compilers — the programs that convert code into instructions intelligible to computers — to work with Cilk.

Amarasinghe is attacking the problem of parallel programming on several fronts. One difficulty with parallel programs is that their behavior can be unpredictable: If, for instance, a computation is split between two cores, the program could yield a very different result depending on which core finishes its computation first. That can cause headaches for programmers, who often try to identify bugs by changing a line or two of code and seeing what happens. That approach works only if the rest of the program executes in exactly the same way. Amarasinghe and his students have developed a system in which cores report their results in an order determined by the number of instructions they’ve executed, not the time at which they finished their computations. If a core with a short list of instructions runs into an unexpected snag — if, say, its request for data from main memory gets hung up — the other cores will wait for it to finish before reporting their own results, preserving the order in which the results arrive.

Another project, called StreamIt, exploits the parallelism inherent in much digital signal processing. Before a computer can display an Internet video, for instance, it needs to perform a slew of decoding steps — including several different types of decompression and color correction, motion compensation and equalization. Traditionally, Amarasinghe says, video software will take a chunk of incoming data, pass it through all those decoding steps, and then grab the next chunk. But with StreamIt, as one chunk of data is exiting a step, another chunk is entering it. The programmer just has to specify what each step does, and the system automatically divides up the data, passes it between cores, and synthesizes the results.

A programmer trying to decide how to perform a particular computation generally has a range of algorithms to choose from, and which will work best depends on the data it’s handling and the hardware it’s running on. Together with professor of applied mathematics Alan Edelman, Amarasinghe has developed a language called PetaBricks that allows programmers to specify different ways to perform the same computation. When a PetaBricks program launches, it performs a series of measurements to determine which types of operations will work best on that machine under what circumstances. Although PetaBricks could offer mild advantages even on single-core computers, Amarasinghe explains, it’s much more useful on multicore machines. On a single-core machine, one way of performing a computation might, in rare cases, prove two or three times as efficient as another; but because of the complexities of parallel computing, the difference on a multicore machine could be a factor of 100 or more.

One of the more radical parallel-programming proposals at the Computer Science and Artificial Intelligence Laboratory comes from in the lab of Panasonic Professor of Electrical Engineering Gerald Sussman. Traditionally, have thought of computers as having two fundamental but distinct components: a logic circuit and a memory bank. In practice, that distinction has been complicated by evolving hardware designs, but for purposes of theoretical analysis, it’s generally taken for granted.

Sussman and his former postdoc Alexey Radul, who completed his PhD at MIT in 2009 and is now at the Hamilton Institute in Maynooth, Ireland, suggest that we instead envision a computer as a fleet of simple processors and memory cells, and programming as wiring those elements together in different patterns. That conception, Radul believes, would make it easier to design software to solve problems common in artificial intelligence, such as constraint-satisfaction problems, whose solutions need to meet several sometimes-contradictory conditions at once. Sudoku puzzles are a simple example.

Radul’s network is an abstraction, designed to make things easier for AI researchers: It could, in principle, be implemented on a single core. But it obviously lends itself to multicore computing. Either way, one of the central problems it poses is how to handle situations in which multiple processing units are trying to store different values in a shared memory cell. In his doctoral thesis, Radul demonstrated how to design memory cells that store information about data rather than storing the data themselves, much like a Sudoku solver who jots several possibilities in the corner of an empty square. But Radul acknowledges that designing the memory cells is just a first step in making his and Sussman’s system feasible. Reinventing computing from the ground up will take more work than that.

This story is republished courtesy of MIT News (, a popular site that covers news about MIT research, innovation and teaching.

Explore further: Parallel course: Researchers help ease transition to parallel programming

More information: Computer chips' clocks have stopped getting faster. To maintain the regular doubling of computer power that we now take for granted, chip makers have been giving chips more “cores,” or processing units. But how to distribute computations across multiple cores is a hard problem, and this five-part series of articles examines the different levels at which MIT researchers are tackling it, from hardware design up to the development of new programming languages.

Designing the hardware -
The next operating system -
Retooling algorithms -
Minimizing communication between cores -

Related Stories

New software design technique allows programs to run faster

April 5, 2010

( -- Researchers at North Carolina State University have developed a new approach to software development that will allow common computer programs to run up to 20 percent faster and possibly incorporate new security ...

Computing, Sudoku-style

April 28, 2010

When Alexey Radul began graduate work at MIT's Computer Science and Artificial Intelligence Lab in 2003, he was interested in natural-language processing -- designing software that could understand ordinary written English. ...

The next operating system

February 24, 2011

At the most basic level, a computer is something that receives zeroes and ones from either memory or an input device — like a keyboard — combines them in some systematic way, and ships the results off to either ...

Retooling algorithms

February 25, 2011

At its most fundamental, computer science is about the search for better algorithms — more efficient ways for computers to do things like sort data or filter noise out of digital signals. But most new algorithms are ...

Recommended for you

Inferring urban travel patterns from cellphone data

August 29, 2016

In making decisions about infrastructure development and resource allocation, city planners rely on models of how people move through their cities, on foot, in cars, and on public transportation. Those models are largely ...

How machine learning can help with voice disorders

August 29, 2016

There's no human instinct more basic than speech, and yet, for many people, talking can be taxing. 1 in 14 working-age Americans suffer from voice disorders that are often associated with abnormal vocal behaviors - some of ...

Apple issues update after cyber weapon captured

August 26, 2016

Apple iPhone owners on Friday were urged to install a quickly released security update after a sophisticated attack on an Emirati dissident exposed vulnerabilities targeted by cyber arms dealers.


Adjust slider to filter visible comments by rank

Display comments: newest first

3 / 5 (2) Mar 02, 2011
Haskell for the win!!!
1.5 / 5 (2) Mar 02, 2011
hmmm aha but maybe what we need is a programming language that takes advantage of quantum computers.
1 / 5 (2) Mar 02, 2011
I realize that segment of code is meant just to demonstrate a common algorithm with the syntax for making parallel threads, but a self-recursive function call is terribly inefficient for solving fibonacci numbers. The amount of memory and processing power this uses compared to an iterative approach is completely INSANE, with or without parallel processors.

So while it's a familiar segment of code for demonstrating concepts, it's a completely moronic application as far as practical value goes.

A better code would be some sort of optimized search algorithm to split a container into smaller containers and have each "spawn" search a different sub-container.
2.3 / 5 (3) Mar 02, 2011
This could allow the development of Chess algorithms and Real Time Strategy game A.I. which could potentially be exponentially better than existing algorithms and scripting engines.

Well, Chess wouldn't improve exponentially, but probably noticeably. Each processor could work on one specific family of game possibilities, recursively spawning new threads for diverging possibilities for each turn, and calculate the value of those positions, etc.

Excluding absurd situations such as defeating your opponent without losing a piece (which I've done several times,) There are certain limits to how many moves are available to a player on a given turn. If you have only a king and a queen, then the maximum number of available moves is no higher than 35, assuming none of the moves are blocked.

The fewest number of moves a player can have available is a king oscilating between two spaces on his only available move.
2.3 / 5 (3) Mar 02, 2011
Because the weakest possible position BEFORE check or checkmate is a lone king in a corner with only two spaces to work with, this means the maximum possible available moves which does not result in a position which automatically wins the game is 63* minus the number of squares occupied by your own pieces.

*you could perpetually check the enemy King, though if you had 63 available moves in one turn you should have already won with a queen or rook.

So the computer attempts to play the game "backwards" by finding all possible moves on each turn, assigning the resulting position a value, and then finding all possible responses, and the resulting position a value, and repeating this process, usually limited by some reasonable factor such as time or a count of iterations to avoid a perpetual calculation. Then once it has found all possibilities and assigned them a value, it makes the move which has the greatest value.
2.3 / 5 (3) Mar 02, 2011
So the idea here is that you would design the algorithm to discard redundant positions to kill those branches of the tree, which they already do, but then the ability to spawn multiple branches from each successive branch would allow the computer to calculate N branches simultaneously, where N is the number of processors. So for example, if the limiting factor is "time" (thinking of the old "one second" rule on Chessmaster,) then it would be "almost" N times more powerful of an algorithm on an N-core computer as compared to a similar algorithm on a 1 core computer.

The down side? Each player has 20 possible first moves, and depending on which move each player made on the first move, there are even more possible second moves. Depending on first moves, there are 19 to 24 possible second moves for white.

So the first two moves are something like 20*20*24*24 possible combinations of moves. Being able to split the algorithm efficiently between multiple cores is a big issue.
3.7 / 5 (3) Mar 03, 2011
I like the simplicity of the new syntax. Good work.
2 / 5 (2) Mar 03, 2011
Uh, they could try Occam, which was written to handle multiple transputer processors many, many years ago...
3 / 5 (2) Mar 03, 2011
One of the advantages of Cilk, Leiserson explains, is that the programmer writes the same program whether its going to run on a multicore computer or a single-core computer

That's true with current languages too. When I write multithreaded apps, they can run on one or multi-core machines. On single CPU machines, there's no advantage, but when run on multi-core machines, the threads run in true parallel, as opposed to time-sliced simulations of parallel on a single core machine.

the other cores will wait for it to finish before reporting their own results, preserving the order in which the results arrive.


Now, we need debuggers that can handle stepping through multi-threaded apps better than the current crop. There have been improvements on this, but we still need more advancement.

The shared queue method, is by far, one of the best methods to evenly distribute work. This is not a new idea, but I'm glad to see it implemented at the language (or runtime) level.
1 / 5 (1) Mar 03, 2011
Each processor could work on one specific family of game possibilities, recursively spawning new threads for diverging possibilities for each turn, and calculate the value of those positions, etc.

The problem with this (which is what's currently done) is some will have fewer possibilities and will finish early and "starve", hence the task queuing method discussed in this article which solves that problem. Instead trying to figure out how to divide the work, the work tasks are simply shoved onto a queue and each processor works on one queue item (which could be a chunk of sub-items). When a processor finishes one queue item, it pulls another item off the queue. This way, they all stay busy until the last set of items on is removed from the queue and they all finish at almost exactly the same time.

1 / 5 (1) Mar 03, 2011

Imagine Wal-Mart with 20 register queues: You pick one and find it's slower than the others that free up. Now there's un-utilized processing power. Instead, make one line and each each register is processing one customer. The first one to free up gets the next person in the global queue. No registers are starved of tasks and no customers are stuck behind one slow customer. Some queue items will inevitably take longer to process than others, so you can't pre-allocate them effectively without knowing exactly how long they take to process. So, avoid pre-allocating them to processors and let the processors request work when they're free.

Think of a help desk application. Tickets come into the queue. As soon as an operator completes work with one ticket, they take the next one off the global queue. If these queue items are people on hold, it eliminates the caller from being unfairly held in one slow operator's queue and the whole thing is more efficient.
1 / 5 (1) Mar 03, 2011
Being able to split the algorithm efficiently between multiple cores is a big issue.

Hence the purpose of not attempting it. The shared queue method resolves this and in a simple way. One of the rare win/win options. I've implemented this in some apps and it makes a big difference.


BTW, C# 4.0 now supports some parallelism at the language level today: _ttp://

I don't believe the runtime is smart enough to put the tasks in queues yet, but code written with today's C# 4.0 will immediately benefit when the runtime is updated to do this. No change is necessary in your code to take advantage of that future upgrade. Of course, if you want queues today, you'll have to code it yourself, which really isn't that difficult, but it's better when the runtime handles it for you.
1 / 5 (1) Mar 03, 2011
QC: To clarify, your ideas are sound and have been the method used by most developers trying to do multi-threaded programming. I hope my responses didn't come across as derogatory, because they weren't meant to be. Just pointing out how the problem of allocating tasks is, for the most part, easily solved with a task queuing type of method.
1 / 5 (1) Mar 03, 2011
hmmm aha but maybe what we need is a programming language that takes advantage of quantum computers.

Well that is just peachy! When are you going to build that quantum comp?
1 / 5 (1) Mar 03, 2011
This seems inadequate to any real parallelism challenge for the future. The troubles of programming in parallel go far beyond simply specifying "parallelize this instruction." There are issues with nonuniform memory architectures, migrating threads, memory bandwidth, caching, etc. that require knowledge of the architecture in order to get good performance, at least at present. The idea that this can all be abstracted away is a nice one, but will require hardware support and much more than just "spawn, sync and for."
1 / 5 (1) Mar 05, 2011
to get parallel execution, program for concurrency.

google's new "go" language (golang) for the win! :)
1 / 5 (1) Mar 06, 2011
Writing wicked fast code is more like cutting a gem that turning a crank. The architeture of the machine is a primary consideration. Both Intel and the Portland Group make compilers that support parallel computing. Cilk is simple but if it does not use native code I might as well use compiled Mathematica. What is needed now are FORTRAN & c++ compilers that can take advantage of FPGA cards and motherboards with the bus capacity to make it worth thee effort.

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.