Parallel course: Researchers help ease transition to parallel programming

Parallel course
An image of a Quad-Core AMD Opteron processor. Courtesy of AMD

( -- In 1995, a good computer chip had a clock speed of about 100 megahertz. Seven years later, in 2002, a good computer chip had a clock speed of about three gigahertz -- a 30-fold increase. And now, seven years later, a good computer chip has a clock speed of... still about three gigahertz.

Four or five years ago, chip makers realized that they couldn't make chips go much faster, so they adopted a new strategy for increasing computers' power: putting multiple "cores," or processing units, on each chip. In theory, a chip with two cores working in parallel can accomplish twice as much as a chip with one core. But software developers tend to see their programs as lists of sequential instructions, and they've had trouble breaking up those instructions in ways that take advantage of parallel processing. Professor of Computer Science and Engineering Saman Amarasinghe and his colleagues at MIT's Computer Science and Lab are giving them a hand.

In the past, computer science researchers had hoped that sequential programs could be converted into parallel programs automatically. "I spent a good part of my life trying to do that," says Amarasinghe. But Amarasinghe has now come to the conclusion that "if you want to get parallel performance, you have to start writing parallel code." At a high level, he believes, programmers will need to specify which tasks performed by their programs can run concurrently.

"Just writing anything parallel doesn't mean that it's going to run fast," Amarasinghe says. "A lot of parallel programs will actually run slower, because you parallelize the wrong place." And determining when to parallelize, and which cores to assign which tasks, is something that computers can do automatically, Amarasinghe believes.

Amarasinghe divides his group's work on multicore computing into two categories: tools to ease programmers' transition to parallel programming and tools to improve programs' performance once that transition is complete.

Predictable parallelism

The first set of tools addresses one of the central difficulties of parallel programming: if several tasks are assigned to separate cores, there's no way to be sure which will be completed first. Most of the time, that's not a problem. But occasionally, different cores have to access the same resource — updating the same stored value, for instance, or sending data to the monitor. Depending on which core reaches the resource first, the program's behavior could be quite different. Even given the exact same inputs, a multicore program may not execute in quite the same way twice in a row.

That's a nightmare for developers trying to debug their programs. To find a bug, developers run their programs over and over, with slight variations, to localize problems to ever-smaller blocks of code. But those variations don't convey any useful information unless the rest of the program behaves in the exact same way.

Amarasinghe and his grad students Marek Olszewski and Jason Ansel designed a system to make multicore programs more predictable. When a core tries to access a shared resource, it's assigned a priority based not on the time of its request but on the number of instructions it's executed. At any point during a parallel program's execution, any two cores will have executed about the same number of instructions. But if one core has had to, say, access a little-used value in a distant part of the computer's memory, it might have fallen slightly behind. In those cases, the researchers' system gives it a chance to catch up. Once it's reached the same instruction count, if it hasn't issued a higher-priority request for the shared resource, the first core's request is granted.

That waiting around means, of course, that the program as a whole will run more slowly. But in experiments, the researchers determined that, on average, a software implementation of their system increased program execution time by only about 16 percent. For developers, that's a small price to pay for reliable debugging. Moreover, if the ability to count instructions were built into the cores themselves, the system would be much more efficient. Indeed, Intel — which has demonstrated eight-core chips at trade shows — is talking with Amarasinghe about funding further research on the system.

"A lot of other people have this alternative approach," says Krste Asanovic, an associate professor at the Parallel Computing Lab at the University of California, Berkeley. "You kind of just run it [the parallel program] however it runs and then try and record what it did so then you can go back and replay it." But with the MIT system, he says, "You don't have to worry about recording how it executed because when you execute it, it will always run the same way. So it's strictly more powerful than replay."

Maximizing multicore

Amarasinghe's lab has two projects that fit into his second category — tools that optimize the performance of parallel programs. The first deals with streaming applications, such as video broadcast over the Web. Before your computer can display an Internet video, it needs to perform a slew of decoding steps — splitting the data stream into parallel signals, performing different types of decompression and color correction, recombining the signals, performing motion compensation, equalizing the signal, and so on. Traditionally, Amarasinghe says, video software will take a chunk of incoming data, pass it through all those decoding steps, and then grab the next chunk.

That approach, says Amarasinghe, squanders the inherent parallelism of signal-processing circuits. As one chunk of data exits each decoding step, the next chunk of data could be entering it. StreamIt, a language developed by Amarasinghe's group, allows programmers to specify the order of the steps but automatically determines when to pass how much data to each step.

"If you have some kind of specification document, they're all written in block diagrams" that illustrate a signal-processing path as a series of sequential steps, Amarasinghe says. "When they write to [the programming language] C, you lose those blocks." StreamIt, by contrast, "is very close to the original thinking," Amarasinghe says.

"Other people had talked about stream programming, but StreamIt was really putting everything together in a language and compiler tool chain so people could actually write streaming programs," says Asanovic. "So I think the language design and compiler tool chain that followed from that was pretty influential."

The group's other parallel-computing project helps programs adapt on the fly to changing conditions. Often, Amarasinghe explains, a programmer has several ways of tackling a particular task. There are, for example, many common algorithms for sorting data, with names like quicksort, insertion sort, radix sort, and the like. But the algorithms perform differently under different circumstances. Quicksort is often the best choice, but not "when the data is very small," Amarasinghe says. With huge data sets, however, radix sort may work even better than quicksort.

That variability is compounded when the algorithms are being assigned to different cores. So Amarasinghe's group has designed another language that asks the developer to specify four or five different methods for performing a given computational task. When the program is running, the computer automatically identifies the most efficient method.

It might seem that that approach would lead to bloated programs. But Amarasinghe points out that when a typical program executes, it devotes much more memory to storing data than it does to storing instructions. "Something like a sort is, like, five lines of code that work on arrays with billions of elements. So making five lines of code 25 is not a big issue," he says.

It does require some added work on the programmer's part. But "there's no free lunch: you won't get nice multicore performance by doing nothing," Amarasinghe says. "Believe me, we've been trying for 50 years."

Provided by Massachusetts Institute of Technology (news : web)

Explore further

'Not so fast, supercomputers,' say software programmers

Citation: Parallel course: Researchers help ease transition to parallel programming (2009, October 23) retrieved 25 August 2019 from
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Feedback to editors

User comments

Oct 23, 2009
I heard from a buddy that the issue with increasing clock speed above 3GHz is the electronic signal behaves more like an EM wave than an electrical pulse. Already wiring corners must be rounded or the singal leaves the board. This implies that the wiring is more like a wave guide than an electrical wire. Think of cable TV (coax)versus phone lines (copper pairs.
The article should make clarify the difference between hyperthreading and a true parallel processor.

Oct 23, 2009
it's assigned a priority based not on the time of its request but on the number of instructions it's executed.

Unfortunately it's not that simple because the parallel code sections may be accessing level 1, 2, or 3 cache memory or external dram so that the number of instructions executed has little to do with the time to execute them.

Oct 23, 2009
The inherent problem is in the interpreter rather than being an active runtime environment that handles functional objects as i/o applications it treats them as linear sequential scripts. An intelligent interpreter seeks the in's out's and id's of functions and connects the code between as relational databases do.

The second major hurdle is pushing the front side bus to keep up with CPU instructions reducing buffer and increasing hardware access time.

Oct 23, 2009
This is my understanding of the ~3 GHz limitation: for about 40 years fundamental circuit design has relied on of Kirchhoff's current law (i.e. sum(I_i)=0 for a circuit node), but this is an approximation of Maxwell's laws where drho/dt=0. This translates to the approximation holding when f>>c/L where L is the size of a circuit element. Using L = 1cm we get 3e10 Hz, so this starts to break down around 3e9 Hz.

Oct 23, 2009
Why did they pick an example of parallel programming for which video cards with hundreds of cores are already available? What programs need massive parallelism that are not catered for by current GPU cores?

Oct 23, 2009
What programs need massive parallelism that are not catered for by current GPU cores?

When I work on video games, I like the AI path-finding and the AI decision making to run at the same time since path-finding can be tedious when traversing a 3d environment. Physics engines is also something to have on a separate core. Render engine to run on its own core.

Oct 23, 2009
I can see some problems with the implementation of a cycle count for each processor. The operating system needs to be able to change which processors run which threads over the entire system; this will choke if the processors are not running on the same threads at the same time. Intel would also have to change iret to not have a miscount due to interrupt handlers, which would break all existing OSes.

Oct 24, 2009
What programs need massive parallelism that are not catered for by current GPU cores?

When I work on video games, I like the AI path-finding and the AI decision making to run at the same time since path-finding can be tedious when traversing a 3d environment. Physics engines is also something to have on a separate core. Render engine to run on its own core.

Exactly. Path finding absolutely MUST run on a separate core, or at the very least in a separate thread. That is, if you want your FPS to stay above 30 and not just stop completely while your algorithm iterates through the available options. Even with something like A* you still can be iterating through a ton of data, often recursively. You would definitely prefer that to run in parallel.

Oct 24, 2009
I must say the standaard clock speeds of today are still very low. Indeed around 3Ghz or a little bit more. But new technology's like more cores or new materials and technology's we will be able to reach a 1 Thz and more. Lets say the next 5 years we can achieve 10 Ghz or more with 15nm and beyond.

Oct 24, 2009
Funny thing: I've just grabbed my copy of 'Programming in Occam' off the shelf. That was a language designed to run on multiple Transputer cores, and pipeline each thread according to available resource. Slow, fast, one, four, sixteen, twenty, whatever mix you got, it could use without re-compiling...

The (c) date ? 1987...

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more