New system greatly speeds common parallel-computing algorithms

July 3, 2017 by Larry Hardesty, Massachusetts Institute of Technology
A new system dubbed Fractal achieves speedups of up to 88-fold through a parallelism strategy known as speculative execution. Credit: MIT News

The chips in most modern desktop computers have four "cores," or processing units, which can run different computational tasks in parallel. But the chips of the future could have dozens or even hundreds of cores, and taking advantage of all that parallelism is a stiff challenge.

Researchers from MIT's Computer Science and Artificial Intelligence Laboratory have developed a new system that not only makes parallel programs run much more efficiently but also makes them easier to code.

In tests on a set of benchmark algorithms that are standard in the field, the researchers' new system frequently enabled more than 10-fold speedups over existing systems that adopt the same parallelism strategy, with a maximum of 88-fold.

For instance, algorithms for solving an important problem called max flow have proven very difficult to parallelize. After decades of research, the best parallel implementation of one common max-flow algorithm achieves only an eightfold speedup when it's run on 256 parallel processors. With the researchers' new system, the improvement is 322-fold—and the program required only one-third as much code.

The new system, dubbed Fractal, achieves those speedups through a parallelism strategy known as speculative execution.

"In a conventional parallel program, you need to divide your work into tasks," says Daniel Sanchez, an assistant professor of electrical engineering and computer science at MIT and senior author on the new paper. "But because these tasks are operating on shared data, you need to introduce some synchronization to ensure that the data dependencies that these tasks have are respected. From the mid-90s to the late 2000s, there were multiple waves of research in what we call speculative architectures. What these systems do is execute these different chunks in parallel, and if they detect a conflict, they abort and roll back one of them."

Constantly aborting computations before they complete would not be a very efficient parallelization strategy. But for many applications, aborted computations are rare enough that they end up squandering less time than the complicated checks and updates required to synchronize tasks in more conventional parallel schemes. Last year, Sanchez's group reported a system, called Swarm, that extended speculative parallelism to an important class of computational problems that involve searching data structures known as graphs.
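To make the abort-and-retry idea concrete, here is a minimal software sketch of speculative updates to shared data: each task does its work against a snapshot and commits only if nothing conflicting happened in the meantime; otherwise it aborts and re-executes. This is only an illustration of the strategy described above, not the Swarm or Fractal mechanism itself, which performs conflict detection and rollback in dedicated hardware; the `VersionedCell` class and `speculative_increment` function are invented for the example.

```python
import threading

class VersionedCell:
    """A shared integer with a version counter used to detect conflicts."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self.commit_lock = threading.Lock()

def speculative_increment(cell, amount):
    """Optimistically compute an update; abort and retry on a conflict."""
    while True:
        snapshot_version = cell.version      # speculative read
        new_value = cell.value + amount      # the actual "work"
        with cell.commit_lock:               # commit point
            if cell.version == snapshot_version:
                cell.value = new_value       # no conflict: commit
                cell.version += 1
                return
        # Another task committed after our read: abort and re-execute.

cell = VersionedCell()
threads = [threading.Thread(target=speculative_increment, args=(cell, 1))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.value)  # 8: every update lands, despite occasional aborts
```

When conflicts are rare, almost every attempt commits on the first try, which is why speculation can beat the constant locking and synchronization of conventional parallel schemes.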

Irreducible atoms

Research on speculative architectures, however, has often run aground on the problem of "atomicity." Like all parallel architectures, speculative architectures require the programmer to divide programs into tasks that can run simultaneously. But with speculative architectures, each such task is "atomic," meaning that it should seem to execute as a single whole. Typically, each atomic task is assigned to a separate processing unit, where it effectively runs in isolation.

Atomic tasks are often fairly substantial. The task of booking an airline flight online, for instance, consists of many separate operations, but they have to be treated as an atomic unit. It wouldn't do, for instance, for the program to offer a plane seat to one customer and then offer it to another because the first customer hasn't finished paying yet.
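As a toy illustration of that point (with invented seat and customer names), the check-then-assign sequence below is guarded so that it behaves as one indivisible step; if the check and the assignment could be interleaved with another customer's booking, the same seat could be sold twice.

```python
import threading

seats = {"12A": None}              # seat -> customer; None means free
booking_lock = threading.Lock()

def book_seat(seat, customer):
    """The whole offer-and-confirm sequence runs as one atomic task."""
    with booking_lock:
        if seats[seat] is None:    # seat still free?
            # ...payment would happen here, still inside the atomic task...
            seats[seat] = customer
            return True
        return False               # someone else completed a booking first

print(book_seat("12A", "Customer 1"), book_seat("12A", "Customer 2"))
# True False  -> the seat is sold exactly once
```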

With speculative execution, large atomic tasks introduce two inefficiencies. The first is that, if the task has to abort, it might do so only after chewing up a lot of computational cycles. Aborting smaller tasks wastes less time.

The other is that a large atomic task may have internal subroutines that could be parallelized efficiently. But because the task is isolated on its own processing unit, those subroutines have to be executed serially, squandering opportunities for performance improvements.

Fractal—which Sanchez developed together with MIT graduate students Suvinay Subramanian, Mark Jeffrey, Maleen Abeydeera, Hyun Ryong Lee, and Victor A. Ying, and with Joel Emer, a professor of the practice and senior distinguished research scientist at the chip manufacturer NVidia—solves both of these problems. The researchers, who are all with MIT's Department of Electrical Engineering and Computer Science, describe the system in a paper they presented this week at the International Symposium on Computer Architecture.

With Fractal, a programmer adds a line of code to each subroutine within an atomic task that can be executed in parallel. This will typically increase the length of the serial version of a program by a few percent, whereas an implementation that explicitly synchronizes parallel tasks will often increase it by 300 or 400 percent. Circuits hardwired into the Fractal chip then handle the parallelization.
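The paper defines Fractal as a hardware execution model rather than a software library, so the decorator below is only a hypothetical stand-in for the kind of one-line annotation the article describes; in this sketch it deliberately does nothing, which is the point: the serial program barely changes.

```python
def fractal_parallel(fn):
    """Hypothetical marker: 'this subroutine may run as its own nested
    speculative task.' In this software sketch it is a no-op."""
    return fn

@fractal_parallel                        # <- the single added line
def update_neighbors(graph, dist, node):
    # Work that the hardware could split off as finer-grained tasks.
    for nbr, weight in graph[node]:
        dist[nbr] = min(dist.get(nbr, float("inf")), dist[node] + weight)

def relax_node(graph, dist, node):       # the enclosing atomic task
    update_neighbors(graph, dist, node)

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
dist = {"a": 0}
relax_node(graph, dist, "a")
relax_node(graph, dist, "b")
print(dist)  # {'a': 0, 'b': 1, 'c': 3}
```

On Fractal hardware, the marked subroutine would become its own nested speculative task; run serially, the program behaves exactly as before.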

Time chains

The key to the system is a slight modification of a circuit already found in Swarm, the researchers' earlier speculative-execution system. Swarm was designed to enforce some notion of sequential order in parallel programs. Every task executed in Swarm receives a time stamp, and if two tasks attempt to access the same memory location, the one with the later time stamp is aborted and re-executed.

Fractal, too, assigns each atomic task its own time stamp. But if an atomic task has a parallelizable subroutine, the subroutine's time stamp includes that of the task that spawned it. And if the subroutine, in turn, has a parallelizable subroutine, the second subroutine's time stamp includes that of the first, and so on. In this way, the ordering of the subroutines preserves the ordering of the atomic tasks.
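That ordering property can be sketched with ordinary tuples, which compare lexicographically: a subroutine's stamp starts with its parent's stamp, so every descendant of an earlier atomic task still orders before a later task. The encoding below is invented for illustration and is not Fractal's hardware format.

```python
def spawn(parent_stamp, child_index):
    """Time stamp of the child_index-th subroutine spawned by a task:
    the parent's stamp with one more element appended."""
    return parent_stamp + (child_index,)

task_a = (1,)                # first atomic task
task_b = (2,)                # second atomic task

a_sub = spawn(task_a, 0)     # (1, 0): subroutine of task A
a_sub_sub = spawn(a_sub, 3)  # (1, 0, 3): subroutine nested one level deeper
b_sub = spawn(task_b, 0)     # (2, 0): subroutine of task B

# Lexicographic comparison keeps every piece of task A ahead of task B.
assert task_a < a_sub < a_sub_sub < task_b < b_sub
print("nested time stamps preserve the ordering of the atomic tasks")
```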

As tasks spawn subroutines that spawn subroutines and so on, the concatenated time stamps can become too long for the specialized circuits that store them. In those cases, however, Fractal simply moves the front of the time-stamp train into storage. This means that Fractal is always working only on the lowest-level, finest-grained tasks it has yet identified, avoiding the problem of aborting large, high-level atomic tasks.

More information: Fractal: An Execution Model for Fine-Grain Nested Speculative Parallelism. ISCA '17, June 24-28, 2017, DOI: 10.1145/3079856.3080218

7 comments

rderkis
5 / 5 (1) Jul 03, 2017
The new system, dubbed Fractal, achieves those speedups through a parallelism strategy known as speculative execution.

This is really neat, but you would have thought such talent would have come up with some kind of original, unique name, even if it was a combination of words.
Google the word fractal along with programming, cores, etc. You won't find it, because the word fractal is so common.
Phil DePayne
5 / 5 (2) Jul 03, 2017
As a computer programmer, I can say that as great as this system sounds, you always have to fight the inertia of people who don't want to learn something new, even if it is easier.

EmceeSquared
5 / 5 (3) Jul 03, 2017
Phil DePayne:
you always have to fight the inertia of those people who don't want to learn something new even if it is easier.


These heuristics will become integrated into compilers and libraries as they're tweaked by feedback from real world applications. Humans will continue to be used for the high level translation of human problems across whatever moving boundary to machine execution, while the tools match the inner complexity.

For most applications anyway. Some specialized ones demanding high performance and rapid deployment, like superfast financial trading strategies and executions, or cutting edge machine vision or learning, will indeed need smart humans who've learned to use the first implementations. Theirs is the feedback that will tweak these tools for the rest of us.
rderkis
5 / 5 (3) Jul 03, 2017
people who don't want to learn something new even if it is easier.

I too am a programmer, and most of these kinds of techniques that optimize the code are eventually built right into the programming language's compiler.
BackBurner
not rated yet Jul 04, 2017
Back in the late 70's there was a project I was involved with at Stanford called "ONCE: A Single Assignment Language". The basic idea was that the results of any computation were always assigned a unique memory location (data were never overwritten "in place"), and dependence on an upstream computation caused a program fragment to block until the location was written to by the upstream computation. It sounds very similar to the approach described here as "Fractal". I wonder if the research team involved is aware of ONCE?
rderkis
not rated yet Jul 04, 2017
results of any computation were always assigned a unique memory location (data were never overwritten "in place"),

Wow, back then memory was small and at a premium. I am surprised at that approach.
Whydening Gyre
not rated yet Jul 04, 2017
... It sounds very similar to the approach described here as "Fractal". I wonder if the research team involved is aware of ONCE?

Once Fractal, always Fractal...:-)
