Language barrier: To take advantage of multicore chips, programmers will need fundamentally new software
March 2, 2011 By Larry Hardesty
A snippet of code written in the Cilk language, showing the 'spawn' and 'sync' commands. Credit: Christine Daniloff
For decades, computer scientists tried to develop software that could automatically turn a conventional computer program -- a long sequence of instructions intended to be executed in order -- into a parallel program -- multiple sets of instructions that can be executed at the same time. Now, most agree that that was a forlorn hope: Code that can be parallelized is too hard to recognize, and the means for parallelizing it are too diverse and context-dependent. "If you want to get parallel performance, you have to start writing parallel code," says MIT computer-science professor Saman Amarasinghe. And MIT researchers are investigating a host of techniques to make writing parallel code easier.
One of the most prominent is a software development system created by computer-science professor Charles Leiserson and his Supertech Research Group. Initially, the system used the programming language C hence its name, Cilk. Cilk, Leiserson says, adds three commands to C: spawn," sync, and a variation of the standard command for. If a programmer has identified a section of a program that can be executed in parallel if, say, the same operation has to be performed on a lot of different data he or she simply inserts the command spawn before it. When the program is running, the Cilk system automatically allocates the spawned computation as many cores as are free to handle it. If the results of the spawned computations need to be aggregated before the program moves on to the next instruction, the programmer simply inserts the command sync.
The reason Leisersons group could get away with such minimal alteration of C is the runtime system that underlies programs written in Cilk. A runtime is an extra layer of software between a program and a computers operating system, which allows the same program to run on much different machines; the most familiar example is probably the runtime that interprets programs written in Java. All the smartness is underneath, in the runtime system, Leiserson explains.
One unusual feature of Cilks runtime is the way it allocates tasks to different cores. Many parallel systems, Leiserson explains, use a technique called work sharing, in which a core with a parallel task queries the other cores on the chip to see which are free to take on some additional work. But passing messages between cores is much more time-consuming than executing computations on a given core, and it ends up eating into the gains afforded by parallel execution. The Cilk runtime instead uses a technique called work stealing. A core that generates a host of tasks that could, in principle, be executed in parallel just queues them up in its own memory, as it would if there were no other cores on the chip. A core that finds itself without work, on the other hand, simply selects one other core at random and pulls tasks out of its queue. As long as the program has enough parallelism in it, this drastically reduces the communication overhead.
One of the advantages of Cilk, Leiserson explains, is that the programmer writes the same program whether its going to run on a multicore computer or a single-core computer. Execution on a single-core computer is no different than execution on a computer with multiple cores, all but one of which is busy. Indeed, Cilks advantages are evident enough that Intel now designs its compilers the programs that convert code into instructions intelligible to computers to work with Cilk.
Amarasinghe is attacking the problem of parallel programming on several fronts. One difficulty with parallel programs is that their behavior can be unpredictable: If, for instance, a computation is split between two cores, the program could yield a very different result depending on which core finishes its computation first. That can cause headaches for programmers, who often try to identify bugs by changing a line or two of code and seeing what happens. That approach works only if the rest of the program executes in exactly the same way. Amarasinghe and his students have developed a system in which cores report their results in an order determined by the number of instructions theyve executed, not the time at which they finished their computations. If a core with a short list of instructions runs into an unexpected snag if, say, its request for data from main memory gets hung up the other cores will wait for it to finish before reporting their own results, preserving the order in which the results arrive.
Another project, called StreamIt, exploits the parallelism inherent in much digital signal processing. Before a computer can display an Internet video, for instance, it needs to perform a slew of decoding steps including several different types of decompression and color correction, motion compensation and equalization. Traditionally, Amarasinghe says, video software will take a chunk of incoming data, pass it through all those decoding steps, and then grab the next chunk. But with StreamIt, as one chunk of data is exiting a step, another chunk is entering it. The programmer just has to specify what each step does, and the system automatically divides up the data, passes it between cores, and synthesizes the results.
A programmer trying to decide how to perform a particular computation generally has a range of algorithms to choose from, and which will work best depends on the data its handling and the hardware its running on. Together with professor of applied mathematics Alan Edelman, Amarasinghe has developed a language called PetaBricks that allows programmers to specify different ways to perform the same computation. When a PetaBricks program launches, it performs a series of measurements to determine which types of operations will work best on that machine under what circumstances. Although PetaBricks could offer mild advantages even on single-core computers, Amarasinghe explains, its much more useful on multicore machines. On a single-core machine, one way of performing a computation might, in rare cases, prove two or three times as efficient as another; but because of the complexities of parallel computing, the difference on a multicore machine could be a factor of 100 or more.
One of the more radical parallel-programming proposals at the Computer Science and Artificial Intelligence Laboratory comes from in the lab of Panasonic Professor of Electrical Engineering Gerald Sussman. Traditionally, computer scientists have thought of computers as having two fundamental but distinct components: a logic circuit and a memory bank. In practice, that distinction has been complicated by evolving hardware designs, but for purposes of theoretical analysis, its generally taken for granted.
Sussman and his former postdoc Alexey Radul, who completed his PhD at MIT in 2009 and is now at the Hamilton Institute in Maynooth, Ireland, suggest that we instead envision a computer as a fleet of simple processors and memory cells, and programming as wiring those elements together in different patterns. That conception, Radul believes, would make it easier to design software to solve problems common in artificial intelligence, such as constraint-satisfaction problems, whose solutions need to meet several sometimes-contradictory conditions at once. Sudoku puzzles are a simple example.
Raduls network is an abstraction, designed to make things easier for AI researchers: It could, in principle, be implemented on a single core. But it obviously lends itself to multicore computing. Either way, one of the central problems it poses is how to handle situations in which multiple processing units are trying to store different values in a shared memory cell. In his doctoral thesis, Radul demonstrated how to design memory cells that store information about data rather than storing the data themselves, much like a Sudoku solver who jots several possibilities in the corner of an empty square. But Radul acknowledges that designing the memory cells is just a first step in making his and Sussmans system feasible. Reinventing computing from the ground up will take more work than that.
This story is republished courtesy of MIT News (http://web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.
More information: Computer chips' clocks have stopped getting faster. To maintain the regular doubling of computer power that we now take for granted, chip makers have been giving chips more cores, or processing units. But how to distribute computations across multiple cores is a hard problem, and this five-part series of articles examines the different levels at which MIT researchers are tackling it, from hardware design up to the development of new programming languages.
Designing the hardware - http://www.physorg … 7669712.html
The next operating system - http://www.physorg … -system.html
Retooling algorithms - http://www.physorg … orithms.html
Minimizing communication between cores - http://www.physorg … g-cores.html
Provided by
Massachusetts Institute of Technology
-
From lemons to lemonade: Reaction uses carbon dioxide to make carbon-based semiconductor,
32 comments
-
Thioridazine kills cancer stem cells in human while avoiding toxic side-effects of conventional cancer treatments,
3 comments
-
SpaceX private rocket blasts off for space station (Update),
42 comments
-
Climate scientists say they have solved riddle of rising sea,
30 comments
-
Research team claims to have found evidence Lake Cheko is impact crater for Tunguska Event,
18 comments
-
Ideas to mitigate risk of 911 calls being misdirected
May 24, 2012
-
Live scribe pen?
May 10, 2012
-
Shallow water flow simulation
May 07, 2012
-
Tablet for taking notes?
May 05, 2012
-
Best fit tablet for me?
May 05, 2012
-
Measure of Informaton
May 04, 2012
- More from Physics Forums - Computing & Technology
More news stories
SpotterRF debuts Radar Backpack Kit (w/ Video)
(Phys.org) -- SpotterRF has announced a special radar backpack kit designed to enhance situational awareness for soldiers on the ground. The company says its special radar is designed for warfighters as part ...
Probability of contamination from severe nuclear reactor accidents is higher than expected: study
Catastrophic nuclear accidents such as the core meltdowns in Chernobyl and Fukushima are more likely to happen than previously assumed. Based on the operating hours of all civil nuclear reactors and the number ...
Technology / Energy & Green Tech
May 22, 2012 |
3.6 / 5 (21) |
54
|
Delphi gasoline-injection engine technique rivals hybrid's edge
(Phys.org) -- Running a diesel like engine on gasoline is something Delphi is doing in notable fashion. They claim they are on to a promising way to enjoy an engine that gives the vehicle owner high efficiency ...
HyperSolar shows dirty water no barrier to power world
(Phys.org) -- The Santa Barbara, California, company, HyperSolar, is set to transparently share the ups and downs of its research experiences toward the companys ultimate vision, successfully producing ...
Tesla to launch electric sedan in US on June 22
Tesla Motors said Tuesday it would begin deliveries of "the world's first premium electric sedan" on June 22, slightly ahead of schedule.
Technology / Energy & Green Tech
May 22, 2012 |
4.5 / 5 (11) |
18
Scientist: Evolution debate will soon be history
(AP) -- Richard Leakey predicts skepticism over evolution will soon be history. Not that the avowed atheist has any doubts himself.
Dell tablet leak: 10.1-inch display, two-battery choice
(Phys.org) -- Headline after headline talks about vendors tablets in the wings as likely number-one contenders for the iPad. Such claims have justifiably been taken with a grain of salt, considering ...
SpaceX capsule has 'new car' smell, astronauts say (Update)
SpaceX's Dragon cargo vessel smells like a new car, said astronauts at the International Space Station after opening the hatches Saturday following the spacecraft's landmark mission to the orbiting lab.
Thousands of shellfish found dead in Peru
Thousands of crustaceans were found dead off the coast of Lima following the mystery mass death of dolphins and pelicans, the Peruvian Navy said Friday.
Keep food safety in mind this memorial day weekend
(HealthDay) -- Picnics, parades and cookouts are as much a part of Memorial Day weekend as tributes to the United States' war veterans.
Astronomers seize last chance in lifetime for Venus Transit
Astronomers are gearing for one the rarest events in the Solar System: an alignment of Earth, Venus and the Sun that will not be seen for another 105 years.
Mar 02, 2011
Rank: 3 / 5 (2)
Mar 02, 2011
Rank: 1.5 / 5 (2)
Mar 02, 2011
Rank: 1 / 5 (2)
So while it's a familiar segment of code for demonstrating concepts, it's a completely moronic application as far as practical value goes.
A better code would be some sort of optimized search algorithm to split a container into smaller containers and have each "spawn" search a different sub-container.
Mar 02, 2011
Rank: 2.3 / 5 (3)
Well, Chess wouldn't improve exponentially, but probably noticeably. Each processor could work on one specific family of game possibilities, recursively spawning new threads for diverging possibilities for each turn, and calculate the value of those positions, etc.
Excluding absurd situations such as defeating your opponent without losing a piece (which I've done several times,) There are certain limits to how many moves are available to a player on a given turn. If you have only a king and a queen, then the maximum number of available moves is no higher than 35, assuming none of the moves are blocked.
The fewest number of moves a player can have available is a king oscilating between two spaces on his only available move.
Mar 02, 2011
Rank: 2.3 / 5 (3)
*you could perpetually check the enemy King, though if you had 63 available moves in one turn you should have already won with a queen or rook.
So the computer attempts to play the game "backwards" by finding all possible moves on each turn, assigning the resulting position a value, and then finding all possible responses, and the resulting position a value, and repeating this process, usually limited by some reasonable factor such as time or a count of iterations to avoid a perpetual calculation. Then once it has found all possibilities and assigned them a value, it makes the move which has the greatest value.
Mar 02, 2011
Rank: 2.3 / 5 (3)
The down side? Each player has 20 possible first moves, and depending on which move each player made on the first move, there are even more possible second moves. Depending on first moves, there are 19 to 24 possible second moves for white.
So the first two moves are something like 20*20*24*24 possible combinations of moves. Being able to split the algorithm efficiently between multiple cores is a big issue.
Mar 03, 2011
Rank: 3.7 / 5 (3)
Mar 03, 2011
Rank: 2 / 5 (2)
Mar 03, 2011
Rank: 3 / 5 (2)
That's true with current languages too. When I write multithreaded apps, they can run on one or multi-core machines. On single CPU machines, there's no advantage, but when run on multi-core machines, the threads run in true parallel, as opposed to time-sliced simulations of parallel on a single core machine.
Nice!
Now, we need debuggers that can handle stepping through multi-threaded apps better than the current crop. There have been improvements on this, but we still need more advancement.
The shared queue method, is by far, one of the best methods to evenly distribute work. This is not a new idea, but I'm glad to see it implemented at the language (or runtime) level.
Mar 03, 2011
Rank: 1 / 5 (1)
The problem with this (which is what's currently done) is some will have fewer possibilities and will finish early and "starve", hence the task queuing method discussed in this article which solves that problem. Instead trying to figure out how to divide the work, the work tasks are simply shoved onto a queue and each processor works on one queue item (which could be a chunk of sub-items). When a processor finishes one queue item, it pulls another item off the queue. This way, they all stay busy until the last set of items on is removed from the queue and they all finish at almost exactly the same time.
(continued...)
Mar 03, 2011
Rank: 1 / 5 (1)
Imagine Wal-Mart with 20 register queues: You pick one and find it's slower than the others that free up. Now there's un-utilized processing power. Instead, make one line and each each register is processing one customer. The first one to free up gets the next person in the global queue. No registers are starved of tasks and no customers are stuck behind one slow customer. Some queue items will inevitably take longer to process than others, so you can't pre-allocate them effectively without knowing exactly how long they take to process. So, avoid pre-allocating them to processors and let the processors request work when they're free.
Think of a help desk application. Tickets come into the queue. As soon as an operator completes work with one ticket, they take the next one off the global queue. If these queue items are people on hold, it eliminates the caller from being unfairly held in one slow operator's queue and the whole thing is more efficient.
Mar 03, 2011
Rank: 1 / 5 (1)
Hence the purpose of not attempting it. The shared queue method resolves this and in a simple way. One of the rare win/win options. I've implemented this in some apps and it makes a big difference.
----
BTW, C# 4.0 now supports some parallelism at the language level today: _ttp://msdn.microsoft.com/en-us/library/dd460693%28VS.100%29.aspx
I don't believe the runtime is smart enough to put the tasks in queues yet, but code written with today's C# 4.0 will immediately benefit when the runtime is updated to do this. No change is necessary in your code to take advantage of that future upgrade. Of course, if you want queues today, you'll have to code it yourself, which really isn't that difficult, but it's better when the runtime handles it for you.
Mar 03, 2011
Rank: 1 / 5 (1)
Mar 03, 2011
Rank: 1 / 5 (1)
Well that is just peachy! When are you going to build that quantum comp?
Mar 03, 2011
Rank: 1 / 5 (1)
Mar 05, 2011
Rank: 1 / 5 (1)
google's new "go" language (golang) for the win! :)
Mar 06, 2011
Rank: 1 / 5 (1)