Researchers unveil experimental 36-core chip

Jun 23, 2014 by Larry Hardesty
The MIT researchers' new 36-core chip is "tiled," meaning that it simply repeats the same circuit layout 36 times. Tiling makes multicore chips much easier to design.

The more cores—or processing units—a computer chip has, the bigger the problem of communication between cores becomes. For years, Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, has argued that the massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size.

This week, at the International Symposium on Computer Architecture, Peh's group unveiled a 36-core chip that features just such a "network-on-chip." In addition to implementing many of the group's earlier ideas, it also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores' locally stored copies of globally accessible data remain up to date.

In today's chips, all the cores—typically somewhere between two and six—are connected by a single wire, called a bus. When two cores need to communicate, they're granted exclusive access to the bus.

But that approach won't work as the core count mounts: Cores will spend all their time waiting for the bus to free up, rather than performing computations.

In a network-on-chip, each core is connected only to those immediately adjacent to it. "You can reach your neighbors really quickly," says Bhavya Daya, an MIT graduate student in electrical engineering and computer science, and first author on the new paper. "You can also have multiple paths to your destination. So if you're going way across, rather than having one congested path, you could have multiple ones."
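Daya's point about neighbors and multiple paths can be made concrete with a small sketch. The Python model below of a 6-by-6 mesh is purely illustrative: the function names, the grid indexing, and the dimension-ordered ("X then Y") routing scheme are our assumptions, not details from the paper. It shows that each core connects only to its immediate neighbors, yet any core can still reach any other in a bounded number of hops.

```python
# Illustrative 6x6 mesh (36 cores), numbered row by row from 0 to 35.
# The real chip's router is more sophisticated; this only sketches the topology.

def neighbors(core, width=6, height=6):
    """Cores directly wired to `core` in a 2-D mesh."""
    x, y = core % width, core // width
    adj = []
    if x > 0:
        adj.append(core - 1)          # west neighbor
    if x < width - 1:
        adj.append(core + 1)          # east neighbor
    if y > 0:
        adj.append(core - width)      # north neighbor
    if y < height - 1:
        adj.append(core + width)      # south neighbor
    return adj

def xy_route(src, dst, width=6):
    """One shortest path: travel along x first, then along y."""
    path = [src]
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    while x != dx:
        x += 1 if dx > x else -1
        path.append(y * width + x)
    while y != dy:
        y += 1 if dy > y else -1
        path.append(y * width + x)
    return path
```

Routing from corner to corner (core 0 to core 35) takes ten hops through intermediate routers, and because only the total x- and y-distances are fixed, many equally short paths exist, which is the congestion-avoidance property Daya describes.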

Get snoopy

One advantage of a bus, however, is that it makes it easier to maintain cache coherence. Every core on a chip has its own cache, a local, high-speed memory bank in which it stores frequently used data. As it performs computations, it updates the data in its cache, and every so often, it undertakes the relatively time-consuming chore of shipping the data back to main memory.

But what happens if another core needs the data before it's been shipped? Most chips address this question with a protocol called "snoopy," because it involves snooping on other cores' communications. When a core needs a particular chunk of data, it broadcasts a request to all the other cores, and whichever one has the data ships it back.
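The broadcast-and-reply pattern can be sketched in a few lines. This is a minimal illustration only: caches are modeled as plain dictionaries, the `snoop_read` helper is hypothetical, and real snoopy protocols (MESI and its relatives) also track per-line states that are omitted here.

```python
# Minimal sketch of a snoopy read: the requester "broadcasts," and whichever
# cache holds the line ships it back. caches is a list of dicts, one per core,
# each mapping a memory address to a value.

def snoop_read(addr, requester, caches):
    """Return the value for addr, copying it into the requester's cache."""
    for core, cache in enumerate(caches):
        if core != requester and addr in cache:
            value = cache[addr]              # the owning core ships the data back
            caches[requester][addr] = value  # requester now holds a local copy
            return value
    return None                              # miss everywhere: fall back to main memory
```

On a shared bus, the loop above is effectively what every core performs in hardware, and the bus guarantees that all cores observe the requests in the same order.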

If all the cores share a bus, then when one of them receives a data request, it knows that it's the most recent request that's been issued. Similarly, when the requesting core gets data back, it knows that it's the most recent version of the data.

But in a network-on-chip, data is flying everywhere, and packets will frequently arrive at different cores in different sequences. The implicit ordering that the snoopy protocol relies on breaks down.

Imposing order

Daya, Peh, and their colleagues solve this problem by equipping their chips with a second network, which shadows the first. The circuits connected to this network are very simple: All they can do is declare that their associated cores have sent requests for data over the main network. But precisely because those declarations are so simple, nodes in the shadow network can combine them and pass them on without incurring delays.

Groups of declarations reach the routers associated with the cores at discrete intervals—intervals corresponding to the time it takes to pass from one end of the shadow network to another. Each router can thus tabulate exactly how many requests were issued during which interval, and by which other cores. The requests themselves may still take a while to arrive, but their recipients know that they've been issued.
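A minimal model of that bookkeeping, assuming (our simplification) that declarations are just core IDs gathered into a per-interval set, might look like the following; the class and method names are illustrative, not from the paper.

```python
# Sketch of the shadow network's bookkeeping as the article describes it:
# each router learns, once per fixed interval, which cores issued coherence
# requests, before the requests themselves arrive over the main network.

class ShadowTally:
    def __init__(self):
        self.pending = set()   # cores that have declared a request this interval

    def declare(self, core):
        """A core's simple shadow circuit announces 'I sent a request.'"""
        self.pending.add(core)

    def close_interval(self):
        """At the interval boundary, every router sees the same tally."""
        tally, self.pending = self.pending, set()
        return frozenset(tally)
```

Because the declarations carry no payload, combining them is cheap, which is why (per the article) the shadow network can deliver the tally without incurring delays.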

During each interval, the chip's 36 cores are given different, hierarchical priorities. Say, for instance, that during one interval, both core 1 and core 10 issue requests, but core 1 has a higher priority. Core 32's router may receive core 10's request well before it receives core 1's. But it will hold core 10's request until it has passed along core 1's.

This hierarchical ordering simulates the chronological ordering of requests sent over a bus, so the snoopy protocol still works. The hierarchy is shuffled during every interval, however, to ensure that in the long run, all the cores receive equal weight.
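One way to realize such a shuffled hierarchy, assuming a simple rotation of priorities by interval number (the chip's actual shuffling scheme may differ), is:

```python
# Illustrative ordering rule: within an interval, requests are serviced in a
# priority order that every router computes identically, and the order rotates
# each interval so that all cores receive equal weight in the long run.
# The rotation-by-interval scheme here is our assumption, not the paper's.

def service_order(requesters, interval, n_cores=36):
    """Order in which every router processes this interval's requests."""
    rank = {core: (core - interval) % n_cores for core in range(n_cores)}
    return sorted(requesters, key=rank.__getitem__)
```

With the article's example, in interval 0 cores 1 and 10 are serviced as `[1, 10]`; five intervals later the rotation has moved core 10 ahead, so the same pair would be serviced as `[10, 1]`. Since every router derives the same ranking from the interval number alone, no extra coordination traffic is needed.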

Proof, pudding

Cache coherence in multicore chips "is a big problem, and it's one that gets larger all the time," says Todd Austin, a professor of electrical engineering and computer science at the University of Michigan. "Their contribution is an interesting one: They're saying, 'Let's get rid of a lot of the complexity that's in existing networks. That will create more avenues for communication, and our clever communication protocol will sort out all the details.' It's a much simpler approach and a faster approach. It's a really clever idea."

"One of the challenges in academia is convincing industry that our ideas are practical and useful," Austin adds. "They've really taken the best approach to demonstrating that, in that they've built a working chip. I'd be surprised if these technologies didn't find their way into commercial products."

After testing the prototype chips to ensure that they're operational, Daya intends to load them with a version of the Linux operating system, modified to run on 36 cores, and evaluate the performance of real applications, to determine the accuracy of the group's theoretical projections. At that point, she plans to release the blueprints for the chip, written in the hardware description language Verilog, as open-source code.

More information: The paper is available online: projects.csail.mit.edu/wiki/pu… scorpio_isca2014.pdf




User comments: 7

verkle
1 / 5 (3) Jun 23, 2014
In today's chips, all the cores—typically somewhere between two and six—are connected by a single wire, called a bus....


A little disinformation here. Today's chips are typically 2 and 4, and then next generation will generally have 8, not 6.

Also, buses are not a single wire, but many (up to hundreds of) parallel wires moving data around at gigahertz speeds.

Anyway, good research on using many cores. This will be a very important area of research for the next 10 years.

Doiea
Jun 23, 2014
This comment has been removed by a moderator.
kelman66
5 / 5 (2) Jun 23, 2014
[quoting verkle's comment above]

Six cores are part of the current generation. I have a six core machine.
evropej
not rated yet Jun 23, 2014
How is this concept different from those in GPUs?
TheGhostofOtto1923
3.7 / 5 (3) Jun 23, 2014
[quoting verkle: "Today's chips are typically 2 and 4, and then next generation will generally have 8, not 6."]
FX-Series: quad-, 6-, and 8-core desktop processors (e.g., AMD FX-6100 6-Core Processor, 3.3 GHz, Socket AM3+).
Opteron: dual-, quad-, 6-, 8-, 12-, and 16-core server/workstation processors.
Phenom II: dual-, triple-, quad-, and 6-core desktop processors.
[quoting kelman66: "Six cores are part of the current generation. I have a six core machine."]
Yeah verkle likes to compose his own disinformation.

So - can moores law be appropriately applied to the # of cores on a chip now?
gwrede
not rated yet Jun 23, 2014
[quoting TheGhostofOtto1923: "So - can moores law be appropriately applied to the # of cores on a chip now?"]
Yes. But that doesn't mean that computers will get that much faster. Rather, it depends on the task.

Corporate computing and especially servers and cloud centers will stand to win, but home computers, tablets and smart phones may not win. It is usual for a user to do one task only, and even if they have other windows open in the background, there simply may not be more than half a dozen active processes any instant. (Of course, there probably are tens of sleeping processes, but most of them are sleeping for reasons other than waiting for the processor, at a given instant. (And if they really are all waiting for the processor, then a programmer needs to get fired.))

For example, when I am writing this text, there is no way I can use 36 cores, because you cannot distribute a single, serial task.
FainAvis
not rated yet Jun 24, 2014
@doiea "They're already optimized to graphical (pixel and vector) operations, which may be both advantage, both disadvantage in many applications" [sic] Where did you get that grammar?
TheGhostofOtto1923
1 / 5 (1) Jun 24, 2014
[quoting FainAvis's comment above]
He gots it from the 3rd world where they all understand everybody very well.
