Chips as mini Internets

Apr 10, 2012 by Larry Hardesty

Computer chips have stopped getting faster. In order to keep increasing chips’ computational power at the rate to which we’ve grown accustomed, chipmakers are instead giving them additional “cores,” or processing units.

Today, a typical chip might have six or eight cores, all communicating with each other over a single bundle of wires, called a bus. With a bus, however, only one pair of cores can talk at a time. That would be a serious limitation in chips with hundreds or even thousands of cores, which many electrical engineers envision as the future of computing.

Li-Shiuan Peh, an associate professor of electrical engineering and computer science at MIT, wants cores to communicate the same way computers hooked to the Internet do: by bundling the information they transmit into “packets.” Each core would have its own router, which could send a packet down any of several paths, depending on the condition of the network as a whole.
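The article doesn't name a specific routing algorithm, but dimension-order (XY) routing is a standard, deadlock-free choice for the 2D-mesh networks on chip Peh describes. This sketch (an illustration, not the paper's implementation) shows how a packet would be relayed hop by hop through per-core routers:

```python
# Illustrative sketch only: the article doesn't specify a routing algorithm,
# so this uses dimension-order (XY) routing, a common choice for mesh NoCs.

def xy_route(src, dst):
    """Return the sequence of (x, y) routers a packet visits from src to
    dst under XY routing: travel fully along x first, then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# A packet from core (0, 0) to core (2, 1) stops at every router in between:
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Each intermediate entry in the returned path is a router the packet must pass through, which is exactly the per-hop cost the article discusses next.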

At the Design Automation Conference in June, Peh and her colleagues will present a paper she describes as “summarizing 10 years of research” on such “networks on chip.” Not only do the researchers establish theoretical limits on the efficiency of packet-switched on-chip communication networks, but they also present measurements performed on a test chip in which they came very close to reaching several of those limits.

Last stop for buses

In principle, multicore chips are faster than single-core chips because they can split up computational tasks and run them on several cores at once. Cores working on the same task will occasionally need to share data, but until recently, the core count on commercial chips has been low enough that a single bus has been able to handle the extra communication load. That’s already changing, however: “Buses have hit a limit,” Peh says. “They typically scale to about eight cores.” The 10-core chips found in high-end servers frequently add a second bus, but that approach won’t work for chips with hundreds of cores.

For one thing, Peh says, “buses take up a lot of power, because they are trying to drive long wires to eight or 10 cores at the same time.” In the type of network Peh is proposing, on the other hand, each core communicates only with the four cores nearest it. “Here, you’re driving short segments of wires, so that allows you to go lower in voltage,” she explains.

In an on-chip network, however, a packet of data traveling from one core to another has to stop at every router in between. Moreover, if two packets arrive at a router at the same time, one of them has to be stored in memory while the router handles the other. Many engineers, Peh says, worry that these added requirements will introduce enough delays and computational complexity to offset the advantages of packet switching. “The biggest problem, I think, is that in industry right now, people don’t know how to build these networks, because it has been buses for decades,” Peh says.

Forward thinking

Peh and her colleagues have developed two techniques to address these concerns. One is something they call “virtual bypassing.” In the Internet, when a packet arrives at a router, the router inspects its addressing information before deciding which path to send it down. With virtual bypassing, however, each router sends an advance signal to the next, so that it can preset its switch, speeding the packet on with no additional computation. In her group’s test chips, Peh says, virtual bypassing allowed a very close approach to the maximum data-transmission rates predicted by theoretical analysis.
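A toy latency model makes the payoff of presetting the switch concrete. The cycle counts here are assumptions for illustration, not figures from Peh's paper: suppose a baseline router spends 3 cycles per hop on route computation and switch allocation, plus 1 cycle on the link, while a fully bypassed hop costs only the link traversal.

```python
# Toy model with assumed cycle counts (not from the paper): a conventional
# router adds pipeline cycles at every hop, while virtual bypassing lets a
# preset switch pass the packet through at pure link speed.

def latency(hops, router_cycles=3, link_cycles=1, bypass=False):
    per_hop = link_cycles if bypass else router_cycles + link_cycles
    return hops * per_hop

hops = 6  # e.g. a packet crossing part of a mesh
print(latency(hops))               # 24 cycles through full router pipelines
print(latency(hops, bypass=True))  # 6 cycles when every hop is bypassed
```

In practice not every hop can be bypassed (contended routers still buffer and arbitrate), so real speedups fall between these two extremes.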

The other technique is something called low-swing signaling. Digital data consists of ones and zeroes, which are transmitted over communications channels as high and low voltages. Sunghyun Park, a PhD student advised by both Peh and Anantha Chandrakasan, the Joseph F. and Nancy P. Keithley Professor of Electrical Engineering, developed a circuit that reduces the swing between the high and low voltages from one volt to 300 millivolts. With its combination of virtual bypassing and low-swing signaling, the researchers’ test chip consumed 38 percent less energy than previous packet-switched test chips. The researchers have more work to do, Peh says, before their test chip’s power consumption gets as close to the theoretical limit as its data transmission rate does. But, she adds, “if we compare it against a bus, we get orders-of-magnitude savings.”
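A back-of-the-envelope calculation suggests why shrinking the swing saves so much link energy. Assuming, as a first-order approximation, that each wire transition dissipates roughly C * Vdd * Vswing (the exact figure depends on the driver circuit, which the article doesn't detail):

```python
# First-order estimate under an assumed energy model E ~ C * Vdd * Vswing
# per transition; the real saving depends on the driver circuit design.

def link_energy(c_wire, vdd, vswing):
    return c_wire * vdd * vswing  # joules per transition

full = link_energy(1e-13, 1.0, 1.0)   # 100 fF wire, full 1 V swing
low  = link_energy(1e-13, 1.0, 0.3)   # same wire, 300 mV swing
print(f"link energy reduced by {(1 - low / full) * 100:.0f}%")
```

Under this model the link-level saving is about 70 percent; the chip's overall 38 percent figure is smaller because routers, buffers, and logic on the die consume energy too.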

Luca Carloni, an associate professor of computer science at Columbia University who also researches networks on chip, says “the jury is always still out” on the future of chip design, but that “the advantages of packet-switched networks on chip seem compelling.” He emphasizes that those advantages include not only the operational efficiency of the chips themselves, but also “a level of regularity and productivity at design time that is very important.” And within the field, he adds, “the contributions of Li-Shiuan are foundational.”


User comments : 16


Lurker2358
2.3 / 5 (3) Apr 10, 2012
Meh.

Gave it a 4.

Anyone else notice the obvious lack of diagonals or 3-D?

I mean, if you have 100 cores and the top left one needs to communicate with the bottom right one, for some reason, then it has to go through a minimum of 18 cores to get to the one it really wants to talk to.

A solution to that might be to hybridize the chips, and run buses between some of the worst case scenarios, producing an "extended star" network. But it might not be worth it, because that might only improve performance by a few percent at best.

Now if you could "fold" a chip, and use optical interconnects, and let's say you have 5 cores by 5 cores per layer, and 5 layers, that would be 125 cores, the closest cube to 100.

Each core would be adjacent to 3 to 6 other cores, depending on where it is in the cube, and the worst case scenario for communication is 12 steps among 125 cores....vs a worst case of 18 steps among 100 cores in a 2d chip.

Average case is 6 steps vs 10 steps.
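The hop-count arithmetic in this comment is easy to check with a short script. The worst cases (18 hops for a 10x10 grid, 12 for a 5x5x5 cube) match the figures above; the exact all-pairs averages come out somewhat lower than the 10-vs-6 quoted:

```python
# Brute-force check of mesh hop counts (Manhattan distance between cores),
# verifying the worst-case figures quoted in the comment above.
from itertools import product

def mesh_hops(dims):
    """Worst-case and average hop count over all ordered core pairs
    in a mesh with the given dimensions."""
    cores = list(product(*(range(d) for d in dims)))
    dists = [sum(abs(a - b) for a, b in zip(p, q))
             for p in cores for q in cores]
    return max(dists), sum(dists) / len(dists)

print(mesh_hops((10, 10)))   # worst case 18 hops, average 6.6
print(mesh_hops((5, 5, 5)))  # worst case 12 hops, average 4.8
```

The same function handles the later 8x8-versus-4x4x4 comparison by passing different dimensions.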
Lurker2358
1 / 5 (3) Apr 10, 2012
In fact, if you could do this folding and use optical interconnects across layers, then a 7by7by7 cube would have worst case scenario 18 steps, the same as the 10by10 square. It would have average case of 9 steps, which is still less than or equal to the average case of the 10 by 10 square.

So that would be 343 cores vs 100, and the 343 core 3D processor would communicate with equal or better efficiency than the 100 core processor at all times...
baudrunner
not rated yet Apr 10, 2012
Another method of parallel computation would be to treat cores as segments and segments as templates capable of overlaying OS, application software, and even application data, on top of data on a single large platform chip. This might be how quantum computing will work, but it would work also for any multi-state qubit-like array.
Lurker2358
3 / 5 (2) Apr 10, 2012
Because people gave me negatives, I'll explain the maths, as I did this by hand for real, instead of guesstimating.

An ideal 3-d configuration of 7 cores requires 5% fewer steps to facilitate communication between all cores as compared to the ideal 2d configuration.

An ideal 3d cube of 27 cores requires 15% fewer steps to communicate between all cores as compared to a 25 core 5by5 grid in 2d...and you gained 2 extra processors to boot.

That would give 8% improvement in core power while using 15% fewer steps to communicate between all cores.

By the time you reach an 8 by 8 2d grid vs a 4by4by4 cube, each with 64 cores, the cube will be using slightly more than half as many average steps to communicate between all cores.

I realize that's not entirely related, but it's yet another demonstration of how much better 3d would be if engineers could only figure out how to do it.

These numbers are big, but they will be reality even in PCs, smart phones, and tablets in a few years.
wenmaar
3 / 5 (2) Apr 10, 2012
Why not think beyond 3D and go for 4D.
4*4*4*4=256
Or 5D 5*5*5*5*5=3125
Aloken
3 / 5 (1) Apr 10, 2012
Good luck keeping a cube made out of 100's of processor cores cool. 2D is easier to maintain in that aspect. To the guy above, if you can build 4D/5D structures you might want to patent the tech and sell it. No doubt you'd get a boatload of cash.
kaasinees
0.2 / 5 (21) Apr 10, 2012
Good luck keeping a cube made out of 100's of processor cores cool. 2D is easier to maintain in that aspect. To the guy above, if you can build 4D/5D structures you might want to patent the tech and sell it. No doubt you'd get a boatload of cash.

At IBM they already have the solution to that.
wenmaar
1 / 5 (2) Apr 10, 2012
Just take 4 cubes of 4X4 and connect them.
Or in 2D you have to make the connections in multiple layers.
Connections do not consume power so heat is no issue.

Principle is that every core in 5D will have to connect to 5 other cores.
Che2000
4 / 5 (1) Apr 10, 2012
Is there a way for the cores to communicate optically with each other?
Lurker2358
4 / 5 (1) Apr 10, 2012
Is there a way for the cores to communicate optically with each other?


they are working on that, which I mentioned above.

Several teams are trying very hard to develop on-chip lasers, plasmonics, or other on-chip devices to allow optical interconnects at every scale of architecture.

It's probably still 5 to 10 years away as a market-grade product, but prototypes of the devices have been demonstrated and this site has hosted several articles about similar devices.

Pay no attention to the guy talking about 4d or 5d, as that's absurd, non-classical physics which presumably requires access to alternate universes. No clue why he doesn't get that...
kaasinees
0.3 / 5 (22) Apr 10, 2012
IBM developed a technology that cools the CPU from within by running very small channels through the chip.
This method allows the chip to cool even if the water itself is running at 50 Celsius (this assumes of course that the chip is running at full capacity, which makes the running chip over 70 Celsius).
This can be incorporated with 3d transistors and multi layered chips.
Che2000
not rated yet Apr 10, 2012
Is there a way for the cores to communicate optically with each other?


they are working on that, which I mentioned above.

Several teams are trying very hard to develop on-chip lasers, plasmonics, or other on-chip devices to allow optical interconnects at every scale of architecture.

It's probably still 5 to 10 years away as a market-grade product, but prototypes of the devices have been demonstrated and this site has hosted several articles about similar devices.

Pay no attention to the guy talking about 4d or 5d, as that's absurd, non-classical physics which presumably requires access to alternate universes. No clue why he doesn't get that...


Your smrt. : )
Lurker2358
not rated yet Apr 10, 2012
Your smrt. : )


I just read a lot.

I probably read more articles on this site than its own administrators over the years.

So I gather a lot of general technical knowledge about many, many fields, even if I don't always grasp every detail.

Sometimes I even read articles that don't interest me, oh well.
chasehusky
5 / 5 (2) Apr 10, 2012
Several teams are trying very hard to develop on-chip lasers, plasmonics, or other on-chip devices to allow optical interconnects at every scale of architecture.


Thankfully, solutions for simultaneous communication, between at least 1000 cores, on a theorized optical network already exist (http://groups.csail.mit.edu/carbon/?page_id=62).

Also, with regards to the article, it appears that what Li-Shiuan Peh is proposing is only a slight perturbation of the idea found in the "mesh" model used in the Raw processor (http://groups.csail.mit.edu/cag/raw/documents/ieee-micro-2002.pdf) and also in Tilera's processors, which are based on the Raw architecture.
baudrunner
not rated yet Apr 11, 2012
According to my reading on this very site, it seems that a qubit array of 25 qubits square can do the job. Nobody saw this coming? Really, I only spend about a couple of hours a week on this site, and I at least know that.
hikenboot
not rated yet Apr 14, 2012
I would like to see chips that work as a giant look-up table of next chip functions to perform. They would be simpler in the fact that they would only have to perform a look-up to local memory (in chip, in core). The look-up table for emulating an intel chip would be about 2TB's in size to emulate all gates contained in the chip and would have to be small enough to sit within every single core. It would decide which chip operation to do next by using the tables to trace a virtual path through a virtual chip. This is hard to explain but something like a VRIS (Very Long Instruction Set on steroids). Some how dividing computation amongst chip cores would become simpler using some such layer of abstraction. I can picture it better than explain it. Sorry...