How to program unreliable chips

November 4, 2013 by Larry Hardesty, Massachusetts Institute of Technology

As transistors get smaller, they also become less reliable. So far, computer-chip designers have been able to work around that problem, but in the future, it could mean that computers stop improving at the rate we've come to expect.

A third possibility, which some researchers have begun to float, is that we could simply let our computers make more mistakes. If, for instance, a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice—but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency.

In anticipation of the dawning age of unreliable chips, Martin Rinard's research group at MIT's Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it's intended.

"If the hardware really is going to stop working, this is a pretty big deal for computer science," says Rinard, a professor in the Department of Electrical Engineering and Computer Science. "Rather than making it a problem, we'd like to make it an opportunity. What we have here is a … system that lets you reason about the effect of this potential unreliability on your program."

Last week, two graduate students in Rinard's group, Michael Carbin and Sasa Misailovic, presented the new system at the Association for Computing Machinery's Object-Oriented Programming, Systems, Languages and Applications conference, where their paper, co-authored with Rinard, won a best-paper award.

On the dot

The researchers' system, which they've dubbed Rely, begins with a specification of the hardware on which a program is intended to run. That specification includes the expected failure rates of individual low-level instructions, such as the addition, multiplication, or comparison of two values. In its current version, Rely assumes that the hardware also has a failure-free mode of operation—one that might require slower execution or higher power consumption.
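One way to picture such a specification is as a small table of per-instruction failure rates, plus a failure-free mode. The Python sketch below is only an illustration: the class name `HardwareSpec`, the operation names, and the failure rates are all assumptions made for this example, not Rely's actual specification format or measured figures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HardwareSpec:
    """Hypothetical hardware description: per-operation failure rates."""

    # Probability that one unreliable execution of each operation is wrong.
    # These numbers are invented for illustration.
    failure_rates: dict = field(default_factory=lambda: {
        "add": 1e-7,
        "mul": 1e-7,
        "cmp": 1e-7,
    })

    def reliability(self, op: str, unreliable: bool) -> float:
        """Probability that a single execution of `op` gives the right result."""
        if not unreliable:
            # The failure-free mode is assumed never to fail.
            return 1.0
        return 1.0 - self.failure_rates[op]


spec = HardwareSpec()
print(spec.reliability("add", unreliable=False))
print(spec.reliability("add", unreliable=True))
```

The failure-free mode is modeled here as a reliability of exactly 1.0; the cost of that mode in time or power is left out of the sketch.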

A developer who thinks that a particular program instruction can tolerate a little error simply adds a period (a "dot," in programmers' parlance) to the appropriate line of code. So the instruction "total = total + new_value" becomes "total = total +. new_value", with the dot placed directly after the operator. Where Rely encounters that telltale dot, it knows to evaluate the program's execution using the failure rates in the specification. Otherwise, it assumes that the instruction needs to be executed properly.
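To get a feel for what a dotted addition means at run time, the following Python sketch simulates one. It is not Rely code: the function names, the seeded random generator, and the failure model (flipping one low-order bit of the result) are assumptions chosen only to make the idea concrete.

```python
import random


def unreliable_add(a: int, b: int, failure_rate: float, rng: random.Random) -> int:
    """Model of `a +. b`: usually correct, occasionally corrupted."""
    result = a + b
    if rng.random() < failure_rate:
        # Assumed failure model: flip one of the low 8 bits of the result.
        result ^= 1 << rng.randrange(8)
    return result


def accumulate(values, failure_rate: float, seed: int = 0) -> int:
    """Sum `values`, using the unreliable addition on every step."""
    rng = random.Random(seed)
    total = 0
    for new_value in values:
        # Corresponds to: total = total +. new_value
        total = unreliable_add(total, new_value, failure_rate, rng)
    return total


values = list(range(100))
print(accumulate(values, failure_rate=0.0))   # no failures: the exact sum, 4950
print(accumulate(values, failure_rate=0.05))  # may drift away from the exact sum
```

With a failure rate of zero the simulated operation behaves like ordinary addition, which mirrors the undotted, failure-free case described above.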

Compilers—applications that convert instructions written in high-level programming languages like C or Java into low-level instructions intelligible to computers—typically produce what's called an "intermediate representation," a generic low-level program description that can be straightforwardly mapped onto the instruction set specific to any given chip. Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program's output.
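For a program with no branches, that folding step can be sketched as a running product of per-instruction success probabilities. The Python below is a simplified illustration, not Rely's analysis: the instruction format, the `UNRELIABLE` table and its values are hypothetical, and failures are assumed to be independent of one another.

```python
# Hypothetical per-operation reliabilities for the unreliable mode.
UNRELIABLE = {"add": 0.999, "mul": 0.998, "cmp": 0.999}


def program_reliability(instructions) -> float:
    """Multiply the success probabilities of a straight-line program.

    `instructions` is a list of (op, relaxed) pairs, where `relaxed` is True
    for instructions the developer marked with a dot.
    """
    probability = 1.0
    for op, relaxed in instructions:
        # Unmarked instructions run in the failure-free mode.
        probability *= UNRELIABLE[op] if relaxed else 1.0
    return probability


ir = [("mul", True), ("add", True), ("cmp", False), ("add", True)]
print(program_reliability(ir))
```

The product is the probability that every instruction executes correctly, which is a conservative estimate of the chance that the output is right, since some faults might not affect the final result.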

"One thing you can have in programs is different paths that are due to conditionals," Misailovic says. "When we statically analyze the program, we want to make sure that we cover all the bases. When you get the variability for a function, this will be the variability of the least-reliable path."
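The least-reliable-path rule Misailovic describes can be sketched by taking the minimum over the two branches of each conditional. In this Python illustration the program representation (floats for instructions, `("if", then_block, else_block)` tuples for conditionals) is an assumption made for the example, and the reliability of evaluating the condition itself is ignored.

```python
def reliability(block) -> float:
    """Worst-case reliability of a block of statements.

    A block is a list whose items are either a float (the success
    probability of one instruction) or a tuple ("if", then_block, else_block).
    """
    result = 1.0
    for stmt in block:
        if isinstance(stmt, tuple) and stmt[0] == "if":
            _, then_block, else_block = stmt
            # Either branch may run, so keep the less reliable one.
            result *= min(reliability(then_block), reliability(else_block))
        else:
            result *= stmt
    return result


program = [
    0.999,                          # one relaxed instruction
    ("if", [0.999, 0.999], [1.0]),  # relaxed branch vs. failure-free branch
    0.998,                          # another relaxed instruction
]
print(reliability(program))
```

Because the analysis is static, it cannot know which branch a given run will take, so the figure it reports is a bound that holds for every possible execution.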

"There's a fair amount of sophisticated reasoning that has to go into this because of these kind of factors," Rinard adds. "It's the difference between reasoning about any specific execution of the program where you've just got one single trace and all possible executions of the program."

Trial runs

The researchers tested their system on several benchmark programs standard in the field, using a range of theoretically predicted failure rates. "We went through the literature and found the numbers that people claimed for existing designs," Carbin says.

With the existing version of Rely, a programmer who finds that permitting a few errors yields an unacceptably low probability of success can go back and tinker with his or her code, removing dots here and there and adding them elsewhere. Re-evaluating the code, the researchers say, generally takes no more than a few seconds.

But in ongoing work, they're trying to develop a version of the system that allows the programmer to simply specify the accepted failure rate for whole blocks of code: say, pixels in a frame of video need to be decoded with 97 percent reliability. The system would then go through and automatically determine how the code should be modified to both meet those requirements and maximize either power savings or speed of execution.
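The article does not say how that automatic search would work. As a rough back-of-the-envelope aid, the Python sketch below answers a much simpler question: given a target such as 97 percent and a single assumed per-operation reliability, how many relaxed operations can a block contain before it falls below the target? The function name and the 0.999 figure are hypothetical, and failures are assumed independent.

```python
import math


def max_relaxed_ops(target: float, op_reliability: float) -> int:
    """Largest n such that op_reliability ** n >= target."""
    if not (0.0 < target <= 1.0 and 0.0 < op_reliability < 1.0):
        raise ValueError("expected 0 < target <= 1 and 0 < op_reliability < 1")
    n = int(math.log(target) / math.log(op_reliability))
    # Guard against floating-point rounding at the boundary.
    while op_reliability ** n < target:
        n -= 1
    while op_reliability ** (n + 1) >= target:
        n += 1
    return n


print(max_relaxed_ops(target=0.97, op_reliability=0.999))
```

A real tool would also have to weigh which instructions yield the largest power or speed savings, which this calculation does not attempt.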

"This is a foundation result, if you will," says Dan Grossman, an associate professor of computer science and engineering at the University of Washington. "This explains how to connect the mathematics behind reliability to the languages that we would use to write code in an unreliable environment."

Grossman believes that for some applications, at least, it's likely that chipmakers will move to unreliable components in the near future. "The increased efficiency in the hardware is very, very tempting," Grossman says. "We need software work like this work in order to make that hardware usable for software developers."


More information: Paper (PDF): "Verifying Quantitative Reliability for Programs That Execute on Unreliable Hardware"
