Dude, where's my code?

Oct 16, 2013 by Larry Hardesty
Dude, where’s my code?

Compilers are computer programs that translate high-level instructions written in human-readable languages like Java or C into low-level instructions that machines can execute. Most compilers also streamline the code they produce, modifying algorithms specified by programmers so that they'll run more efficiently.

Sometimes that means simply discarding lines of code that appear to serve no purpose. But as it turns out, compilers can be overaggressive, dispensing not only with functional code but also with code that actually performs vital security checks.

At the ACM Symposium on Operating Systems Principles in November, MIT researchers will present a new system, dubbed Stack, that automatically combs through ' code, identifying just those lines that compilers might discard but which could, in fact, be functional. Although the paper hasn't appeared yet, commercial software engineers have already downloaded Stack and begun using it, with encouraging results.

As strange as it may seem to nonprogrammers—or people whose only experience with coding is on small, tightly managed projects—large commercial programs are frequently full of instructions that will never be executed, known as "dead code." When hundreds of developers are working on an application with millions of lines of code that have been continually revised for decades, one of them may well end up inserting a seemingly innocuous condition that ensures that a function thousands of lines away, written by someone else, never gets executed. Dead code is ubiquitous, and compilers should remove it.

Problems arise when compilers also remove code that leads to "undefined behavior." "For some things this is obvious," says Frans Kaashoek, the Charles A. Piper Professor in the Department of Electrical Engineering and Computer Science (EECS). "If you're a programmer, you should not write a statement where you take some number and divide it by zero. You never expect that to work. So the compiler will just remove that. It's pointless to execute it anyway, because there's not going to be any sensible result."

Defining moments

Over time, however, "compiler writers got a little more aggressive," Kaashoek says. "It turns out that the C programming language has a lot of subtle corners to the language specification, and there are things that are undefined behavior that most programmers don't realize are undefined behavior."

A classic example, explains Xi Wang, a graduate student in EECS and first author on the new paper, is the assumption that if a program attempts to store too large a number at a memory location reserved for an integer, the computer will lop off the bits that don't fit. "In machines, integers have a limit," Wang says. "Whenever you exceed that limit, the input value basically wraps around to a smaller value."

Seasoned C programmers will actually exploit this behavior to verify that program inputs don't exceed some threshold. Rather than writing a line of code that, say, compares the sum of two numbers to the known threshold for an integer ("if a + b < int_max"), they'll check to see whether the sum of the numbers is smaller than one of the addends ("if a + b > a")—whether, that is, the summation causes the integer to wrap around to a smaller value.

According to Wang, programmers give a range of explanations for this practice. Some say that the intent of the comparison—an overflow check—is clearer if they use integer wraparound; others say that the wraparound comparison executes more efficiently than the more conventional comparison; and some maintain that it avoids cluttering up their code with unneeded terminology (like "int_max"). But whatever the reason, while the wraparound check works fine with unsigned integers—integers that are always positive—it is, according to the C language specification, undefined for signed integers—integers that can be either positive or negative.

As a consequence, some C compilers will simply discard the wraparound comparison. And sometimes, that can mean dispensing with a security check that guarantees the program's proper execution.

The fine print

Complicating things further is the fact that different compilers will dispense with different undefined behaviors: Some might permit wraparound checks but prohibit other programming shortcuts; some might impose exactly the opposite restrictions.

So Wang combed through the C language specifications and identified every undefined behavior that he and his coauthors—Kaashoek and his fellow EECS professors Nickolai Zeldovich and Armando Solar-Lezama—imagined that a programmer might ever inadvertently invoke. Stack, in effect, compiles a program twice: once just looking to excise dead code, and a second time to excise dead code and undefined behavior. Then it identifies all the code that was cut the second time but not the first and warns the programmer that it could pose problems.

The MIT researchers tested their system on several open-source programs. In one case, the developers of a program that performs database searches refused to believe that their code had bugs, even after they'd examined the instructions flagged by Stack. "Xi sent them a one-line SQL statement that basically crashed their [application], by exploiting their 'correct' code," Kaashoek says.

Mattias Engdegård, an engineer at Intel, is one of the developers who found Stack online and has already applied it to his company's code. "Stack is very carefully designed to have a very low false-positive ratio," Engdegård says. Nonetheless, "it found some errors that no other static-analysis tool had found before," he says, resulting in "one or two dozens of instances of changes."

"This could be some kind of harbinger of things to come," Engdegård adds. "I think static analyzers are going to focus on these sort of things in the future."

The paper is titled "Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior."

Explore further: Singing the same tune: Scientists develop novel ways of separating birdsong sources

More information: pdos.csail.mit.edu/~xi/papers/stack-sosp13.pdf

Related Stories

New programming language to plug information leaks in software

Nov 23, 2011

The current method for preventing users and unauthorised individuals from obtaining information to which they should not have access in data programs is often to have code reviewers check the code manually, looking for potential ...

An oracle for object-oriented programmers

Oct 07, 2011

In the last 40 years, the major innovation in software engineering has been the development of what are called object-oriented programming languages. “Objects” are, effectively, repositories for ...

Recommended for you

Tesla loss widens as it ramps up expansion plan

2 minutes ago

US electric automaker Tesla Motors reported Thursday a widening loss in the past quarter amid record revenues as it ramped up plans for a giant battery plant for future vehicles.

CIA director reverses himself on Senate spying

26 minutes ago

For months, CIA Director John Brennan had stood firm in his insistence that the CIA had little to be ashamed of after searching the computers of the Senate Intelligence Committee. His defiant posture quickly ...

Tesla says decision on battery factory months away

32 minutes ago

(AP)—Electric car maker Tesla Motors said Thursday that it is preparing a site near Reno, Nevada, as a possible location for its new battery factory, but is still evaluating other sites.

Taking great ideas from the lab to the fab

11 hours ago

A "valley of death" is well-known to entrepreneurs—the lull between government funding for research and industry support for prototypes and products. To confront this problem, in 2013 the National Science ...

User comments : 5

Adjust slider to filter visible comments by rank

Display comments: newest first

axemaster
5 / 5 (3) Oct 16, 2013
But whatever the reason, while the wraparound check works fine with unsigned integers—integers that are always positive—it is, according to the C language specification, undefined for signed integers—integers that can be either positive or negative.
As a consequence, some C compilers will simply discard the wraparound comparison. And sometimes, that can mean dispensing with a security check that guarantees the program's proper execution.

Wow! I've written some programs in C, and I never heard of that before. Fortunately I've always used unsigned ints for my wraparounds, but imagine the confusion that would result from the compiler just throwing out the code... Horrible.
wiyosaya
not rated yet Oct 16, 2013
If I am not mistaken, most compilers will issue a warning (based on a setting at least) if they are doing something like this. The trouble in that case is that there are programmers out there that ignore warnings because they code in a manner that produces bogus warnings. So, I say if you miss the fact that the compiler is removing code that you think is doing something, then set your compiler settings properly.
antialias_physorg
1 / 5 (1) Oct 16, 2013
("if a + b > a")

If you think about writing code like this (wraparound checks) then you should probably take up another hobby (most certainly another profession). Those kind of hacks may sound cool when you tell them to others but using them in producton code is just awful. Using side-effects as decision criteria is a big no-no on any project I've ever worked on.
baudrunner
1 / 5 (7) Oct 16, 2013
I wouldn't hire a programmer who is confused about the use of data types. Unless the app is intensely calculation specific (and that would mean hiring technically experienced programmers who know their stuff) it is not relevant anymore today because of processing speed and power, and faster apps can be produced by simply using int() or truncating decimal places. Granted, error checking is important, but the less it can be used, the better.
Ro_Man
1 / 5 (1) Oct 16, 2013
Thus, a subroutine that changed the value of one of its parameters could silently change a literal constant.

C
C CHANGE THE VALUE OF 5
C

CALL TWEAK (5)
WRITE (10, 30) 5
30 FORMAT (I5)
END

C
C SUBROUTINE TWEAK:
C

SUBROUTINE TWEAK (I)
I = I + 1
END
Run this program on an old-enough FORTRAN compiler and it will output:

00006

And from here, maybe you test whether 5 is equal to 4 plus 1 and if not, then you "fix" your constant. Typical nerdish code. The program works until you run it with the new optimizing compiler.