Bug repellent for supercomputers proves effective

Nov 14, 2012 by Anne M Stark
Bug repellent for supercomputers proves effective

(Phys.org)—Lawrence Livermore National Laboratory (LLNL) researchers have used the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight tool to debug a program running more than one million MPI processes on the IBM Blue Gene/Q (BGQ)-based Sequoia supercomputer.

The debugging tool is a significant in LLNL's multi-year collaboration with the University of Wisconsin (UW), Madison and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.

Playing a significant role in scaling up the supercomputer, STAT, a 2011 R&D 100 Award winner, has helped both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that only manifested at extremely large scales up to 1,179,648 compute cores. During the Sequoia scale-up, bugs in applications as well as defects in system software and hardware have manifested themselves as failures in applications. It is important to quickly diagnose errors so they can be reported to experts who can analyze them in detail and ultimately solve the problem.

"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.

"While testing a subsystem of Blue/Gene Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.

Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said. "Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."

Sequoia delivers 20 petaflops of peak power and was ranked No. 1 in June of this year's TOP500 list. It is currently ranked No. 2, behind Oak ridge National Laboratory's Titan.

LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing. Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.

As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture. This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.

"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.

STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory. The tool also has been used at other sites and proved to be effective on a wide range of platforms, including Linux clusters and Cray systems.

The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.

Explore further: UT Dallas professor to develop framework to protect computers' cores

Related Stories

IBM To Build Supercomputer For U.S. Government

Feb 03, 2009

(PhysOrg.com) -- The U.S. Government has contracted out IBM to build a massive supercomputer bigger than any supercomputer out there. The supercomputer system, called Sequoia, will be capable of delivering ...

Predictive simulation successes on Dawn supercomputer

Sep 30, 2009

(PhysOrg.com) -- The 500-teraFLOPS Advanced Simulation and Computing program's Sequoia Initial Delivery System (Dawn), an IBM machine of the same lineage as BlueGene/L, has immediately proved itself useful ...

Cray supercomputer named world's fastest

Nov 12, 2012

A Cray supercomputer at the US government's Oak Ridge National Laboratory was named Monday the world's fastest, overtaking an IBM supercomputer at another American research center.

Recommended for you

User comments : 0

More news stories

Ex-Apple chief plans mobile phone for India

Former Apple chief executive John Sculley, whose marketing skills helped bring the personal computer to desktops worldwide, says he plans to launch a mobile phone in India to exploit its still largely untapped ...

Airbnb rental site raises $450 mn

Online lodging listings website Airbnb inked a $450 million funding deal with investors led by TPG, a source close to the matter said Friday.

Health care site flagged in Heartbleed review

People with accounts on the enrollment website for President Barack Obama's signature health care law are being told to change their passwords following an administration-wide review of the government's vulnerability to the ...

A homemade solar lamp for developing countries

(Phys.org) —The solar lamp developed by the start-up LEDsafari is a more effective, safer, and less expensive form of illumination than the traditional oil lamp currently used by more than one billion people ...

NASA's space station Robonaut finally getting legs

Robonaut, the first out-of-this-world humanoid, is finally getting its space legs. For three years, Robonaut has had to manage from the waist up. This new pair of legs means the experimental robot—now stuck ...

Filipino tests negative for Middle East virus

A Filipino nurse who tested positive for the Middle East virus has been found free of infection in a subsequent examination after he returned home, Philippine health officials said Saturday.

Egypt archaeologists find ancient writer's tomb

Egypt's minister of antiquities says a team of Spanish archaeologists has discovered two tombs in the southern part of the country, one of them belonging to a writer and containing a trove of artifacts including reed pens ...