Nuclear weapon simulations show performance in molecular detail

Jun 05, 2012 by Emil Venere
Employees at Lawrence Livermore National Laboratory work on a high-performance computer. Purdue researchers have collaborated with the national laboratory, using a similar high-performance computer to improve simulations that show a nuclear weapon's performance in precise molecular detail. Photo courtesy of Lawrence Livermore National Laboratory

U.S. researchers are perfecting simulations that show a nuclear weapon's performance in precise molecular detail, tools that are becoming critical for national defense because international treaties forbid the detonation of nuclear test weapons.

The simulations must run on supercomputers containing thousands of processors, but doing so has posed reliability and accuracy problems, said Saurabh Bagchi, an associate professor in Purdue University's School of Electrical and Computer Engineering.

Now researchers at Purdue and high-performance computing experts at the National Nuclear Security Administration's (NNSA) Lawrence Livermore National Laboratory have solved several problems hindering the use of the ultra-precise simulations. NNSA is the quasi-independent agency within the U.S. Department of Energy that oversees the nation's nuclear security activities.

The simulations, which are needed to more efficiently certify the nation's nuclear stockpile, may require 100,000 machines, a level of complexity that is essential to accurately show molecular-scale reactions taking place over milliseconds, or thousandths of a second. The same types of simulations also could be used in areas such as climate modeling and studying the dynamic changes in a protein's shape.

Such highly complex jobs must be split into many processes that execute in parallel on separate machines in large computer clusters, Bagchi said.

"Due to natural faults in the execution environment there is a high likelihood that some processing element will have an error during the application's execution, resulting in corrupted memory or failed communication between machines," Bagchi said. "There are bottlenecks in terms of communication and computation."

These errors compound the longer a simulation runs before the glitch is detected, and they can cause simulations to stall or crash altogether.

"We are particularly concerned with errors that corrupt data silently, possibly generating incorrect results with no indication that the error has occurred," said Bronis R. de Supinski, co-leader of the ASC Application Development Environment Performance Team at Lawrence Livermore. "Errors that significantly reduce system performance are also a major concern since the systems on which the simulations run are very expensive."

Advanced Simulation and Computing is the computational arm of NNSA's Stockpile Stewardship Program, which ensures the safety, security and reliability of the nation's nuclear deterrent without underground testing.

New findings will be detailed in a paper to be presented during the Annual IEEE/IFIP International Conference on Dependable Systems and Networks from June 25-28 in Boston. Recent research findings were detailed in two papers last year, one presented during the IEEE Supercomputing Conference and the other during the International Symposium on High-Performance Parallel and Distributed Computing.

The researchers have developed automated methods to detect a glitch soon after it occurs.

"You want the system to automatically pinpoint when and in what machine the error took place and also the part of the code that was involved," Bagchi said. "Then, a developer can come in, look at it and fix the problem."

One bottleneck arises because data from the many machines stream to a central server.

"Streaming data to a central server works fine for a hundred machines, but it can't keep up when you are streaming data from a thousand machines," said Purdue doctoral student Ignacio Laguna, who worked with Lawrence Livermore computer scientists. "We've eliminated this central brain, so we no longer have that bottleneck."

Each machine in the supercomputer cluster contains several cores, or processors, and each core might run one "process" during simulations. The researchers created an automated method for "clustering," or grouping the large number of processes into a smaller number of "equivalence classes" with similar traits. Grouping the processes into equivalence classes makes it possible to quickly detect and pinpoint problems.
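The idea can be sketched in a few lines of code (an illustration only, not the researchers' implementation; the trait signature and names below are hypothetical): processes that report similar behavior collapse into a single class, while a process whose behavior matches almost none of its peers stands out immediately.

```python
# Illustrative sketch only (not the Purdue/LLNL code): group processes
# into "equivalence classes" by a coarse behavioral signature and flag
# the classes that contain very few members.
from collections import defaultdict

def equivalence_classes(process_traits):
    """process_traits maps a process rank to a hashable trait tuple,
    e.g. (current function, messages-sent bucket)."""
    classes = defaultdict(list)
    for rank, traits in process_traits.items():
        classes[traits].append(rank)
    return classes

def suspicious_ranks(classes, min_class_size=2):
    """Processes in very small classes behave unlike their peers and
    are the first candidates to inspect for a fault."""
    return [rank
            for ranks in classes.values() if len(ranks) < min_class_size
            for rank in ranks]

# Six processes; rank 3 is stuck waiting while the others keep solving.
traits = {0: ("solve", 10), 1: ("solve", 10), 2: ("solve", 10),
          3: ("mpi_wait", 0), 4: ("solve", 10), 5: ("solve", 10)}
print(suspicious_ranks(equivalence_classes(traits)))  # -> [3]
```

Comparing each process against a handful of class signatures, rather than against every other process, is what makes this kind of grouping practical at large scale.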

"The recent breakthrough was to be able to scale up the clustering so that it works with a large supercomputer," Bagchi said.

Lawrence Livermore computer scientist Todd Gamblin came up with the scalable clustering approach.

A lingering bottleneck in using the simulations is related to a procedure called checkpointing, or periodically storing data to prevent its loss in case a machine or application crashes. The information is saved in a file called a checkpoint and stored in a parallel system distant from the machines on which the application runs.

"The problem is that when you scale up to 10,000 machines, this parallel file system bogs down," Bagchi said. "It's about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers."

Doctoral student Tanzima Zerin and Rudolf Eigenmann, a professor of electrical and computer engineering, along with Bagchi, led work to develop a method for compressing the checkpoints, similar to the compression of data for images.

"We're beginning to solve the checkpointing problem," Bagchi said. "It's not completely solved, but we are getting there."

The checkpointing bottleneck must be solved in order for researchers to create supercomputers capable of "exascale computing," or 1,000 quadrillion operations per second.

"It's the Holy Grail of supercomputing," Bagchi said.


More information: Automatic Fault Characterization via Abnormality-Enhanced Classification, by Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, and Bronis R. de Supinski.

ABSTRACT
Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks. As these machines grow in scale, faults become more frequent and system complexity makes it difficult to detect and diagnose them. The difficulty is particularly large for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high since they significantly reduce system productivity, both initially and through the time required to resolve them. Current system management techniques do not work well since they require manual examination of system behavior and do not identify root causes. When a fault is manifested, system administrators need timely notification about the type of fault, the time period in which it occurred and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.

User comments

Eikka
1 / 5 (1) Jun 05, 2012
It reminds me of a similar problem with data storage.

There's a probability of a bit error per certain amount of data copied between the drives in a RAID array, and when the drives are large enough you can pretty much guarantee that you cannot recover from a drive failure with your data intact. Any time you read the redundant copy from start to finish, you get at least one false bit, and the redundant copy itself is likely to have at least one false bit in it to begin with, so you can't just read it multiple times.

To solve the problem, you have to add more drives so you can compare at least two imperfect copies and compose the correct data, or, slow down the performance of the array to double-check everything all the time by adding an extra layer of error correction code.
chardo137
1.9 / 5 (9) Jun 05, 2012
Great, let's use supercomputers to figure out how to kill people. Just think of all the good things that could be learned with this much computing power.
antialias_physorg
not rated yet Jun 05, 2012
It's relatively easy to compensate for errors in data storage and retrieval.
RAID is only one option. You can also fiddle around with Viterbi codes and codes with various Hamming distances if you want self-correcting bitstreams robust to arbitrary numbers of bit errors per block, until you're satisfied that the chance of n bit errors per block is so unlikely as to be negligible (though this will make your bitstream bigger, it is worth it if you're dealing with sensitive data or channels with well-characterized noise types). Or at the very least you can use parity bits/CRCs to know when you have received faulty bits and need to re-retrieve the data.

In calculations that's not really possible since you don't know the results in order to add such methods (otherwise you wouldn't need to do the computation in the first place).

Beyond basic sanity checks, clustering (as per the article) and multiple redundancy in computation there's not much you can do.
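For the storage side, something as small as a checksum is enough to notice a flipped bit (illustrative snippet, values made up):

```python
# A CRC stored next to the data detects a later bit flip -- but only
# because the original bytes were known when the checksum was computed.
import zlib

block = bytearray(b"checkpoint block 42: velocity field")
stored_crc = zlib.crc32(block)       # saved alongside the block

block[5] ^= 0x01                     # a single bit flips in storage/transit
if zlib.crc32(block) != stored_crc:
    print("bit error detected; re-read or reconstruct this block")
```

For a freshly computed result there is no such reference value to check against, which is the whole problem.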
Valentiinro
1 / 5 (3) Jun 05, 2012
On the subject of nuclear weapons:
If you're not gonna use em, can't even test em, why make em?

If the only thing you're going to do is calculate one is stronger than the other so you can bring a number to the next international dick measuring contest, who cares if the calculation is a little off? People lie about that sort of thing all the time!
kaasinees
2.6 / 5 (5) Jun 05, 2012
I doubt they used physical data storage, it is very slow and requires more cycles.

Rather they used ECC memory, which comes with error trapping.

edit:

"The problem is that when you scale up to 10,000 machines, this parallel file system bogs down," Bagchi said. "It's about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers."


Sheesh, talk about design flaws.
ccheval
3.7 / 5 (3) Jun 05, 2012
I think it is a brilliant solution. No strontium-90 and other unfriendly by-products created. Barack Obama said in a speech we have 5113 nuclear weapons, way down from the 70s. One doesn't just "make one" and it stays stable. Different metals in intimate contact, moving parts, sub-critical fission occurring continuously - these things are higher maintenance than Gisele. Got to love the cyber-modeling solution and the robust efforts to hone its accuracy. The very exercise will significantly advance computer modeling. Modeling fission then fusion through time by the molecule? Wow!
Sadly Chardo137, we can't eliminate nuclear arms entirely but scaling back further is a worthwhile compromise and worth pursuing. With today's political realities, throwing our stockpile completely away seems rash and more than a little bit foolish.
dtxx
3.7 / 5 (3) Jun 05, 2012
It reminds me of a similiar problem with data storage.

There's a probability of a bit error per certain amount of data copied between the drives in a RAID array...


I'm sorry, but based on both my empirical experience and education I have to disagree. If you can provide me supporting information I would love to see it, as I deal with these issues regularly. Parity striping takes the factors you mention into account. Also sorry to cut your quote short, but I needed the characters.
Higher Intelligence Agency
1 / 5 (4) Jun 06, 2012
Amazing how the fascination of such computing technology can make us completely forget about what this technology is actually being used for, isn't it?
infinite_energy
1 / 5 (2) Jun 06, 2012
Let's say the supercomputer computes sqrt(78234897123.8242391*2989912048.2453)=???
How could you possibly know the result is wrong?
If some LEDs stop blinking (or change color), yes, you know there is a fault someplace, but otherwise...? And what if only the LED is at fault?
antialias_physorg
not rated yet Jun 06, 2012
Let's say the supercomputer computes sqrt(78234897123.8242391*2989912048.2453)=???

Depending on where your bit flip is, that is relatively easy (the more significant the bit, the easier it is to write a sanity check for such a case).
Otherwise you just do it the way it's done on rockets: have three independent systems and check the results against each other, "taking identical two out of three" if they disagree or redoing the calculations when all disagree.
(While preferably running on machines of different types and different operating systems, so as to eliminate systemic errors that occur across all systems.)
While it is not unlikely that there will be the occasional error, it is extremely unlikely that errors in separate cores will produce the same erroneous result.
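In code, that voting looks roughly like this (illustrative sketch; in practice the three runs would execute on separate machines):

```python
# Rough sketch of triple modular redundancy: accept a result that at
# least two of three independent runs agree on, otherwise redo the work.
import math

def vote(results, rel_tol=1e-12):
    """Return a value at least two results agree on, else None."""
    for candidate in results:
        agree = sum(math.isclose(candidate, other, rel_tol=rel_tol)
                    for other in results)
        if agree >= 2:   # a candidate always agrees with itself
            return candidate
    return None          # all three disagree: redo the calculation

def calc():
    return math.sqrt(78234897123.8242391 * 2989912048.2453)

print(vote([calc(), calc(), calc()]))
```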