Fewer Faults for Faster Computing

Mar 28, 2011
Figure: The redundant data distribution of an array, showing that a node failure leaves at least one copy of the data available for continued execution. This is the basic idea that enables the approach, which can be applied in many science domains.

(PhysOrg.com) -- Environmental Molecular Sciences Laboratory (EMSL) users have designed and implemented an efficient fault-tolerant version of the coupled cluster method for high-performance computational chemistry using in-memory data redundancy.

Their method, demonstrated with the EMSL-developed NWChem computational chemistry software, addresses the challenge of a shrinking mean time between failures, which is currently measured in days and is projected to drop to hours on upcoming extreme-scale supercomputers.

Their approach, demonstrated on the coupled cluster perturbative triples method, enables the program to continue execution correctly despite the loss of processes.

The team extended the Global Arrays toolkit, a library that provides an efficient and portable “shared-memory” programming interface for distributed-memory computers.

Each process in a Multiple Instruction/Multiple Data parallel program can asynchronously access logical blocks of physically distributed dense multidimensional arrays, without requiring cooperation by other processes.
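The core idea described above can be illustrated with a minimal sketch. This is not the actual Global Arrays implementation; the node count, block layout, and `read_block` helper are all hypothetical, chosen only to show how keeping each block on two nodes lets every block survive a single node failure:

```python
# Sketch (assumed layout, not the Global Arrays API): each block of a
# distributed array is stored on its home node and on one backup node,
# so the loss of any single node leaves at least one copy of every block.

NUM_NODES = 4
NUM_BLOCKS = 8

# node -> {block_id: data}
nodes = {n: {} for n in range(NUM_NODES)}
for b in range(NUM_BLOCKS):
    home = b % NUM_NODES
    backup = (home + 1) % NUM_NODES      # redundant copy on the next node
    data = [b] * 4                       # stand-in for a block of array elements
    nodes[home][b] = data
    nodes[backup][b] = list(data)

def read_block(b, alive):
    """Fetch block b from any surviving node, as a one-sided 'get' would."""
    for n in alive:
        if b in nodes[n]:
            return nodes[n][b]
    raise RuntimeError(f"block {b} lost")  # would require two simultaneous failures

alive = set(range(NUM_NODES)) - {2}      # simulate the failure of node 2
recovered = [read_block(b, alive) for b in range(NUM_BLOCKS)]
print(all(recovered[b] == [b] * 4 for b in range(NUM_BLOCKS)))  # True
```

Because every block has a copy on a second node, the surviving processes can keep computing after a failure instead of restarting the whole job, which is the behavior the team demonstrated for the perturbative triples calculation.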

The infrastructure that the team developed was shown to add an overhead of less than 10% and can be deployed to other algorithms throughout NWChem as well as other codes. Such advances in supercomputing will enhance scientific capability to address global challenges such as climate change and energy solutions using top-end computing platforms.


More information: van Dam HJJ, A Vishnu, and WA de Jong. 2011. “Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples.” J. Chem. Theory Comput. 7:66–75. DOI: 10.1021/ct100439u


