SDSC assists in whole-genome sequencing analysis in collaboration with Janssen

Mar 20, 2014

A recent whole-genome sequencing (WGS) analysis project supported by the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, has demonstrated the effectiveness of innovative applications of "flash" memory technology in rapidly processing the large data sets that are pervasive throughout human genomics research.

Janssen Research and Development, LLC (Janssen), in collaboration with SDSC and the Scripps Translational Science Institute (STSI), recently launched a project to conduct whole-genome sequencing of 438 patients with rheumatoid arthritis to better understand the disease, as well as explore genetic factors of patient response to a biologic therapy discovered, developed, and currently marketed by Janssen in the United States.

The analysis began with 50 terabytes of "read" data generated by DNA sequencers from samples originally obtained from each of the study participants. These source data were fed into a 14-step processing "pipeline" built from open source software tools. Key components of the analysis were mapping the DNA read sequences from each patient against a reference genome and calling variants to identify where each patient's genome differs from the reference.
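The article does not name the 14 steps or the specific open source tools, so the following is only a minimal sketch of the two components described above, assuming a common tool choice (BWA for read mapping, SAMtools for sorting and indexing, GATK's HaplotypeCaller for variant calling); the file paths and thread counts are placeholder examples.

```python
"""Illustrative sketch of two WGS pipeline components: read mapping against a
reference genome, followed by variant calling. Tool choices (BWA, SAMtools,
GATK) and all file paths are assumptions for illustration, not details taken
from the project."""
import subprocess

REFERENCE = "reference/GRCh37.fa"            # placeholder reference genome
READS_1 = "reads/patient001_R1.fastq.gz"     # placeholder paired-end reads
READS_2 = "reads/patient001_R2.fastq.gz"
THREADS = "16"                               # arbitrary example thread count


def run(cmd):
    """Run one pipeline step and stop the pipeline if the tool fails."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Read mapping: align the patient's reads to the reference genome, then
# coordinate-sort and index the alignments for downstream steps.
run(["bwa", "mem", "-t", THREADS, REFERENCE, READS_1, READS_2,
     "-o", "patient001.sam"])
run(["samtools", "sort", "-@", THREADS, "-o", "patient001.sorted.bam",
     "patient001.sam"])
run(["samtools", "index", "patient001.sorted.bam"])

# Variant calling: identify the positions at which the patient's genome
# differs from the reference.
run(["gatk", "HaplotypeCaller",
     "-R", REFERENCE,
     "-I", "patient001.sorted.bam",
     "-O", "patient001.variants.vcf.gz"])
```

In the actual project, each of the 438 genomes passed through the full 14-step pipeline, of which mapping and variant calling were only two stages.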

The read mapping and variant calling were done by Kristopher Standish, a UC San Diego graduate student working under Nicholas Schork, formerly with STSI and now with the J. Craig Venter Institute. SDSC provided high-performance computing and storage resources, as well as expertise to set up and optimize the computational pipeline.

"The need to conduct analysis of 438 full human genomes in a relatively short timeframe necessitated a thorough understanding not only of the computational workload, but of the memory, storage, and input/output requirements," said Wayne Pfeiffer, an SDSC Distinguished Scientist and the Center's lead researcher in the collaboration. "The emergence of 'big data' challenges such as those in human genomics has brought to the fore situations where computer analyses are more likely memory-and I/O (input/output)-bound than compute-bound, meaning that while the actual computer processors may have plenty of capacity, the ability to store and/or move around large amounts of data becomes the limiting factor in throughput."

In the case of the Janssen collaboration, one step – the "sort" step of the read mapping stage – proved especially challenging, requiring a relatively small number of processor cores but rapid access to several terabytes of data, more than can be kept in the supercomputer's high-performance main memory. The conventional approach of storing data on hard disk drives during the sort step resulted in a severely I/O-bound situation, dramatically limiting throughput.

"The solution was to take advantage of Gordon's flash memory, which provides much higher speed than conventional disk drives for the random access I/O operations of the sort step," said Pfeiffer. "Several terabytes of flash were aggregated into what we call "BigFlash" nodes, which significantly reduced the I/O bottleneck in this step and contributed to helping researchers meet the project's timelines."

"The bulk of the analysis was completed in six weeks (including learning time on Gordon) using more than 300,000 core hours of computer time," said Glenn K. Lockwood, a user services consultant at SDSC. "That analysis would have taken more than four years of 24/7 compute time on an 8-core workstation."

The collaboration also demonstrated the need for large-scale, high-performance computing resources when analyzing hundreds of human genomes in constrained timeframes. With 340 teraflops of computing power, 64 terabytes of main memory, and 300 terabytes of flash-based storage, Gordon ranked among the 50 fastest supercomputers in the world when it debuted in late 2011, according to the Top500 list.

According to Lockwood, at the project's peak throughput, the WGS pipeline was using 350 terabytes of storage on SDSC's high-performance storage system and 5,000 processor cores representing 30 percent of the system capacity.
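The 30 percent figure is consistent with Gordon's published configuration of 1,024 compute nodes with 16 cores each; that total is not stated in this article, so treat it as an outside assumption in the check below.

```python
"""Check of the peak-usage figure: 5,000 cores as a share of Gordon's
capacity, assuming the published total of 1,024 nodes x 16 cores each."""
cores_in_use = 5_000
gordon_total_cores = 1_024 * 16   # 16,384 cores (assumed published total)

print(f"{cores_in_use / gordon_total_cores:.0%} of system capacity")  # ~31%
```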

"The Janssen collaboration validated our vision for the Gordon system," said Michael Norman, SDSC's director and principal investigator for the Gordon project. "We saw that emerging big data challenges such as human genomics would dictate new supercomputer architectures where memory and IOPS (I/O operations per second) would be more important than raw computing power, so we designed the system accordingly."
