Scientists help tame tidal wave of genomic data using SDSC's Trestles

Sep 18, 2013

Sequencing the DNA of an organism, whether human, plant, or jellyfish, has become a straightforward task, but assembling the information gathered into something coherent remains a massive data challenge. Researchers using computational resources at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, have created a faster and more effective way to assemble genomic information, while increasing

In a paper presented last month at the 39th International Conference on Very Large Data Bases (VLDB 2013) in Riva del Garda, Italy, Xifeng Yan, the Venkatesh Narayanamurti Chair of Computer Science at the University of California, Santa Barbara, explains how he used SDSC's Trestles compute cluster to help develop a new algorithm called MSP (minimum substring partitioning) that helps to assemble genomes with extreme efficiency. MSP is a critical part of a pipeline, a set of software tools that together assemble entire genomes, with each tool doing one part of the job. Yan and his colleagues were able to optimize one of two memory-intensive steps to use a mere 10 gigabytes of memory without runtime slowdown.

"High-quality genome sequencing is foundational to many critical biological and medical problems," said Yan. "With the advent of massively parallel DNA sequencing technologies how to manage and process the big sequence data has become an important issue. Experimental results showed that MSP can not only successfully complete the tasks on very large datasets within a small amount of memory, but also achieve better performance than existing state-of-the-art algorithms."

According to Yan, his experimental results demonstrate that MSP's improvement in efficiency might soon make it possible to assemble large genomes using smaller, less expensive commodity clusters rather than requiring high-cost, high-performance resources.

Knowing the whole genome of various species underlies biological and medical research, such as understanding evolutionary pathways or identifying the causes of diseases. However, existing sequencing techniques produce huge numbers of overlapping short sequences, or reads, randomly sampled from the genome: billions for a higher organism such as a human. A major challenge in genome research is to assemble those short reads, which vary from ten to several hundred bases in length, back into the whole genome, a task that requires vast amounts of memory. It is similar to gluing together an encyclopedia from a haystack of words and sentence fragments.

Using Trestles, Yan and his colleagues demonstrated that MSP allows one of the required steps to run with significantly less memory than widely used algorithms need, removing one of the bottlenecks in processing whole genomes. Algorithms such as Velvet and SOAPdenovo struggle computationally to prepare the virtual scaffolding upon which the sequences are assembled into complete genomes. MSP, a disk-based partition method, streamlines the creation of such scaffolding, known as a De Bruijn graph. A mammalian-sized genome processed using other algorithms would consume hundreds of gigabytes of memory, while MSP allows researchers to complete a key step within ten gigabytes of memory without runtime slowdown.
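
For readers who want a feel for how a partitioning scheme of this kind can work, the short Python sketch below may help. It is an illustration only, not the researchers' code: the function names, the toy reads, and the values of k and p are all assumptions made for the example. It cuts each read into overlapping substrings of length k (k-mers), assigns each k-mer to a partition keyed by its lexicographically smallest substring of length p, and then derives De Bruijn graph edges from a single partition. A Python dictionary stands in for the on-disk partition files, and details a real assembler would need, such as reverse complements, are left out.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def minimum_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_reads(reads, k, p):
    """Group distinct k-mers into partitions keyed by their minimum p-substring.

    Neighbouring k-mers overlap by k-1 bases, so when k is much larger than p
    they often share a key. A disk-based tool would write each partition to
    its own file; here a dict of sets stands in for those files (assumption).
    """
    partitions = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            partitions[minimum_substring(kmer, p)].add(kmer)
    return partitions


def de_bruijn_edges(kmer_set):
    """Each k-mer contributes one edge from its (k-1)-prefix to its (k-1)-suffix."""
    return {(kmer[:-1], kmer[1:]) for kmer in kmer_set}


# Toy, made-up reads; real data sets contain billions of reads.
reads = ["ACGTTGCA", "GTTGCATG"]
parts = partition_reads(reads, k=5, p=2)
for key in sorted(parts):
    print(key, sorted(parts[key]), sorted(de_bruijn_edges(parts[key])))
```

Because each partition can be loaded and processed on its own, peak memory in a scheme like this depends on the size of the largest partition rather than on the size of the whole data set, which is the general reason a disk-based partition method can fit in a small memory budget.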

Yan and his colleagues are working on a second step that also consumes a significant amount of memory, and have so far reduced its memory use by two-thirds, with the goal of further reductions in the future. Additional researchers include Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, and Subhash Suri, all with UC Santa Barbara.

The full paper can be viewed at
