Scientists help tame tidal wave of genomic data using SDSC's trestles

September 18, 2013, University of California - San Diego

Sequencing the DNA of an organism, whether human, plant, or jellyfish, has become a straightforward task, but assembling the information gathered into something coherent remains a massive data challenge. Researchers using computational resources at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, have created a faster and more effective way to assemble genomic information, while increasing

In a paper presented the past month at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy, Xifeng Yan, the Venkatesh Narayanamurti Chair of Computer Science at the University of California, Santa Barbara, explains how he used SDSC's Trestles compute cluster to help develop a new algorithm called MSP (minimum substring partitioning) that helps to assemble genomes with extreme efficiency. MSP is a critical part of a pipeline, or a group of software that assembles entire genomes, with each piece of the software doing one part of the job. Yan and his colleagues were able to optimize one of two steps to use a mere 10 gigabytes of memory without runtime slowdown.

"High-quality genome sequencing is foundational to many critical biological and medical problems," said Yan. "With the advent of massively parallel DNA sequencing technologies how to manage and process the big sequence data has become an important issue. Experimental results showed that MSP can not only successfully complete the tasks on very large datasets within a small amount of memory, but also achieve better performance than existing state-of-the-art algorithms."

According to Yan, his experimental results demonstrate that MSP's improvement in efficiency might soon make it possible to assemble large genomes using smaller, less expensive, commodity clusters rather than requiring high-cost, high performance resources.

Knowing the whole genome of various species underlies biological and medical research, such as understanding evolution pathways or identifying the cause of diseases. However, existing sequencing techniques produce huge amounts – billions for a higher organism such as a human – of overlapping short sequence randomly sampled from the genome. A major challenge in genome research is to assemble those short reads, which vary from ten to several hundred bases, back into the whole genome, a task that requires vast amounts of memory. It would be similar to gluing together an encyclopedia from a haystack of words and sentence fragments.

Using Trestles, Yan and his colleagues demonstrated that MSP reduces one of the steps required so that it uses significantly less memory than widely-used algorithms, removing one of the bottlenecks in processing whole genomes. Algorithms such as Velvet and SOAPdenovo struggle to computationally to prepare a virtual scaffolding upon which to assemble the sequence into complete genomes. MSP, a disk-based partition method, streamlines the creation of such scaffolding, known as a De Bruijn graph. A mammalian-sized processed using other algorithms would consume hundreds of gigabytes of memory, while MSP allows researchers to complete a key step to ten gigabytes of memory without runtime slowdown.

Yan and his colleagues are working on a second step that also consumes a significant amount of memory, and have so far reduced its memory use by two-thirds with the goals of further reductions in the future. Additional researchers include Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, and Subhash Suri, all with UC Santa Barbara.

The full paper can be viewed at

Explore further: Cancer biology: Charting a tumor's genomic roots

Related Stories

Sequencing hundreds of chloroplast genomes now possible

January 31, 2013

Researchers at the University of Florida and Oberlin College have developed a sequencing method that will allow potentially hundreds of plant chloroplast genomes to be sequenced at once, facilitating studies of molecular ...

Researchers develop software tool for cancer genomics

August 26, 2013

Researchers at the Medical College of Wisconsin (MCW) have developed a new bioinformatics software tool designed to more easily identify genetic mutations responsible for cancers. The tool, called DrGaP, is the subject of ...

100K Pathogen Genome Project maps first genomes

May 22, 2013

( —Striking a blow at foodborne diseases, the 100K Pathogen Genome Project at the University of California, Davis, today announced that it has sequenced the genomes of its first 10 infectious microorganisms, including ...

Recommended for you


Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.