Searching genomic data faster with new algorithm

Jul 10, 2012 by Larry Hardesty

In 2001, the Human Genome Project and Celera Genomics announced that after 10 years of work at a cost of some $400 million, they had completed a draft sequence of the human genome. Today, sequencing a human genome is something that a single researcher can do in a couple of weeks for less than $10,000.

Since 2002, the rate at which genomes can be sequenced has been doubling every four months or so, whereas computing power doubles only every 18 months. Without the advent of new analytical tools, biologists' ability to generate genomic data will soon outstrip their ability to do anything useful with it.
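To make the gap between those two doubling rates concrete, here is a small back-of-the-envelope calculation. It assumes clean exponential growth and reads the 18-month figure as computing power; the three-year horizon and the function name are illustrative choices, not figures from the researchers.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """How many times a quantity multiplies after `months` of steady doubling."""
    return 2 ** (months / doubling_period_months)


HORIZON_MONTHS = 36  # illustrative three-year window (an assumption)

sequencing = growth_factor(HORIZON_MONTHS, 4)   # doubles every 4 months
computing = growth_factor(HORIZON_MONTHS, 18)   # doubles every 18 months
gap = sequencing / computing

print(f"sequencing capacity grows {sequencing:.0f}x")
print(f"computing power grows {computing:.0f}x")
print(f"data outpaces computing by {gap:.0f}x")
```

Under these assumptions, three years is nine doublings for sequencing but only two for computing, so the amount of data per unit of computing power grows by a factor of 128.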

In a newly published paper, MIT and Harvard University researchers describe a new algorithm that drastically reduces the time it takes to find a particular sequence in a database of genomes. Moreover, the more genomes it's searching, the greater the speedup it affords, so its advantages will only compound as more data is generated.

In some sense, this is a data-compression algorithm, like the one that allows computer users to compress data files into smaller zip files. "You have all this data, and clearly, if you want to store it, what people would naturally do is compress it," says Bonnie Berger, a professor of applied math and computer science at MIT and senior author on the paper. "The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses."

Exploiting redundancy

The researchers' compression scheme exploits the fact that evolution is stingy with good designs. There's a great deal of overlap in the genomes of closely related species, and some overlap even in the genomes of distantly related species: That's why experiments performed on yeast cells can tell us something about human drug reactions.

Berger; her former grad student Michael Baym PhD '09, who's now a visiting scholar in the MIT math department and a postdoc in systems biology at Harvard Medical School; and her current grad student Po-Ru Loh developed a way to mathematically represent the genomes of different species, or of different individuals within a species, such that the overlapping data is stored only once. A search of multiple genomes can thus concentrate on their differences, saving time.

"If I want to run a computation on my genome, it takes a certain amount of time," Baym explains. "If I then want to run the same computation on your genome, the fact that we're so similar means that I've already done most of the work."

In experiments on a database of 36 yeast genomes, the researchers compared their algorithm to one called BLAST, for Basic Local Alignment Search Tool, one of the most commonly used genomic-search algorithms in biology. In a search for a particular genetic sequence in only 10 of the yeast genomes, the new algorithm was twice as fast as BLAST; but in a search of all 36 genomes, it was four times as fast. That discrepancy will only increase as genomic databases grow larger, Berger explains.
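The idea of storing overlapping data once and searching only what is unique can be illustrated with a toy sketch. This is not the researchers' algorithm, which works with BLAST-style similarity search. It is a simplified stand-in that assumes exact substring matching and fixed-size blocks; the names `compress`, `contains`, and `BLOCK` are hypothetical.

```python
BLOCK = 4  # hypothetical block size, chosen small for the demo


def compress(genomes, block=BLOCK):
    """Store each distinct block once; each genome becomes a list of block ids."""
    store = []    # unique block strings
    ids = {}      # block string -> index into store
    layout = {}   # genome name -> list of block ids
    for name, seq in genomes.items():
        layout[name] = []
        for i in range(0, len(seq), block):
            chunk = seq[i:i + block]
            if chunk not in ids:
                ids[chunk] = len(store)
                store.append(chunk)
            layout[name].append(ids[chunk])
    return store, layout


def contains(query, store, layout, block=BLOCK):
    """Return (genomes containing query, number of block pairs actually scanned).

    A query no longer than one block always fits inside two adjacent blocks,
    so each distinct pair of adjacent blocks is scanned once and the result
    is reused for every genome that shares that pair.
    """
    if len(query) > block:
        raise ValueError("query must be no longer than one block")
    cache = {}  # (block id, next block id or None) -> bool
    found = []
    for name, block_ids in layout.items():
        for pair in zip(block_ids, block_ids[1:] + [None]):
            if pair not in cache:
                left, right = pair
                text = store[left] + (store[right] if right is not None else "")
                cache[pair] = query in text
            if cache[pair]:
                found.append(name)
                break
    return found, len(cache)


genomes = {
    "a": "ACGTACGGTTAA",
    "b": "ACGTACGGTTAC",
    "c": "ACGTTCGGTTAA",
}
store, layout = compress(genomes)
found, scans = contains("TTAC", store, layout)
print(len(store), found, scans)
```

In this example the three genomes contain nine blocks in total but only five distinct ones, and regions shared between genomes are scanned a single time. The more similar genomes are added, the larger the share of work that is reused, which mirrors the growing speedup described above.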

Matchmaking

The new algorithm would be useful in any application where the central question is, as Baym puts it: "I have a sequence; what is it similar to?" Identifying microbes is one example. The new algorithm could help clinicians determine causes of infections, or it could help biologists characterize "microbiomes," collections of microbes found in animal tissue or particular microenvironments; variations in the human microbiome have been implicated in a range of medical conditions. It could be used to characterize the microbes in particularly fertile or infertile soil, and it could even be used in forensics, to determine the geographical origins of physical evidence by its microbial signatures.

Berger's group is currently working to extend the technique to information on proteins and RNA sequences, where it could pay even bigger dividends. Now that the human genome has been mapped, the major questions in biology are which genes are active when, and how the proteins they code for interact. Searches of large databases of biological information are crucial to answering both questions.
