February 8, 2016

Search technique helps researchers find DNA sequences in minutes rather than days

Database searches for DNA sequences that can take biologists and medical researchers days can now be completed in a matter of minutes, thanks to a new search method developed by computer scientists at Carnegie Mellon University.

The method developed by Carl Kingsford, associate professor of computational biology, and Brad Solomon, a Ph.D. student in the Computational Biology Department, is designed for searching so-called "short reads" - DNA and RNA sequences generated by high-throughput sequencing techniques. It relies on a new indexing data structure, called Sequence Bloom Trees, or SBTs, that the researchers describe in a report published online today by the journal Nature Biotechnology.

The National Institutes of Health maintains an enormous database, called the Sequence Read Archive, which contains about three petabases, or sequences totaling three quadrillion base-pairs. The information is useful to a wide swath of researchers, from those asking questions about basic biological processes to those studying potential cancer cures.

"The database contains untold numbers of as-yet undiscovered insights and is heavily used," Kingsford said. "Its main problem is that it's very difficult to search."

Thousands of hard drives would be needed to store these sequences. Searching through the short reads, which are typically 50 to 200 base-pairs each, to see which ones could be assembled to form a target gene of perhaps 10,000 base-pairs, is cumbersome and can take days in some cases, he noted.

Just as an index can speed searches through a book or catalog, the SBT-based index developed by Kingsford and Solomon can greatly speed up searches of this bioinformatics database. The researchers represent each short read as a set of fixed-length subsequences, employing data structures called Bloom filters that can efficiently store information in a small space and can test whether an element is part of a set.
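To make the idea concrete, here is a minimal Python sketch of a Bloom filter holding the fixed-length subsequences (k-mers) of a single read. It is an illustration only, not the researchers' code: the class name, the filter size, the number of hash functions and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


def kmers(read, k):
    """Return the set of all length-k subsequences (k-mers) of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


class BloomFilter:
    """Toy Bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


read = "ACGTACGTGA"
bf = BloomFilter()
for kmer in kmers(read, k=4):
    bf.add(kmer)

print("ACGT" in bf)  # a k-mer that was added is always reported as present
```

The trade-off is that a Bloom filter never misses an element that was added, but it can occasionally report an element that was not, which is the price of the small memory footprint.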

At the first level of inquiry, the SBTs can tell whether a target DNA sequence is contained in the database at all. If it is, the search proceeds to the next level, where the SBTs indicate whether the sequence is in one half or the other of the database. At each level, the inquiry branches one way or the other until the sequencing experiments that contain the target sequence are identified.
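The level-by-level search can be sketched as a binary tree in which every leaf holds a Bloom filter for one experiment and every internal node holds the union of its children's filters. The Python below is a simplified sketch under stated assumptions, not the published implementation: the names `Node`, `leaf`, `merge` and `search`, the tiny k-mer length, the filter size, and the `theta` parameter (the fraction of query k-mers that must be found) are all hypothetical choices made for illustration.

```python
import hashlib

K = 4             # k-mer length; tiny here so the example stays readable
NUM_BITS = 4096   # bits per Bloom filter (illustrative)
NUM_HASHES = 3    # hash functions per item (illustrative)


def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(item):
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big") % NUM_BITS
        for seed in range(NUM_HASHES)
    ]


def make_filter(items):
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def maybe_contains(bits, item):
    return all((bits >> pos) & 1 for pos in positions(item))


class Node:
    def __init__(self, bits, name=None, left=None, right=None):
        self.bits, self.name, self.left, self.right = bits, name, left, right


def leaf(name, reads):
    """One experiment: a Bloom filter over the k-mers of all its reads."""
    items = set()
    for read in reads:
        items |= kmers(read)
    return Node(make_filter(items), name=name)


def merge(left, right):
    """Internal node: bitwise OR equals the filter of the combined k-mer sets."""
    return Node(left.bits | right.bits, left=left, right=right)


def search(node, query, theta=1.0):
    """Return names of experiments that may contain the query sequence."""
    query_kmers = kmers(query)
    found = sum(maybe_contains(node.bits, km) for km in query_kmers)
    if found < theta * len(query_kmers):
        return []  # prune: nothing below this node can match
    if node.name is not None:
        return [node.name]
    return search(node.left, query, theta) + search(node.right, query, theta)


blood = leaf("blood", ["ACGTACGT", "GTACGGA"])
brain = leaf("brain", ["TTTTCCCC", "CCCCGGGG"])
breast = leaf("breast", ["ACGTACGG", "AAAATTTT"])
root = merge(merge(blood, brain), breast)

hits = search(root, "GTACGG")
print(hits)
```

In this toy data the query's k-mers occur in the "blood" and "breast" reads, so both should be reported. Because Bloom filters allow false positives, a search like this can also return an occasional experiment that does not truly contain the sequence, but it never prunes one that does.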

Kingsford and Solomon tested their technique using a database of 2,652 human blood, breast and brain experiments, each of which often contains over a billion base-pairs of RNA sequences. They found that most searches of that database could be completed in an average of 20 minutes. They estimated that comparable searches using existing techniques, known as SRA-BLAST and STAR, would take 2.2 days and 921 days, respectively.
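As a back-of-the-envelope reading of those figures, the short Python calculation below converts the estimated search times into rough speedup factors. It uses only the numbers quoted above (an average of 20 minutes versus estimates of 2.2 days and 921 days), so the results are approximate ratios, not measured benchmarks.

```python
MINUTES_PER_DAY = 24 * 60

sbt_minutes = 20  # average SBT search time quoted in the article
baseline_days = {"SRA-BLAST": 2.2, "STAR": 921}  # estimated times quoted in the article

# Ratio of each baseline's estimated time to the SBT search time.
speedups = {
    name: days * MINUTES_PER_DAY / sbt_minutes
    for name, days in baseline_days.items()
}

for name, factor in speedups.items():
    print(f"{name}: roughly {factor:,.0f} times longer than the SBT search")
```

By this arithmetic the SBT search is on the order of 160 times faster than the SRA-BLAST estimate and tens of thousands of times faster than the STAR estimate.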

Further speedups are possible because batches of over 200,000 queries can be performed simultaneously, they noted.

More information: Fast search of thousands of short-read sequencing experiments, Nature Biotechnology, DOI: 10.1038/nbt.3442

The SBT method is available as open source code and can be downloaded at www.cs.cmu.edu/~ckingsf/software/bloomtree/

Journal information: Nature Biotechnology

Provided by Carnegie Mellon University

Citation: Search technique helps researchers find DNA sequences in minutes rather than days (2016, February 8) retrieved 18 April 2024 from https://phys.org/news/2016-02-technique-dna-sequences-minutes-days.html

