A new tool to correct DNA sequencing errors using consensus and context

September 3, 2014 by Paul Greenfield, Microsoft

The rapid development of next-generation DNA sequencing has revolutionized biological and ecological research in the last few years. The cost of DNA sequencing has fallen dramatically, and sequencing machines are becoming a standard piece of lab equipment. Low-cost sequencing is enabling researchers to uncover the gene differences that make some people more susceptible to diseases; to explore the genetic makeup of microbial communities from the human gut or the bottom of the ocean; and to rapidly identify the organism responsible for a life-threatening infection.

But while the costs of sequencing have plummeted, the accuracy of the data produced has improved only slowly: about 1 percent of the bases generated are still called incorrectly. The bioinformatics community has responded to this problem by building specialized tools that use the inherent redundancy in sequence data to find and repair miscalls and other sequencing errors. Tests have shown that incorporating the best of these error-correction tools into standard bioinformatics analytical pipelines can result in much better quality genomes and more accurately called gene variants.

However, accurately correcting errors turns out to be a difficult problem, largely because of the repetitive and ambiguous nature of genomes. It is easy to correct simple substitution errors, such as when 50 sequence reads say that a given base is an A, and only the read being corrected says it's a G. Such simple errors are well handled by downstream tools such as assemblers and aligners. The challenge is making the right correction when there are multiple plausible corrections (such as when 50 reads say A, 49 say G, and the read being corrected says T), as happens whenever reads fall across the end of a repeated region within a genome. Just to make things more challenging, this correction has to be done without any knowledge of the genomes being sequenced, and the only clues about which corrections are "right" come from the data itself.
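The difference between the easy case and the hard case can be shown with a small sketch. This is an illustration only, not Blue's algorithm: it assumes the bases that other reads report for one position have already been gathered into a list (a hypothetical `pileup`), which a real corrector cannot do without knowing the genome. The function name `classify_base` and the `min_ratio` threshold are made up for this example.

```python
from collections import Counter


def classify_base(observed, pileup, min_ratio=5.0):
    """Classify one base of a read against what other reads say at that position.

    observed  -- the base in the read being corrected
    pileup    -- bases reported by the other reads (assumed already collected)
    min_ratio -- hypothetical threshold: a base is plausible if its support is
                 within this factor of the best-supported base
    Returns ("keep", base), ("correct", base), or ("ambiguous", [bases]).
    """
    counts = Counter(pileup)
    if not counts:
        return ("keep", observed)  # no evidence, so leave the base alone
    top_count = max(counts.values())
    # Every base whose support is close to the best one is a plausible call.
    plausible = sorted(b for b, n in counts.items() if n * min_ratio > top_count)
    if observed in plausible:
        return ("keep", observed)
    if len(plausible) == 1:
        return ("correct", plausible[0])  # a single clear consensus
    return ("ambiguous", plausible)  # several corrections look equally good


# Easy case: 50 reads say A, the read being corrected says G.
easy = classify_base("G", ["A"] * 50)
# Hard case: 50 reads say A, 49 say G, the read being corrected says T.
hard = classify_base("T", ["A"] * 50 + ["G"] * 49)
print(easy, hard)
```

In the hard case, simple vote counting cannot choose between A and G. Resolving that kind of tie is where the context of the read being corrected becomes necessary.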

My colleagues and I at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) have just released a new error correction tool we've developed for use by the research community. We call it "Blue." Blue is a high-performance C# application that runs natively on Windows systems, and under Mono on Linux and OS X. As we reported in a paper published in Bioinformatics, test results show that Blue is significantly faster than other available tools—especially on Windows—and is also more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected.

Another uncommon feature of Blue is that it can correct all three types of possible errors (substitutions, deletions, and insertions), making it suitable for use with data produced by the Roche 454 and Life Technologies Ion Torrent systems. Blue also allows for the correction of one set of reads with a consensus derived from another set of reads, and this capability has been used to correct small numbers of long (and expensive) Roche 454 reads with a consensus derived from a large file of cheaper (but shorter) Illumina reads. This "cross-correction" method has been used very effectively to improve the quality of several reference assemblies, ranging in size from bacteria to moths and grasses.

More information: Paul Greenfield, Konsta Duesing, Alexie Papanicolaou, and Denis C. Bauer. "Blue: correcting sequencing errors using consensus and context." Bioinformatics, first published online June 11, 2014. DOI: 10.1093/bioinformatics/btu368

Blue and its associated tools can be downloaded from CSIRO Bioinformatics: www.bioinformatics.csiro.au/blue/
