Unexpected cross-species contamination in genome sequencing projects

November 18, 2014, PeerJ

As genome sequencing has gotten faster and cheaper, the pace of whole-genome sequencing has accelerated, dramatically increasing the number of genomes deposited in public archives. Although these genomes are a valuable resource, problems can arise when researchers misapply computational methods to assemble them, or accidentally introduce unnoticed contaminations during sequencing.

The first complete bacterial genome, Haemophilus influenzae, appeared in 1995, and today the public GenBank database contains over 27,000 prokaryotic and 1,600 eukaryotic genomes. The vast majority of these are draft genomes that contain gaps in their sequences, and researchers often use these draft sequences for future analyses.

Each project begins with a DNA source, which varies depending on the species. For animals, blood is a common source, while for smaller organisms such as insects the entire organism or a population of organisms may be required to yield enough DNA for sequencing. Throughout the process of DNA isolation and sequencing, contamination remains a possibility. Computational filters applied to the raw sequencing reads are usually effective at removing common laboratory contaminants such as E. coli, but other contaminants may be more difficult to identify.
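To make the idea of a computational read filter concrete, here is a minimal Python sketch. It is not the method used in the study or by any particular sequencing pipeline; production filters generally rely on alignment or classification tools and full reference genomes. The function names, the k-mer length, and the 0.5 threshold are all assumptions chosen for illustration. The sketch flags a read when a large share of its k-mers (short substrings of length k) also occur in a known contaminant sequence or its reverse complement.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq):
    """Return the reverse complement; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def contaminant_fraction(read, contaminant_kmers, k):
    """Fraction of the read's distinct k-mers found in the contaminant index."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & contaminant_kmers) / len(read_kmers)


def filter_reads(reads, contaminant_seq, k=8, threshold=0.5):
    """Split reads into (kept, flagged) lists.

    k and threshold are illustrative values, not recommendations.
    """
    # Index both strands, since a read may come from either one.
    index = kmers(contaminant_seq, k) | kmers(reverse_complement(contaminant_seq), k)
    kept, flagged = [], []
    for read in reads:
        if contaminant_fraction(read, index, k) >= threshold:
            flagged.append(read)
        else:
            kept.append(read)
    return kept, flagged


# Toy data: a made-up "contaminant" and three made-up reads.
CONTAMINANT = "ATGGCGTACGTTAGCCTAGGCTAACGTTAGC"
READS = [
    CONTAMINANT[5:25],                      # taken from the contaminant
    reverse_complement(CONTAMINANT[5:25]),  # same region, opposite strand
    "TTTTTTTTTTTTTTTTTTTT",                 # unrelated sequence
]
KEPT, FLAGGED = filter_reads(READS, CONTAMINANT)
```

A filter of this kind can only catch contaminants it has been given a reference for, which is one reason unexpected sources such as another animal's DNA can slip through.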

In a new study in PeerJ, authors from Johns Hopkins University discovered contaminating bacterial and viral sequences in "draft" assemblies of animal and plant genomes that had been deposited in GenBank. These may cause particular problems for the rapidly growing field of microbiome analysis, when sequences labeled as animal in origin actually turn out to be microbial.

In an even more surprising finding, the authors discovered the presence of cow and sheep DNA in the supposedly finished genome of a pathogenic bacterium, Neisseria gonorrhoeae. Although deposited in GenBank as a finished genome, the genome apparently was a draft that was submitted as complete, with erroneous DNA inserted in five places. If taken at face value, these data would appear to be a startling case of lateral gene transfer, but the correct explanation appears to be more mundane.

These findings highlight the importance of careful screening of DNA sequence data both at the time of release and, in some cases, for many years after publication.
