Contamination found in nearly a quarter of genome databases

Feb 18, 2011 By Christine Buckley
Mark Longo, a graduate student in molecular and cell biology, and associate professor Rachel O'Neill. Photo by Dan Buttrey

(PhysOrg.com) -- UConn scientists say the results could complicate disease identification in humans.

A new genomics study by molecular biologists at the University of Connecticut has shown that at least 22 percent of non-human databases are contaminated with human DNA. Their results imply that this level of contamination could also exist in records of the human genome, which could produce major problems in identifying human diseases.

Associate professor Rachel O’Neill, graduate student Mark Longo, and associate professor Michael O’Neill of the molecular and cell biology department in the College of Liberal Arts and Sciences published their findings today in an online edition of the journal PLOS One.

Longo says that he had originally been scanning the genome of zebrafish and comparing it with the human genome to find what are called ultraconserved regions, or bits of DNA that are so ancient they are similar among species that are distantly related, like humans and fish.

But, to Longo’s surprise, he found a region of DNA that was identical to one in humans and couldn’t be a part of the fish genome. That’s when he knew that the fish genome database he was using was contaminated.

“Contamination in these databases could be from people’s skin or hair, or it could be DNA from other sequence libraries kept in the same facility,” says Longo. “We knew we needed to quantify this to see how many of the databases contained human contamination.”

The researchers gathered sequences from all the major global DNA repositories, including the archives at the National Center for Biotechnology Information, the University of California Santa Cruz, the Joint Genome Databases, and the Ensembl genome browser. Data from any federally funded sequencing project must be deposited in one of these archives.

Using a section of DNA that is specific to primates and abundant in the human genome, the researchers identified 454 non-primate genomes out of the 2,027 they sampled as contaminated with human DNA.
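
A screen of this kind can be pictured with a small sketch. This is only an illustration under stated assumptions, not the researchers' pipeline: the marker string, the record names, and the helper functions are all hypothetical, and exact string matching stands in for the sequence-alignment tools a real screen would need. The last lines simply recompute the article's 454-of-2,027 figure.

```python
# Hypothetical sketch: flag non-primate records that contain a marker sequence.
# MARKER is an arbitrary made-up string, not a real primate-specific element.
MARKER = "ACGTTGCAAGGCTTAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_flagged(seq, marker=MARKER):
    """True if the marker occurs on either strand of the sequence."""
    seq = seq.upper()
    return marker in seq or reverse_complement(marker) in seq


def contamination_rate(records):
    """Return the names of flagged records and the flagged fraction."""
    flagged = [name for name, seq in records.items() if is_flagged(seq)]
    return flagged, len(flagged) / len(records)


# Invented toy records; two of the four carry the marker.
records = {
    "fish_contig_1": "ATGCATGC" + MARKER + "TTTT",
    "fish_contig_2": "ATGCATGCATGCATGC",
    "rice_contig_1": "CC" + reverse_complement(MARKER) + "GG",
    "rice_contig_2": "GATTACAGATTACA",
}
flagged, rate = contamination_rate(records)
print(flagged, f"{rate:.0%}")

# The article's own figures: 454 of 2,027 sampled genomes.
article_rate = 454 / 2027
print(f"{article_rate:.1%}")
```

The article's "at least 22 percent" corresponds to 454 divided by 2,027, which is about 22.4 percent.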

Rachel O’Neill says this result led them to reason that if these non-human genome databases were contaminated with human DNA, then it’s just as likely that many human databases would be contaminated as well. But, she says, the catch is that it’s virtually impossible to identify a foreign bit of human DNA in a human genome database.

“In sequencing, you have to put all the pieces of the genome together like a big jigsaw puzzle. The pieces that don’t fit stand out,” Longo says. “But if you’re working on a human puzzle, it’s like working on a three-billion piece puzzle, and it’s all black.”

“It’s virtually impossible to find human contamination in human genome databases,” O’Neill adds, because the contaminating sequences simply don’t stand out as anything unusual in a human genome. This, she says, could lead to some terrible mistakes.
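
The jigsaw analogy can be sketched with a toy assembler. The code below is a minimal illustration, assuming error-free reads and a simple greedy merge of the best-overlapping pair; real assemblers are far more elaborate, and the reads and function names here are invented for the example. A foreign read that overlaps nothing is left over as its own piece, which is the sense in which contamination stands out in a non-human genome.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that fits together
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Invented reads covering a 15-base toy "genome", plus one foreign read.
genuine = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
foreign = "GGGGGGGG"

clean = greedy_assemble(genuine)
mixed = greedy_assemble(genuine + [foreign])
print(clean)
print(mixed)
```

In this toy case the genuine reads collapse into one sequence while the foreign read remains separate. A contaminating read from the same species would overlap the genuine reads and be merged with no such signal.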

A portion of the National Center for Biotechnology Information includes a Cancer Genome Atlas: a library documenting mutations that occur in cancer cells. O’Neill says there’s no room for error in these databases.

“It would be very upsetting to be told you have a mutation for breast cancer, when in fact you don’t, and it was just a contamination from another sample,” she says.

O’Neill emphasizes that scientists need to exercise extreme caution when performing their sequencing, and that they should validate results through tests in their own laboratories before submitting them to databases. Longo points out that the UConn researchers found contaminations in some sequences that they had produced in their own laboratories, which they then discarded. O’Neill says these practices should be the norm.

“We’re compounding this problem in our rush to move forward with genomics,” she says. “Millions of dollars are invested each year in these sequence databases, but we’re plowing ahead with less caution than we should. The result is that we might have a harder time recognizing the etiology of something like cancer.”

Longo notes that in his analysis, there was one type of DNA database that showed no contamination at all: that of influenza. Because viruses are so dangerous, great care is taken in their preparation, he says – much more than is usually taken with a commonplace and harmless genome. This kind of caution should be extended to all sequencing, says O’Neill.

“The sequencing world has moved in leaps and bounds,” she says. “It’s time for validation to catch up.”

User comments : 13

dogbert
5 / 5 (1) Feb 18, 2011
Disturbing. Since only a small portion of human DNA was used to identify contamination in non-human DNA databases, the percentage of contamination could be much higher in both human and non-human databases.
Jayded
not rated yet Feb 18, 2011
Wonder what this means to the legal world?
dogbert
not rated yet Feb 18, 2011
Wonder what this means to the legal world?


I wonder that too. Lawyers have always presented DNA evidence to juries as if it were essentially infallible. It is not infallible and this information may provide defense attorneys with a counter argument.
SkiSci
not rated yet Feb 18, 2011
Intriguing work.
MorituriMax
1 / 5 (1) Feb 19, 2011
Wonder what this means to the legal world?


I wonder that too. Lawyers have always presented DNA evidence to juries as if it were essentially infallible. It is not infallible and this information may provide defense attorneys with a counter argument.


Probably exactly nothing, since people have millions of copies of their DNA throughout their bodies, and I doubt the same contamination is present in every single one.
dogbert
not rated yet Feb 19, 2011
MorituriMax,
If nearly a quarter of DNA in databases, which were carefully gathered in a manner to minimize contamination, are contaminated, what level of contamination can be expected of DNA evidence gathered from dirty crime scenes where there have been no means to ensure uncontaminated samples?

Suppose a match to a suspect has a 1 in a million chance of being wrong. Now recognize that the sample taken at the crime scene has a 50% chance of contamination. The 1 million to 1 probability of guilt has just dropped to 1 in 2 (50%), since there is only a 50% probability that the sample is not contaminated.

Such "evidence" becomes meaningless.
trekgeek1
not rated yet Feb 19, 2011
Wonder what this means to the legal world?


Though the contamination is probably a small percentage of the total DNA present. The article states that the researchers found a bit of DNA while looking at ancient strands. If at a crime scene, they find the DNA of a person and it has small portions of another person, they'll probably go with whichever DNA is predominant.
dogbert
not rated yet Feb 19, 2011
They might go many ways with that. Point is, you cannot blindly assume DNA analysis has the value it is purported to have. It is possible to convict people on DNA data who are innocent of the crime and were not at the crime scene.
Ethelred
5 / 5 (1) Feb 21, 2011
Forensic DNA testing doesn't really go on the specifics of the DNA. It goes on where the DNA is cut by a specific set of enzymes. PCR is done on the sample, a restriction enzyme is used to cut the DNA into sections, and then the resulting DNA segments are spread out in a gel.

Contamination shows up lighter than the predominant set of DNA. So it is usually possible to compare the sample with a suspect's sample even with some contamination.

DNA sequencing is different because the cheap way to sequence was to chop it all up and then put it in the right order with a computer program. They don't see it as a whole just a set of A-G-C-T mostly for proteins. You would not have contamination looking like a ghost image as it does in the forensic testing.

IIRC Venter's technique ignored a lot of the DNA that didn't code for proteins. Might have stopped that shortcut by now.

Ethelred
gwargh
3 / 5 (2) Feb 21, 2011
Suppose a match to a suspect has a 1 in a million chance of being wrong. Now recognize that the sample taken at the crime scene has a 50% chance of contamination. The 1 million to 1 probability of guilt has just dropped to 1 in 2 (50%), since there is only a 50% probability that the sample is not contaminated.

Such "evidence" becomes meaningless.

First of all, you're calculating the probability of the evidence given innocence (i.e. P(DNA match | innocent)) and should instead be doing probability of guilt given the evidence (i.e. P(guilt|DNA match)), which, if you'd listen to most biologists, isn't all that high. It's fairly hard to find samples to get DNA analysis from in most crime scenes, since the chance of contamination is always very high. Unfortunately, a society that's led by expectations of CSI-esque evidence believes that it's only a matter of looking hard enough.
Secondly, as Ethelred pointed out, you don't sequence evidence, you gel it.
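
The distinction between the two probabilities can be made concrete with Bayes' rule. The sketch below is illustrative only: the prior probabilities and the assumption that a guilty person always matches are invented for the example, and the one-in-a-million false-match figure is taken from the hypothetical above.

```python
def posterior_guilt(prior_guilt, p_match_given_guilty, p_match_given_innocent):
    """Bayes' rule: P(guilt | match) from P(match | guilt) and P(match | innocent)."""
    numerator = p_match_given_guilty * prior_guilt
    denominator = numerator + p_match_given_innocent * (1 - prior_guilt)
    return numerator / denominator


FALSE_MATCH = 1e-6  # P(DNA match | innocent), the hypothetical 1-in-a-million figure
TRUE_MATCH = 1.0    # assumption: a guilty person's sample always matches

# Assumption: other evidence already narrows the field to 1,000 people.
narrow_pool = posterior_guilt(1 / 1_000, TRUE_MATCH, FALSE_MATCH)

# Assumption: no other evidence, suspect picked from a pool of 1,000,000.
wide_pool = posterior_guilt(1 / 1_000_000, TRUE_MATCH, FALSE_MATCH)

print(f"P(guilt | match), pool of 1,000:     {narrow_pool:.3f}")
print(f"P(guilt | match), pool of 1,000,000: {wide_pool:.3f}")
```

With the same one-in-a-million false-match rate, the probability of guilt given a match depends heavily on the prior, which is why P(DNA match | innocent) cannot be read directly as a probability of guilt.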
gwargh
3 / 5 (2) Feb 21, 2011
Overall, I'd rate this news as slightly alarming, but nothing terribly hampering to genetics. Most researchers don't use any single genome fully, and are only interested in a few selected genes. When working with large comparisons between genomes, regions matching extremely well are scrutinized by comparisons between taxa. For example, if I find a very similar gene in rice and humans, I check several different plants and animals for that gene as well (since if it's an ortholog, we'd expect to see it in most lineages, not just two very distant ones). Contamination is problematic, however, when comparing genes from closely related taxa.
210
3 / 5 (2) Feb 22, 2011
Wha? No, "Contamination interfering with evolution..." arguments?
"Contamination is problematic, however, when comparing genes from closely related taxa"
" But, she says, the catch is that it’s virtually impossible to identify a foreign bit of human DNA in a human genome database."
SHUCKS!!! That is and has been my argument!!
“In sequencing, you have to put all the pieces of the genome together like a big jigsaw puzzle. The pieces that don’t fit stand out,” ... “But if you’re working on a human puzzle, it’s like working on a three-billion piece puzzle, and it’s all black.
“It’s virtually impossible to find human contamination in human genome databases,” she adds, because they simply don’t stand out as anything unusual in a human genome. This..., could lead to some terrible mistakes. INDEED, or bad assumptions!
Ethelred
5 / 5 (1) Feb 23, 2011
No, "Contamination interfering with evolution..." arguments?
Well we had rational posters.
SHUCKS!!! That is and has been my argument!!
Your argument is that the world is young and a psycho god drowned all but 8 humans. For being imperfect despite being created perfectly by an omniscient and omnipotent god that was so inept that his angels were fornicating with his humans.
But if you’re working on a human puzzle, it’s like working on a three-billion piece puzzle, and it’s all black.
No. They used overlapping sections to figure out how to put it together. And that was the stuff done a decade ago. They used one person's DNA. Not surprisingly it turned out to be Craig Venter's for the work done at his labs. Guess whose dog was used for the canine genome. Yes, Venter's.
could lead to some terrible mistakes.
No. Just some errors nothing terrible involved and it can all be fixed for a much cheaper cost than the original work.

Ethelred