New database of 660,000 assembled bacterial genomes sheds light on the evolution of bacteria  

Fig 1. Species composition of the 639,981 high-quality assemblies. (A) Relative proportions of species to the data as a pie chart. Note that 90% of the assemblies are from 20 bacterial species. (B) Fraction of assemblies covered by accumulating bacterial species. (C) Tracking proportions of the top 10 bacterial species for each year. The data underlying this figure may be found in Credit: DOI: 10.1371/journal.pbio.3001421

A vast, curated collection of bacterial genomes has been created that allows the community unprecedented access to data.

Ninety percent of the bacterial genomes sequenced belong to a restricted set of only 20 species, out of an estimated 45,000, highlighting the knowledge gaps in available genomic data and showing how this distorts our view of bacterial diversity, new research has suggested.

In a new study from the Wellcome Sanger Institute and EMBL's European Bioinformatics Institute (EMBL-EBI), researchers standardized all bacterial genome data held in the European Nucleotide Archive (ENA) before 2019, creating a searchable and accessible database of genomic assemblies.

In the research, published on 9 November 2021 in PLOS Biology, researchers reviewed all of the bacterial data available as of November 2018 and assembled it into over 660,000 genomes. This has been released as a new open access database designed to help scientists all around the world answer questions on bacterial evolution, by considering all data in a standardized and comprehensive manner.

In addition to this, over 300,000 of these genomes had never been fully assembled before. This study highlights the composition of the current genomic data resources, showing biases in the data submitted to these archives and therefore our window into bacterial diversity.

Genomic data exist in public archives as unprocessed raw sequences, or as assembled data that have been processed with multiple different techniques. When these are assembled in a standardized and comprehensive way, people can search and analyze all existing data, i.e., the whole genetic picture. Processing the whole database in this way allows data to be seen in this wider context, rather than being limited to looking at snapshots of genomic data archives in isolation.

While analyzing the data contained in the public archives, the researchers were surprised to find that the majority of data come from the same 20 species of bacteria. Notably, almost one third of the total data came from Salmonella enterica, a bacterium well known to cause foodborne illness.

Whilst Salmonella infections can lead to hospitalisations and are an important cause of death worldwide, there are many other important pathogens that are not well represented in this data archive. There is also a lack of data on the bacteria known to keep us healthy, such as those making up the gut microbiome.

By highlighting the gaps in the data, researchers hope to ensure that others are aware how the data are skewed, how this might impact on our interpretation of the data, and to encourage discussion around these issues in research. The dataset is now live and available for free access across the globe.

"The exercise gave us a detailed overview of the bacteria sequenced over the last 30 years. It confirms that researchers have been focusing on a small number of pathogens from a restricted number of sources. Such a narrow focus restricts our ability to truly understand key questions in bacterial evolution and public health, including the sources of antimicrobial resistance. We know that the genes that confer antimicrobial resistance exist in a much wider range of species than just those few pathogens that are the focus of attention for funders. By expanding and standardizing the archive data, we can get a clearer picture of what is going on. This study highlights the need to widen the range of bacterial species we sequence, and to create better mechanisms for sharing the data with the community, to help answer priority questions for researchers and public health authorities alike," says Dr. Zamin Iqbal, co-senior author and Group Leader at EMBL-EBI.

"I study genomic elements that are able to move freely between different bacteria, many of which can contribute to the spread of antimicrobial resistance genes. To do this, I need to search and analyze as many [genomes] as possible in a simple and fast way. Public data can be quite messy and need to be processed uniformly, including quality control, before they can be used for this type of analysis. So along with a few colleagues, we decided to 'tidy up' the data and make it easier for everyone to ask essential research questions," says Dr. Grace Blackwell, first author and joint EMBL-EBI and Sanger Institute postdoctoral fellow.

"We rely on the genomic archives to provide the context to our research on public health questions and for basic science. It is against these data we identify new species, view the emergence of new pathogens or antimicrobial resistance genes, or see the pathways through which bacteria move across the globe. It is our intellectual point of reference. By processing it uniformly we have tried to show the huge opportunities and wealth of biological data that are hidden in these genomes as well as making people aware of any possible limitations. This database will enable new opportunities for science and we want to ensure people are able to access it fully through this collaborative study," says Professor Nicholas Thomson, co-senior author and head of the Parasites and Microbes Programme at the Wellcome Sanger Institute.

More information: Grace A. Blackwell et al, Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences, PLOS Biology (2021). DOI: 10.1371/journal.pbio.3001421

Journal information: PLoS Biology

Citation: New database of 660,000 assembled bacterial genomes sheds light on the evolution of bacteria   (2021, November 10) retrieved 17 June 2024 from