December 13, 2022 dialog

Hashing complements alignment-based methods for bacterial genome annotation

by Oliver Schwengers

DNA sequencing has changed biology like nothing else since the theory of the origin of species. In particular, the way we investigate microbial life has fundamentally changed. Today, we can sequence DNA with unprecedented speed and resolution, so that we can even sequence the genomes of microbes that have never been described or cultivated before. At the same time, whole-genome sequencing of known, mostly pathogenic, species has become a routine methodology carried out worldwide on a daily basis.

This, in turn, constantly increases the amount of publicly stored sequences, which are becoming a treasure trove and a hurdle at the same time. For many sequence-based computational analyses, comprehensive and thorough genome annotations play a crucial role as a common starting point. For a long time, genome annotation has been perceived as a solved problem.

However, the daily influx of new genome and gene sequences into public databases poses new challenges for the rapid annotation of microbial genomes. In particular, the search for similar or identical protein-coding genes has become a large-scale bioinformatics search problem, like looking for a needle in a haystack, and the haystack is astonishingly large nowadays.

In this context, we are facing two diverging developments. On the one hand, public databases are flooded with similar and near-identical protein sequences. These include sequences of utmost relevance, such as antimicrobial resistance genes and virulence factors, which can be cross-linked with a wealth of useful information from many public databases. On the other hand, countless new sequences emerge from metagenome projects that sequence what is often referred to as microbial dark matter. For many of these sequences, no additional information is available at all.

Two distinct bioinformatic challenges arise from this situation: first, the exact identification of known sequences, and second, the functional description of rare or even unknown sequences, both on the order of hundreds of millions. To address these challenges, we combined an alignment-free protein sequence hashing strategy with two hierarchical sequence alignment steps. Our work was published in the journal Microbial Genomics.

To exactly identify known protein sequences, we used a hash function that maps input data of arbitrary length to fixed-size binary fingerprints. Such hash functions are well known from checksum calculations because of one important characteristic: they are extremely fast to compute, much faster than traditional sequence alignments.
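As a minimal sketch of this idea, the Python snippet below computes a fixed-size fingerprint for protein sequences of different lengths. MD5 from the standard library is used purely as an illustrative assumption, since the text above does not name a specific hash function, and the example sequences are made up.

```python
import hashlib


def fingerprint(protein_seq: str) -> bytes:
    """Return a fixed-size binary fingerprint of a protein sequence.

    Assumption: MD5 is used only for illustration; any fast hash
    function with a fixed-size digest would play the same role.
    """
    # Normalize so that case or surrounding whitespace does not change the result.
    normalized = protein_seq.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).digest()


# Hypothetical sequences of very different lengths.
short_seq = "MKT"
long_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 10

fp_short = fingerprint(short_seq)
fp_long = fingerprint(long_seq)

# Both fingerprints have the same size regardless of the input length.
print(len(short_seq), len(fp_short))
print(len(long_seq), len(fp_long))
```

Identical sequences always yield identical fingerprints, so comparing two fingerprints can stand in for comparing two full sequences when only exact matches are of interest.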

To take advantage of this, we created a compact, local database with hash fingerprints of more than 220 million protein sequences. In a second step, we pre-assigned high-quality annotations and cross-links to further external databases. Of note, these demanding large-scale computations are required only once, at the database compilation step, which we repeat regularly upon new releases. For the actual genome annotation process, we can use this dense information storage at runtime and thus achieve exact sequence identifications and ultra-fast lookups of related information.
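The split between a one-time compilation step and fast runtime lookups might be sketched as follows. This is not the actual database layout: the table schema, the use of SQLite, the MD5 hash, the extra length check, and all sequences and annotations are assumptions made up for illustration.

```python
import hashlib
import sqlite3


def fingerprint(seq: str) -> bytes:
    # Assumption: MD5 stands in for whichever hash function is actually used.
    return hashlib.md5(seq.upper().encode("ascii")).digest()


def build_database(known_proteins):
    """One-time compilation: store fingerprints with pre-assigned annotations."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE proteins ("
        "hash BLOB PRIMARY KEY, length INTEGER, product TEXT, gene TEXT)"
    )
    rows = [
        (fingerprint(seq), len(seq), product, gene)
        for seq, product, gene in known_proteins
    ]
    db.executemany("INSERT INTO proteins VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db


def lookup(db, seq):
    """Runtime: exact identification via one indexed lookup, without alignment."""
    # The length comparison is an illustrative extra guard against hash collisions.
    return db.execute(
        "SELECT product, gene FROM proteins WHERE hash = ? AND length = ?",
        (fingerprint(seq), len(seq)),
    ).fetchone()  # None if the sequence is not in the database


# Hypothetical database content (sequence, product, gene symbol).
db = build_database([
    ("MKTAYIAKQR", "example product A", "geneA"),
    ("MSLNDEELRK", "example product B", "geneB"),
])

known_hit = lookup(db, "MKTAYIAKQR")
unknown_hit = lookup(db, "MKTAYIAKQK")
```

Only the fingerprint and the annotation columns are stored, not the sequences themselves, which is one way such a database can stay compact.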

We also reduced overall storage requirements to one third, even though rich additional annotation information is included, such as gene symbols, EC numbers, GO terms, protein products and external database accessions. This information is a valuable resource to connect sequences at hand with related sequences stored in public databases.

Interestingly, this alignment-free approach also substantially reduced the number of computationally expensive alignments, which follow as a fallback search strategy for unidentified sequences. In a hierarchical two-step process, the remaining protein sequences were searched via traditional sequence alignments against representative sequences of protein clusters. First, more than 99 million dense protein clusters were screened for matches, followed by a second search with more relaxed thresholds that screened more than 13 million wider clusters.
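The order of these search steps can be illustrated with a small sketch. It is a toy under stated assumptions: `difflib.SequenceMatcher` is only a stand-in for a real sequence aligner, the thresholds of 0.9 and 0.5 are invented because the text gives no values, MD5 is an assumed hash function, and all sequences and cluster names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(seq):
    # Assumption: MD5 stands in for the hash function actually used.
    return hashlib.md5(seq.encode("ascii")).digest()


def similarity(a, b):
    # Stand-in for a real sequence alignment; NOT an actual aligner.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_match(query, representatives, threshold):
    """Return the best-scoring representative at or above the threshold, else None."""
    best_name, best_score = None, threshold
    for name, rep_seq in representatives.items():
        score = similarity(query, rep_seq)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def annotate(query, exact_db, dense_clusters, wide_clusters,
             strict=0.9, relaxed=0.5):
    # Step 1: alignment-free exact identification via the hash fingerprint.
    hit = exact_db.get(fingerprint(query))
    if hit is not None:
        return ("exact", hit)
    # Step 2: fallback search against dense clusters with a strict threshold.
    hit = best_match(query, dense_clusters, strict)
    if hit is not None:
        return ("dense_cluster", hit)
    # Step 3: second fallback against wider clusters with a relaxed threshold.
    hit = best_match(query, wide_clusters, relaxed)
    if hit is not None:
        return ("wide_cluster", hit)
    return ("unidentified", None)


# Hypothetical data for the three levels.
exact_db = {fingerprint("MKTAYIAKQRQISFVK"): "protein_1"}
dense_clusters = {"dense_A": "MSLNDEELRKAGGTWQPH"}
wide_clusters = {"wide_B": "MAAAAGGGGWWWWPPPPLLLL"}

for query in ("MKTAYIAKQRQISFVK", "MSLNDEELRKAGGTWQPY",
              "MAAAAGGGGWWWWKKKKDDDD", "CCCCCCCC"):
    print(query, annotate(query, exact_db, dense_clusters, wide_clusters))
```

Every query resolved in step 1 never reaches the expensive comparison loops of steps 2 and 3, which is the effect described above.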

Potentially negative runtime effects of these huge protein cluster databases were mitigated by the described alignment-free sequence identification approach. Finally, all annotation information for identified protein sequences and related clusters was combined, giving specific information precedence over more general information.
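One simple way to picture this precedence rule is a merge over annotation records ordered from most specific to most general. The field names and values below are placeholders invented for illustration, not real annotations, and the function is a sketch rather than the actual implementation.

```python
def combine_annotations(*sources):
    """Merge annotation dicts given in order from most specific to most general.

    For each field, the first non-empty value wins, so specific
    information takes precedence over more general information.
    """
    combined = {}
    for source in sources:
        for key, value in source.items():
            if value and key not in combined:
                combined[key] = value
    return combined


# Hypothetical annotation records for one protein, most specific first.
exact_sequence = {"gene": "geneA", "product": "specific product"}
dense_cluster = {"product": "family-level product", "ec": "placeholder-ec"}
wide_cluster = {"product": "general product", "go": "placeholder-go"}

combined = combine_annotations(exact_sequence, dense_cluster, wide_cluster)
print(combined)
```

In this example the product name comes from the exact sequence match, while the cluster levels only fill in fields that the more specific record lacks.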

This hierarchical approach is part of a larger annotation workflow, implemented in the software tool Bakta, that also comprises the annotation of non-coding RNA and DNA features, e.g., tRNAs, rRNAs, ncRNAs, CRISPR arrays, origins of replication and many more. Bakta is available as a command line tool and as a scalable web service at https://bakta.computational.bio

This story is part of Science X Dialog, where researchers can report findings from their published research articles. Visit this page for information about ScienceX Dialog and how to participate.

More information: Oliver Schwengers et al, Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification, Microbial Genomics (2021). DOI: 10.1099/mgen.0.000685

Oliver Schwengers is a postdoctoral researcher in microbial bioinformatics in the Bioinformatics and Systems Biology department at JLU Giessen. His research activities focus on the analysis and characterization of bacterial genomes and plasmids based on whole-genome sequencing data, as well as the development of fully automated and scalable bioinformatics software tools. He regularly collaborates across disciplines with researchers from medical, environmental and space microbiology.

Citation: Hashing complements alignment-based methods for bacterial genome annotation (2022, December 13) retrieved 17 July 2024 from https://phys.org/news/2022-12-hashing-complements-alignment-based-methods-bacterial.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
