Emergence of structure when scaling language models to 15 billion parameters. (A) Predicted contact probabilities (bottom right) and actual contact precision (top left) for PDB 3LYW. A contact is a positive prediction if it is within the top L most likely contacts for a sequence of length L. (B to D) Unsupervised contact prediction performance [long-range precision at L (P@L)] (SM A.2.1) for all scales of the ESM-2 model. (B) Performance binned by the number of MMseqs hits when searching the training set. Larger ESM-2 models perform better at all levels; the 150-million-parameter ESM-2 model is comparable to the 650-million-parameter ESM-1b model. (C) Trajectory of improvement as model scale increases for sequences with different numbers of MMseqs hits. (D) Left-to-right shows models from 8 million to 15 billion parameters, comparing the smaller model (x axis) against the next larger model (y axis) through unsupervised contact precision. Points are PDB proteins colored by change in perplexity for the sequence between the smaller and larger model. Sequences with large changes in contact prediction performance also exhibit large changes in language model understanding measured by perplexity. (E) TM-score on combined CASP14 and CAMEO test sets. Predictions are made by using structure module–only head on top of language models. Points are colored by the change in perplexity between the models. (F) Structure predictions on CAMEO structure 7QQA and CASP target T1056 at all ESM-2 model scales, colored by pLDDT (pink, low; teal, high). For 7QQA, prediction accuracy improves at the 150-million-parameter threshold. For T1056, prediction accuracy improves at the 15-billion-parameter threshold. Credit: Science (2023). DOI: 10.1126/science.ade2574
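For readers who want the caption's metric in concrete terms, here is a minimal sketch (not the paper's evaluation code) of long-range precision at L: take the L highest-probability contacts for a length-L sequence, restricted to residue pairs separated by at least 24 positions (a common convention for "long-range," assumed here), and measure the fraction that are true contacts.

```python
import numpy as np

def precision_at_L(pred: np.ndarray, true: np.ndarray, min_sep: int = 24) -> float:
    """Long-range P@L: pred is an L x L contact-probability matrix,
    true is an L x L binary matrix of observed contacts."""
    L = pred.shape[0]
    i, j = np.triu_indices(L, k=min_sep)       # long-range residue pairs only
    order = np.argsort(pred[i, j])[::-1][:L]   # indices of the top-L predictions
    return float(true[i[order], j[order]].mean())  # fraction that are real contacts
```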

Researchers from Facebook AI Research (FAIR) at Meta AI have published a paper in the journal Science detailing a machine-learning-generated database of 617 million predicted protein structures. ESMFold, the language-model-based system described in the paper, predicted the structures up to 60 times faster than DeepMind's AlphaFold2, though with lower reported accuracy.
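The model weights and inference code are published in Meta's facebookresearch/esm repository. A minimal sketch of single-sequence prediction, following the repository's documented usage (it assumes `pip install "fair-esm[esmfold]"` and a CUDA GPU; the sequence is an arbitrary example):

```python
import torch
import esm

# Load the published ESMFold model and move it to the GPU
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_text = model.infer_pdb(sequence)  # atomic coordinates as a PDB-format string

with open("prediction.pdb", "w") as f:
    f.write(pdb_text)
```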

The fold predictions were completed in just two weeks on a cluster of about 2,000 GPUs. The input sequences ranged from 20 to 1,024 amino acids in length. About 365 million structures were predicted with good confidence, and roughly 225 million of those fell within the high-confidence range.
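ESMFold grades each prediction with two confidence scores: pLDDT (per-residue confidence, averaged over the chain) and pTM (a predicted TM-score for the whole structure). Below is a hedged sketch of the kind of bucketing implied above; the cutoffs of 0.5 for "good" and 0.7 for "high" are assumptions for illustration, not numbers quoted in this article.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    sequence_id: str
    mean_plddt: float  # per-residue confidence, averaged over the chain (0-1)
    ptm: float         # predicted TM-score for the whole chain (0-1)

def confidence_bucket(p: Prediction) -> str:
    """Assign a prediction to a confidence tier (thresholds are assumed)."""
    if p.mean_plddt > 0.7 and p.ptm > 0.7:
        return "high"
    if p.mean_plddt > 0.5 and p.ptm > 0.5:
        return "good"
    return "low"
```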

According to the report, "Evolutionary-scale prediction of atomic-level protein structure with a language model," a random sample of 1 million high-confidence results showed that 767,580 proteins have a sequence identity below 90% to any sequence in UniRef90, a database of known protein sequences. The researchers take this as an indication that these proteins are distinct from existing UniRef90 sequences.
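A sketch of how such a novelty check can be run in practice with MMseqs2, the search tool referenced in the paper's figure caption: search the predicted proteins against UniRef90 and keep the queries whose best hit falls below 90% identity. File names here are illustrative, and queries with no hits at all (also novel) would need separate handling.

```python
import subprocess

# All-vs-database search; MMseqs2 writes BLAST-style tabular output to hits.m8
subprocess.run(
    ["mmseqs", "easy-search", "queries.fasta", "uniref90.fasta", "hits.m8", "tmp"],
    check=True,
)

# Track the best identity seen for each query.
# In MMseqs2's default tabular format, column 3 is the identity as a fraction (0-1).
best_identity: dict[str, float] = {}
with open("hits.m8") as f:
    for line in f:
        query, _target, fident, *_ = line.split("\t")
        best_identity[query] = max(best_identity.get(query, 0.0), float(fident))

novel = [q for q, ident in best_identity.items() if ident < 0.90]
```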

The Meta AI team then compared the sample of predicted structures with known structures in the Protein Data Bank (PDB), a database of three-dimensional protein structures. At a TM-score threshold of 0.5, 12.6% of the sample (125,765 proteins) had no match to a known structure. Based on this, the researchers estimate that about 28 million of the high-confidence predictions (12.6% of 225 million) could characterize regions of protein structure that are distant from existing knowledge.
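The extrapolation in the last sentence is simple arithmetic, reproduced here for clarity:

```python
high_confidence = 225_000_000  # high-confidence predictions
novel_fraction = 0.126         # share of the sample with no PDB match at TM-score 0.5

print(f"{high_confidence * novel_fraction:,.0f}")  # -> 28,350,000, i.e. ~28 million
```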

Predictions based on sequences

A protein begins as a linear sequence of nucleotides copied from DNA (transcription), creating messenger RNA (mRNA), a raw-ingredient wish list for the protein it will become. The mRNA nucleotides are then translated into amino acids (the raw ingredients). This chain of amino acids then undergoes an incredible transformation into a complex three-dimensional folded shape that, depending on its structure, carries out specific, intricate cellular functions.
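The transcription and translation steps described above are mechanical enough to demonstrate in a few lines. This sketch uses Biopython (assumed installed via `pip install biopython`) on a toy coding sequence whose translation happens to match the start of human hemoglobin alpha:

```python
from Bio.Seq import Seq

dna = Seq("ATGGTGCTGTCTCCTGCCGACAAGACC")  # toy coding sequence (9 codons)
mrna = dna.transcribe()                   # DNA -> mRNA: T becomes U
protein = mrna.translate()                # mRNA codons -> amino acids

print(mrna)     # AUGGUGCUGUCUCCUGCCGACAAGACC
print(protein)  # MVLSPADKT
```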

How a protein or enzyme folds in part determines its function because the fold limits and optimizes what the molecule can interact with. The structure creates an opening, or "lock," that only operates with the correct molecular "key." People have been exploiting these lock-and-key mechanisms for everything from beer brewing to textiles and biofuels, without a detailed understanding of how the proteins involved are actually folded.

Laundry detergents typically contain several types of enzymes, some of which are cellulases that break down plant material. When a cellulase enzyme encounters cellulose from a grass stain, the cellulose becomes the key that fits the lock: the enzyme triggers a chemical reaction that breaks down the bonds within the grass stain. The same enzyme will do nothing when it encounters a lipstick or grease stain; that may be a job for another enzyme.

A single protein might perform a task thousands or even millions of times per second without breaking, offering industries a low-energy powerhouse of a catalyst and making enzymes an instrumental technology.

Every system in our body also relies on proteins to carry out biological functions. Because a protein's folded structure determines which activities it can engage in, understanding that structure is critical to understanding how proteins work and to investigating the causes of disease.

The ability to predict how a protein will fold from its primary sequence of amino acids (the raw ingredients) would allow medical researchers to better understand protein–metabolite interactions and biological functions throughout the body. This higher-resolution understanding could identify hidden disease traits, accelerate research into new or better treatments, and reshape modern medicine. Understanding precisely how structure follows from the raw ingredients (translated mRNA) would also allow researchers to build custom proteins to perform in healthcare and industry.

In the decades preceding AI prediction models, scientists experimentally determined the structures of about 190,000 proteins of interest. Machine learning has now generated hundreds of millions of predictions that still need to be confirmed and studied to be useful. While not yet reliable enough to replace slower, methodical X-ray crystallography for structure or controlled assay experiments for function, AI is just getting started. The knowledge gained in the decades to come will likely eclipse everything that came before.

More information: Zeming Lin et al, Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023). DOI: 10.1126/science.ade2574

Journal information: Science