November 17, 2022 dialog

The world's largest quantum chemistry dataset to empower new materials design and drug discovery

by Artur Kadurin

Predicting the properties of an object is one of the most natural tasks for machine learning (ML) algorithms, and molecules and crystals are no exception. Every drug discovery or materials design pipeline depends on the ability to predict a future product's physical and chemical properties.

However, in contrast to more conventional domains of ML, such as images or texts, validating the output of your models is much more expensive in both money and time. To assess model quality, you must go to the wet lab to synthesize the structure and then perform real-world experiments to evaluate every single property. In addition, to train a machine learning model to predict molecular properties, you need access to relevant data for every property of interest, and the quality of the model will depend on the size and diversity of your dataset.

A more general approach is to try to predict all the properties of the atomic system ab initio. Luckily, there is a fundamental theory behind the processes occurring on the quantum level. The Schrödinger equation allows us to explicitly calculate what is going on between atoms and electrons. For scientists, it means that we can simulate the behavior of a molecule or material and explicitly calculate its properties, at least in theory.

In practice, the number of calculations needed for a precise solution of the Schrödinger equation grows exponentially with the number of electrons. However, there exists a wide variety of numerical methods that solve it at different levels of precision. These methods form a hierarchy that trades off accuracy against computational cost. Density functional theory (DFT) provides reasonably precise methods at a feasible computational cost for systems of dozens of atoms.

Deep learning for quantum chemistry

Recent advances in deep learning (DL), especially in graph convolutional networks, opened a whole new field of research: neural networks for quantum chemistry. Instead of predicting a specific property of a molecular structure, these methods aim to assess a molecular conformation (the 3D arrangement of the atoms in a molecule) by predicting its quantum properties.

In particular, a number of papers focus on replacing computationally expensive DFT calculations with relatively cheap neural network solutions. The vast majority of these works are limited to experiments performed on only a few structures, or even a single one. This limits generalization and calls into question the applicability of these models to real-world problems.

nablaDFT dataset

On the path to solving the problem of access to suitable data, we at the DL in Life Sciences research group from AIRI, Artificial Intelligence Research Institute, decided to compute and share the biggest (so far) quantum chemistry dataset calculated on the DFT level of theory. The research was published in Physical Chemistry Chemical Physics and performed in collaboration with scientists from the Skolkovo Institute of Science and Technology and the St. Petersburg Department of Steklov Mathematical Institute. Together with the data we reimplement and evaluate several state-of-the-art neural network models on two common tasks: prediction of potential energy (a) and DFT Hamiltonian (b) for a given molecular conformation.

The dataset, available via GitHub, contains over 5 million conformations for over 1 million drug-like molecules, together with quantum properties such as conformational energy, DFT Hamiltonian matrix, wave functions, and many others. A single conformation takes about 5 minutes of CPU time to compute on average, which adds up to about 50 years of CPU time for the whole dataset.
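The CPU-time estimate above can be checked with a few lines of arithmetic. The sketch below uses the rounded figures quoted in the text (5 million conformations, 5 minutes each); the function name and the 365-day year are assumptions made for illustration. Since the dataset contains "over" 5 million conformations, the result is a lower bound that is consistent with the quoted figure of about 50 years.

```python
# Back-of-the-envelope check of the total CPU time quoted in the article.
# Inputs are the rounded figures from the text; a 365-day year is assumed.

MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes


def total_cpu_years(n_conformations: int, minutes_per_conformation: float) -> float:
    """Return the total CPU time in years for computing all conformations."""
    return n_conformations * minutes_per_conformation / MINUTES_PER_YEAR


estimate = total_cpu_years(5_000_000, 5.0)
print(f"{estimate:.1f} CPU-years")  # prints "47.6 CPU-years"
```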

Benchmark results

To benchmark models in different settings, we divide the test set into three subsets:

Molecular conformations for structures present in the training set.
Molecular conformations for structures not present in the training set.
Molecular conformations for structures whose scaffolds are not present in the training set.
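The three subsets listed above can be illustrated with a small sketch. This is not the authors' splitting code: the `Conformation` record, its identifier fields, and the subset names are hypothetical, and the sketch assumes the subsets are disjoint, with a scaffold identifier already computed for every molecule.

```python
from typing import NamedTuple, Set


class Conformation(NamedTuple):
    """Hypothetical record; field names are illustrative, not from the dataset."""
    conformation_id: str
    molecule_id: str   # identifies the molecular structure
    scaffold_id: str   # identifies the scaffold of that structure


def assign_test_subset(conf: Conformation,
                       train_molecules: Set[str],
                       train_scaffolds: Set[str]) -> str:
    """Place a test conformation into one of three disjoint subsets."""
    if conf.molecule_id in train_molecules:
        return "seen_structure"      # new conformation of a training molecule
    if conf.scaffold_id in train_scaffolds:
        return "unseen_structure"    # new molecule, scaffold seen in training
    return "unseen_scaffold"         # neither molecule nor scaffold seen


# Toy data, made up for illustration only.
train = [
    Conformation("c1", "molA", "scafX"),
    Conformation("c2", "molB", "scafY"),
]
test = [
    Conformation("c3", "molA", "scafX"),
    Conformation("c4", "molC", "scafX"),
    Conformation("c5", "molD", "scafZ"),
]

train_molecules = {c.molecule_id for c in train}
train_scaffolds = {c.scaffold_id for c in train}
subsets = {
    c.conformation_id: assign_test_subset(c, train_molecules, train_scaffolds)
    for c in test
}
print(subsets)
```

Evaluating a model separately on each subset shows how much of its accuracy comes from having already seen the molecule or its scaffold during training.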

All the models were trained in a multi-molecular setup. According to our results, the best model achieves a mean absolute error (MAE) of 3.2x10^-2 hartrees (~20 kcal/mol) on the separate structures test set on the task of conformational energy prediction, while the chemical accuracy achievable in a wet lab is about 1 kcal/mol. Not surprisingly, most of the models perform better when tested on new conformations of already-seen molecular structures. Even a simple linear regression model improves from an MAE of 4.7x10^-2 hartrees to 4.0x10^-2 hartrees.
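To relate the errors reported above to the 1 kcal/mol chemical-accuracy target, it helps to convert hartrees to kcal/mol. The sketch below uses the standard conversion factor (1 hartree is approximately 627.509 kcal/mol) and a plain-Python MAE; the function names and the toy energies in the last lines are made up for illustration and are not taken from the benchmark.

```python
from typing import Sequence

HARTREE_TO_KCAL_PER_MOL = 627.509  # standard conversion factor (approximate)


def hartree_to_kcal_per_mol(energy_hartree: float) -> float:
    """Convert an energy from hartrees to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


def mean_absolute_error(predicted: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Average of |prediction - reference| over all samples."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Errors quoted in the article, converted to kcal/mol.
best_mae_kcal = hartree_to_kcal_per_mol(3.2e-2)
linreg_unseen_kcal = hartree_to_kcal_per_mol(4.7e-2)
linreg_seen_kcal = hartree_to_kcal_per_mol(4.0e-2)
print(f"best model: {best_mae_kcal:.1f} kcal/mol")  # prints "best model: 20.1 kcal/mol"

# Toy example with made-up energies in hartrees.
toy_mae = mean_absolute_error([-1.00, -2.05], [-1.02, -2.00])
```

By this conversion the best reported error is roughly twenty times larger than the chemical-accuracy target of about 1 kcal/mol.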

Conclusion

Though it remains a challenge to obtain models that are close to chemical accuracy, our experimental evidence shows that larger datasets lead to better ML models.

We plan to keep expanding the dataset in order to contribute to the development of artificial intelligence technologies, and we invite the community to contribute to the benchmark by evaluating novel models on the proposed dataset.

This story is part of Science X Dialog, where researchers can report findings from their published research articles. Visit this page for information about ScienceX Dialog and how to participate.

More information: Kuzma Khrabrov et al, nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset, Physical Chemistry Chemical Physics (2022). DOI: 10.1039/D2CP03966D

Journal information: Physical Chemistry Chemical Physics

Artur Kadurin is former Chief AI Officer at Insilico Medicine, a company utilizing Deep Learning techniques for drug discovery and aging research. He is now leading the research group "DL in Life Sciences" at Artificial Intelligence Research Institute, AIRI. He and colleague Kuzma Khrabrov can be contacted via email (kadurin@airi.net, khrabrov@airi.net) if you need any help in running your experiments on their data.

Citation: The world's largest quantum chemistry dataset to empower new materials design and drug discovery (2022, November 17) retrieved 21 June 2024 from https://phys.org/news/2022-11-world-largest-quantum-chemistry-dataset.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
