Enabling Thermochemistry Estimation using Deep Learning
Integrated Uncertainty Estimation and Automatic Selection of Complex Molecules using Active Learning
Science Achievement
MIT researchers developed an automated system that continually performs quantum chemistry calculations and uses the results to retrain a deep learning model for predicting the thermochemistry of complex polycyclic molecules. A novel approach for estimating the uncertainty of these predictions identifies which new molecules should be refined with quantum chemistry and added to the training data set. NERSC resources made it possible to run many quantum chemistry calculations automatically and in parallel.
Impact
Detailed chemical mechanisms are important for modeling and understanding combustion and soot formation. Because such mechanisms contain many hundreds of species and thousands of reactions, they must be built with automated mechanism generation tools, which rely on rapid estimation of thermochemical parameters for potentially millions of candidate species. Quantum chemistry is too expensive for all but the most important of these molecules, and conventional estimation methods cannot easily adapt to the many unusual molecules encountered during mechanism generation. The machine learning estimator provides a much more versatile framework that readily incorporates uncommon molecular structures, such as those with multiple fused rings, and thus facilitates the investigation of more energy-efficient and environmentally beneficial processes.
Research Details
A machine learning model that predicts thermochemical parameters of molecules (enthalpy of formation, entropy, and heat capacities) was initially trained on a large published dataset of molecular structures. As input, the model requires only a two-dimensional graph representation of a molecule augmented with basic features such as atom types and valence structure. To capture the complex structure of polycyclic molecules accurately, each atom was given an additional feature: the number of rings of each size that contain it.
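As a rough illustration of the ring-membership feature described above, the following Python sketch counts, for each atom, how many rings of each size contain it. It is not the researchers' implementation: the function name `ring_count_features`, the tracked sizes in `RING_SIZES`, and the toy ring lists are assumptions made for this example, and ring perception is taken as already done by some other tool.

```python
from collections import Counter

# Hypothetical choice of ring sizes tracked by the feature vector (an assumption).
RING_SIZES = (3, 4, 5, 6, 7, 8)


def ring_count_features(num_atoms, rings):
    """Return, for each atom, how many rings of each tracked size contain it.

    `rings` is a list of rings, each given as a list of atom indices.
    Finding the rings in the molecular graph is assumed to happen elsewhere.
    """
    counts = [Counter() for _ in range(num_atoms)]
    for ring in rings:
        size = len(ring)
        for atom in set(ring):
            counts[atom][size] += 1
    # One fixed-length vector per atom, ordered like RING_SIZES.
    return [[c[size] for size in RING_SIZES] for c in counts]


# Toy fused bicyclic skeleton: a six-membered ring and a five-membered ring
# sharing atoms 4 and 5 (nine ring atoms in total).
rings = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8]]
features = ring_count_features(9, rings)

for atom_index, vector in enumerate(features):
    print(atom_index, vector)
```

In this toy skeleton the two shared atoms (4 and 5) are counted in both a five-membered and a six-membered ring, which is the kind of signal that separates fused-ring atoms from atoms in a single ring.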
The uncertainty of each thermochemistry prediction can also be estimated. This is done by effectively training an ensemble of models at once, varying which units in the neural network are turned on. In an approach called active learning, these uncertainty estimates are used to automatically select new molecules that need refinement with quantum chemistry calculations. The calculations were performed efficiently in parallel on the Cori supercomputer at NERSC. NERSC was also used to calculate a highly accurate benchmark dataset for comparing the machine learning model, low-level quantum chemistry, and conventional thermochemistry estimation methods.
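The idea of varying which units are turned on resembles dropout-based ensembling. Assuming that reading, the sketch below shows the general mechanics on a toy one-hidden-layer network: repeated stochastic forward passes give a mean prediction and a spread, and the candidates with the largest spread are selected for refinement. The network, its weights (`TOY_WEIGHTS`), the scalar molecular descriptor, and all function names are hypothetical and stand in for the much richer graph-based model described here.

```python
import random
import statistics


def stochastic_predict(x, weights, keep_prob, rng):
    """One forward pass of a toy one-hidden-layer network with random unit dropout."""
    w_in, w_out = weights
    total = 0.0
    for wi, wo in zip(w_in, w_out):
        if rng.random() < keep_prob:           # this unit is "turned on" for the pass
            hidden = max(0.0, wi * x)          # ReLU activation
            total += wo * hidden / keep_prob   # inverted-dropout scaling
    return total


def predict_with_uncertainty(x, weights, keep_prob=0.8, passes=200, seed=0):
    """Mean and sample standard deviation over many stochastic passes."""
    # The same seed is reused for every input so that spreads are compared
    # under identical dropout masks.
    rng = random.Random(seed)
    samples = [stochastic_predict(x, weights, keep_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)


def select_for_refinement(candidates, weights, budget):
    """Pick the `budget` candidates whose predictions have the largest spread."""
    scored = [(predict_with_uncertainty(x, weights)[1], name) for name, x in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:budget]]


# Hypothetical weights (input-to-hidden, hidden-to-output) and candidate descriptors.
TOY_WEIGHTS = ([0.5, -0.3, 0.8, 0.1], [1.0, 2.0, -0.5, 0.7])
candidates = [("mol_a", 0.1), ("mol_b", 2.0), ("mol_c", 0.5)]

selected = select_for_refinement(candidates, TOY_WEIGHTS, budget=1)
print("send to quantum chemistry:", selected)
```

In an active learning loop, the selected molecules would be computed with quantum chemistry, added to the training set, and the model retrained before the next round of selection.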
In ongoing work, Cori is being used to perform thousands of high-level quantum chemistry calculations (explicitly correlated coupled cluster) in order to improve the accuracy of the machine learning model. Because these calculations are expensive, far fewer high-level results can be obtained than are already available as training data for the model. The researchers are therefore using the knowledge encoded in the existing machine learning model to initialize the parameters of a second model, which enables accurate predictions from much less data.
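The mechanics of initializing a second model from the first can be sketched with a deliberately simple stand-in. The example below fits a straight line by gradient descent on plentiful "low-level" data, then fine-tunes on three "high-level" points starting either from the pretrained parameters (warm start) or from zero (cold start). Everything here is an assumption made for illustration: the linear model, the synthetic data, the learning rate, and the function names do not come from the researchers' work.

```python
def fit_linear(xs, ys, w, b, lr=0.1, epochs=5):
    """Fit y ~ w*x + b by full-batch gradient descent, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mean_abs_error(params, xs, ys):
    """Mean absolute prediction error of the line `params` on (xs, ys)."""
    w, b = params
    return sum(abs(w * x + b - y) for x, y in zip(xs, ys)) / len(xs)


# Plentiful synthetic "low-level" data (cheap but slightly biased).
xs_low = [i / 10 for i in range(-20, 21)]
ys_low = [2.0 * x + 1.0 for x in xs_low]
pretrained = fit_linear(xs_low, ys_low, w=0.0, b=0.0, epochs=200)

# Only a few synthetic "high-level" points (accurate but expensive).
xs_high = [-1.0, 0.0, 1.0]
ys_high = [2.1 * x + 0.8 for x in xs_high]

# Same short fine-tuning run, two different starting points.
warm = fit_linear(xs_high, ys_high, *pretrained, epochs=5)
cold = fit_linear(xs_high, ys_high, 0.0, 0.0, epochs=5)

# Held-out high-level points for comparison.
xs_test = [-1.5, 0.5, 1.5]
ys_test = [2.1 * x + 0.8 for x in xs_test]
warm_error = mean_abs_error(warm, xs_test, ys_test)
cold_error = mean_abs_error(cold, xs_test, ys_test)
print("warm start error:", warm_error)
print("cold start error:", cold_error)
```

Because the pretrained parameters already sit close to the high-level answer, the warm-started fit needs only a small correction, which is the intuition behind reusing a model trained on abundant data when accurate data are scarce.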
The team used Cori to run QChem and Molpro calculations (approximately 800,000 CPU hours across 10,000 jobs). Data were saved on the HPSS tape archive. NERSC staff compiled a new, OpenMP-based version of QChem to reduce the load that the MPI version had placed on the Slurm scheduler.
Related Links
Li, Yi-Pei; Han, Kehang; Grambow, Colin A.; Green, William H., "Self-Evolving Machine: A Continuously Improving Model for Molecular Thermochemistry," Journal of Physical Chemistry A, 123:2142-2152 (March 14, 2019), doi:10.1021/acs.jpca.8b10789
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.