Message-passing neural networks for high-throughput polymer screening

Machine learning methods have shown promise in predicting molecular properties, and given sufficient training data machine learning approaches can enable rapid high-throughput virtual screening of large libraries of compounds. Graph-based neural network architectures have emerged in recent years as the most successful approach for predictions based on molecular structure, and have consistently achieved the best performance on benchmark quantum chemical datasets. However, these models have typically required optimized 3D structural information for the molecule to achieve the highest accuracy. These 3D geometries are costly to compute for high levels of theory, limiting the applicability and practicality of machine learning methods in high-throughput screening applications. In this study, we present a new database of candidate molecules for organic photovoltaic applications, comprising approximately 91,000 unique chemical structures.Compared to existing datasets, this dataset contains substantially larger molecules (up to 200 atoms) as well as extrapolated properties for long polymer chains. We show that message-passing neural networks trained with and without 3D structural information for these molecules achieve similar accuracy, comparable to state-of-the-art methods on existing benchmark datasets. These results therefore emphasize that for larger molecules with practical applications, near-optimal prediction results can be obtained without using optimized 3D geometry as an input. We further show that learned molecular representations can be leveraged to reduce the training data required to transfer predictions to a new DFT functional.


I. INTRODUCTION
High-throughput computational screening offers the ability to explore large regions of chemical space for particular functionality, greatly enhancing the efficiency of material development efforts [1][2][3] .Due to its favorable balance between computational cost and chemical accuracy, density functional theory (DFT) has served as the workhorse of high-throughput computational material design.However, while DFT sacrifices chemical accuracy for numerical efficiency, DFT calculations are still too slow to screen the vast combinatorial landscape of potential chemical structures 4,5 .As an alternative to detailed quantum chemistry calculations, fully empirical machine learning (ML) predictions offer calculation times nearly six orders of magnitude faster than DFT (O(10 −3 s) for ML, O(10 3 s) for DFT on approximately 30 heavy atom molecules).Machine learning approaches have recently been effective in reproducing DFT results given sufficient training data 6 and therefore offer an opportunity to efficiently screen much larger libraries of compounds without further reduction in chemical fidelity.
Developing ML pipelines for molecular property prediction often involves encoding variable-sized molecules as a finitedimensional vector.Traditional approaches use group contribution methods, molecular fingerprints, and molecular descriptors to convert molecular structures into a suitable input for dense neural networks or other ML models [7][8][9][10][11][12][13] .However, hand-engineered molecular features may not sufficiently capture all the variability present in the space of chemically a) Electronic mail: peter.stjohn@nrel.govfeasible compounds.Neural network architectures that operate directly on graph-valued inputs have been developed 14 , allowing 'end-to-end' learning on molecular space.In this approach, models simultaneously learn both how to extract appropriate features as well as how to use these features to make accurate predictions.End-to-end learning techniques have supplanted traditional methods in image recognition and computer translation, similar applications where determining a suitable fixed-size numerical representation of the input data is difficult.
A number of approaches for end-to-end learning on molecules have recently been unified into a single theoretical framework known as Message Passing Neural Networks (MPNNs), and even more recently as Graph Networks 15,16 .In MPNNs, predictions are generated from input graphs with node and edge features.The network comprises a sequence of layers, including a number of message passing layers and a readout layer.In the message passing layers, node-level state vectors are updated according to graph's connectivity and the current states of neighboring nodes.Following a number of message passing layers, the readout layer generates a single graph-level vector from node-level states.These networks have demonstrated best-in-class performance on all properties in the QM9 computational dataset, a benchmark dataset for molecular property prediction consisting of DFToptimized 3D coordinates and energies for 134, 000 molecules with nine or fewer heavy atoms 17 .Further modifications of the MPNN framework have demonstrated even higher accuracies [18][19][20] .However, both Gilmer et al. 15 and more recent studies have noted that optimized, equilibrium 3D molecular geometries were required to achieve optimal accuracy on the QM9 dataset.Since obtaining minimum-energy atomic coordinates is a numerically intensive task, this requirement is arXiv:1807.10363v2[physics.comp-ph]5 Apr 2019 limiting for applications in high-throughput chemical screening -particularly for molecules with a large number of atoms.
While effective, deep learning requires large amounts of data in order to learn appropriate feature representations 21 .However, many applications of deep learning have benefited from transfer learning, where weights from a neural network trained on a large dataset are used to initialize weights for a related task with limited data 22 .In this way, the model's ability to extract useful features from inputs -learned from the larger dataset -is transferred to the new regression task, improving predictive accuracy with fewer training samples.In the molecular space, previous studies have shown that models are able to successfully predict molecules outside their training set 23,24 , improve their predictive accuracy with additional training on molecules from a different distribution than the prediction target 25 , and estimate non-equilibrium atomic energies at a higher level of theory by pretraining networks on lower-level calculations 26 .
In this study, we apply a MPNN to a newly developed computational dataset of 91, 000 molecules with optoelectronic calculations for organic photovoltaic (OPV) applications.For OPV applications, single-molecule electronic properties play a role in determining overall device efficiency [27][28][29] , and the search space of molecular structures is sufficiently large that experimental exploration is impractical.Machine learning approaches have previously been used to predict the properties of candidate OPV materials [30][31][32] , and a recent study demonstrated that a gap still exists between models that consider XYZ coordinates and those based only on SMILES strings 32 .While chemical structures of candidate molecules can be rapidly enumerated (referred to as a molecule's 2D geometry), calculating atomic positions at a high level of theory is computationally prohibitive when screening millions of possible molecules.We therefore design a ML pipeline to predict optoelectronic properties (e.g.ε HOMO , ε LUMO , optical excitation energy) directly from a molecule's 2D structure, without requiring 3D optimization using DFT.We demonstrate that for the types of molecules considered in this study, MPNNs trained without explicit spatial information are capable of approaching chemical accuracy, and show nearly equivalent performance to models trained with spatial information.Moreover, we show that weights from models trained on one DFT functional are able to improve performance on an alternative DFT functional with limited training data, even when the two target properties are poorly correlated.This application demonstrates that high-throughput screening of molecular libraries (in the millions of molecules) can be accomplished at chemical accuracy quickly with machine learning methods without the computational burden of DFT structure optimization.Additionally, these results indicate that the best neural network architectures trained on existing small-molecule quantum chemical datasets may not be optimal when molecular sizes increase.We therefore make the newly developed OPV dataset considered in this work (with both 2D and 3D structures) publicly available for future graph network architecture development.

A. Dataset preparation
The database considered in this study contains calculations performed with several DFT functionals and basis sets (denoted functional/basis below) using the Gaussian 09 electronic structure package with default settings 33 .A web interface to the database is available at [anonymized].The structures consist of combinations of building blocks, largely single and multi-ring heterocycles commonly found in OPV applications 2,27,28 .The database is primarily focused on quantifying the behavior of polymer systems, and therefore calculations were performed at a range of oligomer lengths to extrapolate to behavior at the polymer limit 34 .Two datasets were extracted from the database by selecting entries performed with the two functional/basis combinations with the greatest number of calculations, B3LYP/6-31g(d) and CAM-B3LYP/6-31g.Each dataset consists of monomer structures, with or without 3D structural information, and associated DFT-calculated optoelectronic properties.Molecular structures were encoded using SMILES strings 35 , optimized 3D coordinates (when used) were stored in SDF files.The specific electronic properties we predict are: the energy of highest occupied molecular orbital for the monomer (ε HOMO ); the lowest unoccupied molecular orbital of the monomer (ε LUMO ); the first excitation energy of the monomer calculated with time-dependent DFT (gap); the spectral overlap (integrated overlap between the optical absorption spectrum of a dimer and the AM1.5 solar spectrum); In addition to these properties, we also predict electronic properties that have been extrapolated to the polymer limit, including the polymer ε HOMO , polymer ε LUMO , polymer gap, and the optical ε LUMO (sum of the polymer ε HOMO and polymer gap).In addition to polymers, the database also contains soluble small molecules for solution-processable OPV devices 36,37 .As these molecules are not polymerized, these entries lack information on extrapolated polymer electronics.These entries were included in the training set, but excluded from the validation and test sets.
In order to screen a larger number of molecules, conformational sampling of each molecule was not performed; instead a single optimization was performed for each molecule or oligomer.The primary B3LYP/6-31g(d) dataset consists of approximately 91, 000 molecules with unique SMILES strings, approximately 54, 000 of which contain polymer properties.Of these 54, 000 with polymer properties, 5, 000 were randomly selected for each of the validation and test sets.Transfer learning was examined with a secondary dataset consisting of results from the CAM-B3LYP/6-31g functional.This dataset consists of approximately 32, 000 unique molecules, 17, 000 of which contain polymer results.From the 17, 000 with polymer properties, 2, 000 were selected for the validation and test sets.For both datasets, training data were randomly selected from the remainder small molecule and monomer results.Prior to prediction, each property is scaled to have zero median and unit inner quartile range (followed by an inverse transformation after prediction).
Determining an appropriate optimal (or target) error rate that is representative of a best-case validation loss is an important step in optimizing the hyperparameters of a ML pipeline.In previous studies, target errors were determined based on estimated experimental chemical accuracies for each of the regression tasks 6,15 .However, since many of these parameters are not directly measurable experimentally, we sought to determine a target error directly from the data.We therefore used calculation results from conformational isomers: molecules with identical connectivity but different 3D structure.Due to the size of the considered molecules, energy minimization routines can often converge to different lowest-energy states, with slightly altered optoelectronic properties.Since our model only considers atomic connectivity, it cannot distinguish between conformational isomers and predictions for molecules with identical SMILES strings will yield identical predictions.By iterating over all pairs of conformers in the dataset, we calculate a mean absolute error (MAE) to establish a representative lower limit for predictive accuracy for a model that does not consider 3D atom positions.These optimal errors are presented in Table I.

B. Message Passing Architecture
The molecules considered in this study and used as building blocks for OPV polymers are relatively large, with a maximum size of 201 atoms and 424 bonds (including hydrogens).Inputs to the neural network are generated from the molecules' SMILES strings, and consists of discrete node types, edge types, and connectivity matrix.Atoms are categorized into discrete types based on their atomic symbol, degree of bonding, and whether or not they are present in an aromatic ring.Bonds are similarly categorized into discrete types based on their type (single, double, triple, or aromatic), conjugation, presence in a ring, and the atom symbols of the two participating atoms.
A schematic of the neural network is shown in Figure 1.The message passing step was implemented using the matrix multiplication method 15,38 , where messages m are passed between neighboring atoms, where v is the node index, N(v) are the neighboring nodes, e vw is the bond type, h t v is the feature vector for node v at step t, and A e vw is a learned weight matrix for each bond type.
The update step was implemented as a gated recurrent unit block 15 , Initial atom embeddings, h 0 v , are initialized randomly for each atom class and learned as additional model parameters.The dimension of the atom state was chosen to be 128, with M = 3 message-passing layers.The readout function used was similar to the one used by Duvenaud et al. 14 , but uses only the fi-nal hidden state of the recurrent atom unit to generate a wholegraph feature vector ŷ: where W is a learned weight matrix.The dimension of ŷ was chosen to be 1024.This summed fingerprint is then passed through a series of two fully connected layers with batch normalization and ReLU activation functions (dimensions 512 and 256, respectively), before being passed to an output layer corresponding to each property prediction.When 3D molecular geometries were considered, the SchNet structure with edge updates from Jørgensen, Jacobsen, and Schmidt 20 was used.A nearest-neighbor cutoff of 48 was used to determine the connectivity matrix of passed messages.The dimension of the atom hidden state was chosen as C = 64, and separate models were trained for each of the eight target properties.As the targets are mainly orbital energies, we similarly use a average in the readout function.SchNet-like models were trained with the ADAM optimizer with an initial learning rate of 1E − 4, and a decay rate of 1E − 5 per epoch.The models used a batch size of 32 and were trained for 500 epochs.

C. Software
Message passing operations were implemented using Keras and Tensorflow.
Scikit-learn was used to scale the prediction targets, and rdkit was used to encode the atoms and bonds as integer classes.A python library used to implement the MPNNs described in this study is available on Github (github.com/nrel/nfp) and installable via pip.All datasets, model scripts, and trained model weights for the models described in Table I are available at https://cscdata.nrel.gov/#/datasets/ad5d2c9a-af0a-4d72-b943-1e433d5750d6.

D. Hyperparameter optimization
For the 2D model, model sizes (atom vector dimension, molecule vector dimension, number and size of dense layers) were increased until training errors fell below the target optimal error rate while the model still fit on single GPU (Tesla K80) with a batch size of 100.Models were optimized using the ADAM optimizer.Learning rates were varied between 1E-2 and 1E-5, with 1E-3 yielding the best result.Explicit learning rate decay was also noticed to improve optimization, a decay value of 2E-6 each epoch was used.Models were trained for 500 epochs.Methods for explicit regularization, including dropout and l 2 schemes were tried, but did not decrease the validation loss.All models (including refitting weights during transfer learning) used early stopping by evaluating the validation loss every 10 epochs and using the model which yielded the lowest validation loss.1.Schematic of the message passing framework for 2D structures.Input molecules are labelled according to their atom and bond types.Atom embedding layers are used to initialize the weights of the message passing layers.Molecule-level feature vectors are generated through an output layer which pools all atoms through summation, which is then passed to a series of dense layers to generate a final prediction.Dimensions of each layer for the multi-task model are shown in gray.For single-task models all dimensions are identical except for the final output layer, which has dimension 1.

A. Prediction performance on B3LYP/6-31g(d) results
The largest database consists of calculations performed at the B3LYP/6-31g(d) level of theory.By comparing calculation results for molecules with identical SMILES strings but different 3D geometries, a baseline error rate was established for models that only considered SMILES strings (2D features) as inputs.This error rate was relatively low: for ε HOMO , the mean absolute error (MAE) between pairs of conformers was 28.0 meV, lower than both the target "chemical accuracy" of 43 meV used in Faber et al. 6 and the MAE reached by the current best-performing model on the QM9 dataset, 36.7 meV 20 .
Two strategies were used to train models using only 2D coordinates.First, a series of models were trained for each property (Table I, "2D, single-task").These models were capable of closely matching DFT results, with MAEs in orbital energies approximately 10 meV higher than the calculated optimal error.These errors, 32.1 meV for ε HOMO , are lower than state-of-the-art models on the QM9 dataset, suggesting 2D connectivity is sufficient to specify molecular properties for these types of molecules.Next, a single model was trained to simultaneously predict all eight target properties ("2D, multitask").This model greatly improves prediction speed while demonstrating similar error rates to the single-task models.
For comparison, models were also trained using DFToptimized 3D coordinates.The MPNN structure of these models were adapted from that of Jørgensen, Jacobsen, and Schmidt 20 , and a single model was trained for each target property.Resulting error distributions were similar to those of models trained on only 2D coordinates (Table I, "3D, DFT"; Figure 2).The similarity in error distributions between models which consider 3D and 2D further indicates that for the molecules considered in this database, 2D structural information is sufficient to specify optoelectronic properties.Errors for the 3D model were smaller for monomer and dimer properties (gap, ε HOMO , ε LUMO , spectral overlap), while slightly larger for extrapolated polymer properties.This effect may suggest that polymer properties are less dependent on the monomer's precise 3D configuration.Approximate 3D coordinates can be computed rapidly using empirical force fields, for instance the UFF force field 39 .Molecules in the dataset were re-optimized using the UFF force field, in order to determine approximate 3D coordinates at a much lower computational cost.Models were then retrained using these approximated geometries.The resulting prediction accuracies were worse than even the 2D models, indicating that using poor-quality molecular geometries gives worse results than omitting 3D features (Table I, "3D, UFF").
We next explored the effect of training set size on prediction accuracy for models trained on 2D structures.Repeated optimizations of the multi-task model were performed with sub-sampled training data with the validation set, test set, and FIG. 2. Distributions in prediction error for held-out data.Distributions in prediction errors for test-set molecules from each model summarized in Table I.Histograms in differences in calculated values between pairs of conformational isomers are shown in gray.Lines represent kernel density estimates for prediction errors from each model.model architecture held constant across all experiments.As expected, additional training data causes out-of-sample predictive performance to improve, shown in Figure 3A.The model's accuracy asymptotically approaches the optimal error rate at the largest training set sizes.

B. Transfer learning to an alternate DFT functional
Finally, we examined whether the molecular representations learned from the large-scale B3LYP/6-31g(d) dataset improved predictive performance on a related regression.Endto-end learning models perform two tasks: they extract salient features from the input data and recombine these features to generate a prediction.Inside the network, higher level representations of the data are produced by subsequent layers before ultimately leading to a predicted value.Transferring weights to a new model from a model trained on a closely correlated target can therefore preserve much of the logic and higher-level representations of the previous model.However, even transferring weights from a poorly correlated target can aid models by preserving low-level features useful for both targets.
To test the effectiveness of transfer learning with the proposed MPNN structure, a second, smaller dataset of polymer band gap values calculated using the CAM-B3LYP/6-31g functional was used as a benchmark task.Two models trained on B3LYP/6-31g(d) data were used to initialize the weights for a new polymer band gap prediction model: first, a model trained on the same parameter calculated via B3LYP/6-31g(d), and, as a more difficult example, a model trained on the B3LYP/6-31g(d) monomer band gap.Correlation coefficients were used as a measure of the similarity between the old and new prediction targets.The correlation coefficients between the CAM-B3LYP/6-31g polymer band gap and B3LYP polymer and monomer band gaps were 0.93 and 0.48, respectively, for molecules present in both the CAM-B3LYP/6-31g and B3LYP/6-31g(d) datasets Figure 3C.
Test and validation sets of 2,000 polymer species were reserved, and the remaining data was sub-divided into training sets of increasing size.All transfer learning strategies were compared against a reference model with random weight initialization for all layers (i.e., no transfer learning).The results of all model predictions on the test set are shown in Figure 3B.For each model, performance is compared to an estimated upper-bound error.For the reference model this error was equal to the data's standard deviation, assuming a worst-case model would always predict the mean value of the prediction target.For the models with transferred weights, upper-bound errors were found assuming new targets were calculated by linearly transforming the old prediction target to best match the new target.The root mean squared error (RMSE) for these two base-case models were calculated as 360 meV and 150 meV for B3LYP/6-31g(d) monomer band gap and polymer band gap, respectively.For models with weight transfer, performance superior to these estimated upper error limits indicates that the model has retained the ability to extract and process salient features of the molecules related to the new prediction target -rather than simply recalling and rescaling the previously learned output.
For very small training set sizes (on the order of 200 molecules), models performed near the estimated upper error bound with the notable exception of the model with weakly correlated transferred weights, which had a substantially lower test set error than expected.This result demonstrates that pretraining models on even slightly related prediction targets could likely improve out-of-sample prediction accuracy when the available data is limited by allowing the MPNNs to learn useful molecular features.As the available training data is increased, both models with transferred weights demonstrate a concomitant decrease in their test set error below their estimated upper bound error.In particular, the model with weights transferred from the strongly correlated task shows superior performance at all training set sizes, requiring nearly an order of magnitude less data to reach RMSE values of 100 meV.At the largest training set sizes all three models approach the optimal error rate (estimated through conformers with duplicated SMILES strings), indicating that knowledge encapsulated in transferred weights is eventually replaced with knowledge gained through the new training data.

IV. CONCLUSIONS
In this study, we have demonstrated near-equivalent prediction accuracies from both 2D and 3D structural features in MPNN architectures, both of which closely approach the estimated 2D lower-bound error from conformational optimization.While studies on the QM9 dataset have shown that 3D coordinates are required for accurate predictions, using these data as inputs mandates that full DFT calculations still be performed for each molecule.The necessity of 3D coordi-nates for the QM9 dataset might be explained by the substantially smaller molecules considered (≤ 29 atoms, including hydrogens) when compared with our newly generated OPV database (≤ 201 atoms).Additionally, since they are exhaustively generated according to computational rules, molecules in QM9 frequently contain complex structural features that might only be captured through the explicit use of 3D coordinates.Our new public database might therefore serve as a more representative molecular learning benchmark for electronic structure calculations.
We have shown that a deep neural network pretrained on one DFT functional was able to improve predictive performance on a related DFT functional, especially in the case of limited data.This performance improvement is dependent on the correlation between tasks, but even weights transferred from a network trained on a weakly correlated task were able to improve accuracy.These results help to confirm the immense value of machine learning approaches in scientific domains both to increase the fidelity of DFT simulations and to augment them, allowing for high throughput screening and guided search.Future work will therefore explore the ability of pretrained neural networks to improve prediction accuracy on experimental data and other important targets with limited available data.

A B3LYP/ 6 -
FIG. 3. Effect of training set size on predictive performance (A) Training on B3LYP/6-31g(d).Models gradually approach the optimal error rate as training set size increases.(B) Transfer learning to predict polymer band gap calculated with CAM-B3LYP/6-31g.For each model, performance is compared to both the optimal error rate and an estimated upper error bound based on a simple linear model (dotted lines).(C) Illustration of the similarity between old and new prediction tasks considered during transfer learning.Plot of CAM-B3LYP/6-31g polymer band gap (new) versus the single-target properties used for pretraining: monomer band gap (left) and polymer band gap (right).Points represent molecules with results calculated via both functionals.

TABLE I .
Mean absolute errors (MAEs) for test set predictions for models trained on B3LYP/6-31g(d) results.The conformers column reports MAE between calculations for pairs of conformational isomers, representing an optimal error rate for models trained on 2D coordinates.Distributions of prediction errors are shown in Figure2.