Open-access Machine Learning Prediction of the Most Intense Peak of the Absorption Spectra of Organic Molecules

Abstract

Accurate knowledge of electronic molecular properties of excited states is fundamental for understanding the behavior of functional materials for organic electronics and sensors. In this work, we focus on determining the properties of the most intense peak in the electronic absorption spectra of organic molecules. For this purpose, we employed the quantum chemistry QM-symex dataset, which has approximately 173,000 organic molecules and time-dependent density functional theory (TD-DFT) data of the first ten electronic absorption transitions. Each one is identified by its Cartesian coordinates. From data in the original QM-symex, we built a new dataset named QM symex-modif that contains molecules in simplified molecular input line entry system (SMILES) format and properties related to the main electronic transition. We then employed twenty machine learning (ML) algorithms to investigate oscillator strengths, excitation energies, transition orbitals, and the highest occupied molecular orbitals (HOMOs). As inputs for the ML algorithms, we used several chemical descriptors for each molecule generated in the RDKit tool employing the corresponding SMILES format. The generated input descriptors significantly improved the accuracy of the ML predictions for these key photophysical properties. Very good mean absolute errors (MAEs) were obtained for the test set composed of 45,056 molecules, namely, an MAE of 0.035 for oscillator strengths, 0.09 eV for excitation energies, 1.24 and 0.62 for the initial and final transition molecular orbital (MO) numbers (i.e., for each molecule, their position in the MO listing), respectively, and 0.014 for HOMO numbers, with coefficient of determination (R2) values consistently exceeding 0.94, thus demonstrating the accuracy of the models. Additionally, a Shapley additive explanation (SHAP) analysis was carried out to evaluate the importance of the input parameters for the investigated ML models. We found several interesting relationships involving the input parameters. In particular, molecular weight holds significant importance in our ML models for determining the target HOMO numbers and the transition orbitals.

Keywords:
machine learning; organic electronics; maximum absorption peak; excited state properties; QM-symex dataset; simplified molecular input line entry system (SMILES) format


Introduction

The electronic properties of molecules in their electronic excited states are important for various applications, including organic electronics, photovoltaics, and molecular sensors.1,2 Therefore, discovering new materials and molecules can open up new possibilities.3 However, the high costs of developing and producing unique materials and molecules and the vast range of possible combinations4 can make traditional approaches inefficient. Today, there are about 104 synthetic materials in use, a fraction of the 1080 possible chemical combinations in the chemical space.5

Since the discovery of new specific materials depends on identifying and quantifying absorption-related electronic properties,6 it is crucial to develop new approaches to accelerate this process. Accurate prediction of these electronic properties is then critical. In this context, the oscillator strength plays a pivotal role. The oscillator strength from the ground to the excited state must be nonzero for a transition to occur. In organic photovoltaics, this transition forms the exciton (electron-hole) pair, ultimately leading to the desired electric current. In fluorescence processes, according to Kasha’s rule, light emission preferably occurs from the first excited state (S1),7 although violations of this rule can occur.8 Therefore, ensuring that the highest oscillator strength among the manifold of molecular excited states (S1, S2, …) is accurately determined is essential, as this transition increases the likelihood of an efficient nonradiative transition to the fluorescent state S1.

In addition to oscillator strengths, excitation energies are a critical factor in photophysical processes.9 For instance, accurate excitation energy values are essential for predicting the behavior of fluorophores under different conditions and for designing molecules with desirable luminescent properties.10 Furthermore, the highest occupied molecular orbital (HOMO) and the main molecular orbitals involved in a transition provide detailed information about the changes in the electronic distribution during the excitation, thus allowing the optimization of molecular structures.11 By integrating these three factors, namely, oscillator strengths, excitation energies, and orbital information, more effective design and optimization of chromophores for specific applications can be achieved, thereby improving their performance and stability.

Typically, new materials are obtained by modifying existing ones based on chemical intuition.3 Although accurate quantum chemical computational methods are available for determining various molecular properties, widely used approaches like time-dependent density functional theory (TD-DFT)12 tend to describe localized states well but often represent charge-transfer states poorly.1 Furthermore, conducting an extensive screening of favorable molecules is computationally expensive. Even after this step, considerable time is required to synthesize and experimentally measure the properties of the candidate molecules.13

Given the challenge of approximating results and the high computational costs, many studies have combined chemical data with machine learning (ML) approaches. These studies are prominent in topics such as excited states of molecules,14-20 estimation of absorption and emission wavelengths in molecules,21-26 and molecular orbitals,27-30 among others. Our group has explored ML techniques to investigate the impact sensitivity of energetic molecules,31 building up a consistent new set of Hammett’s constants,32 materials for applications in organic electronic devices,33 and thermophysical properties of fluids in different thermodynamic phases.34

Several recent works on molecular excited states are closely related to the present study. Kang et al.3 investigated oscillator strengths and excitation energies using a dataset of 500,000 molecules sourced from PubChem35 and the random forest (RF) ML algorithm. Verma et al.19 similarly used only 100,000 molecules from the QM-symex36 dataset to compute excitation energies using message passing neural networks (MPNNs). Kim et al.37 built a database of 42,008 organic molecules whose structures and electronic properties were obtained using DFT and TD-DFT calculations. Tan et al.38 used 13594 molecules from their previous work39 to make excited state predictions using a deep neural network.

Some investigations40-47 on molecular excited states focus on developing datasets of chemical properties. Yet, these datasets often pose challenges for ML applications, in several cases due to their file formats.48 For example, QM-symex, which is used in this work, provides data on molecular geometries and various properties relating to the first ten excited states, where each molecule and corresponding properties are stored in a different file in Cartesian (xyz) format, totaling 173,000 different files in Cartesian format. This structure makes it challenging to access the information in the dataset through algorithms. As input for ML algorithms, this approach is less manageable than using just one file in standard formats such as CSV (comma-separated values) or JSON (JavaScript Object Notation). Furthermore, the former approach requires a much higher computational and storage cost due to the number of files to be opened. To overcome this limitation, we propose extracting different target properties from each Cartesian entry from the QM-symex dataset and aggregating them into a single CSV file. This new dataset simplifies the application of ML techniques to investigate excited state properties.

One important motivation was the work of Kang et al.,3 who also studied oscillator strengths and excitation energies using RF models. However, they did not perform overfitting checks or explore various ML models. Like them, we employ the simplified molecular input line entry system (SMILES)49 format to generate descriptors for the ML models using the RDKit tool.50 However, in contrast, we addressed these gaps by employing techniques that screen different ML algorithms, cross validation procedures, and hyperparameter tuning, which they did not utilize. These steps allowed us to evaluate the performance of the model and ensure robust predictions rigorously. Furthermore, by enhancing data accessibility and validation against overfitting, this work significantly contributes to the study of excited properties of molecules using ML models.

The Methodology section of this work presents the process for building a new dataset derived from QM-symex and its use in our ML models. In the Results and Discussion section, our approach is applied to the target properties, and the results are analyzed. Finally, the Conclusions section summarizes our findings and highlights the main contributions of this study.

Methodology

The original QM-symex dataset was generated using the TD-DFT quantum chemical methods to compute the electronic properties of the first ten singlet and triplet excited electronic states of about 172,736 molecules. The B3LYP/6-31G level of theory was used to generate the QM-symex dataset for these calculations. Each molecule is stored in a separate file in Cartesian (xyz) format.

We first build the new dataset derived from the QM symex database, named QM-symex-modif, as follows. After converting the Cartesian coordinate files of each molecule into a SMILES format and collecting the target properties in each case (the main transition orbital numbers of the singlet most intense electronic absorption peak, the corresponding oscillator strengths, and transition energies), we generate chemical descriptors as inputs for the ML models using the RDKit library.50,51 The SMILES representation simplifies complex hierarchical rules and procedures, often challenging for chemists, into a format easily processed by ML algorithms,49 and is required by RDKit. For any given chemical structure, the SMILES notation can have equally multiple valid forms, thus providing flexibility in the molecular representation. For instance, the caffeine molecule, with the chemical formula C8H10N4O2, can be represented in different SMILES formats, such as the two indicated in Figure 1, and several others.

Figure 1
Two possible SMILES representations of the molecular structure of caffeine. The representation is a linear notation that describes the molecular structure using ASCII (American standard code for information interchange) characters, thus allowing the structure information to be easily processed by computers. The first line presents a canonical SMILES form, while the second is a variation that accurately depicts the structure of the caffeine.

The RDKit library is used to obtain descriptors, fingerprints (digital representations or descriptors that capture the structural, physicochemical, or other properties of a molecule in a concise and numerical form), and many other molecular properties. This library can be combined with different programming languages, including Python,52 Java,53 and C#.54 The molecular properties from RDKit are used as input descriptors for our models (algorithms) in this work. Obtaining molecular descriptors and fingerprints from RDKit can be efficiently achieved using the SMILES representation.55

We focus on the HOMO numbering and the following electronic excited properties of the molecules: oscillator strengths (OS), excitation energies (EE), and the main initial (MOi) and final (MOf) transition molecular orbital numbers. For each molecule, these numbers represent the orbital position in the MO listing. The workflow is depicted in Figure 2.

Figure 2
Workflow used in the ML models of this work. (a) Construction of the QM-symex-modif database; (b) construction of X (descriptors) and yn (target properties) sets; (c) development of the machine learning models with X and y datasets.

The new dataset QM-symex-modif also allowed a comparison between the models developed in this work and others from the literature. The original QM-symex dataset collects electronic properties for the first ten electronic excited states of each molecule. In the QM-symex-modif dataset, information regarding the main transition orbitals is indicated by the initial (MOi) and final (MOf) molecular orbital numbers. For each molecule, these orbitals are written as a function of its HOMO and LUMO (lowest unoccupied molecular orbitals) numbers in the following way. The MOi is written as HOMO - x and MOf as LUMO + y, where x and y are integers, and HOMO and LUMO are numbered for a given molecule indicating their position in the ordering of the molecular orbitals, as mentioned above. Figure 3 illustrates a transition from HOMO - 1 (MOi) to LUMO + 1 (MOf). If for the molecule in Figure 3 HOMO = 100, we have HOMO - x = HOMO - 1 = 99 and LUMO + y = LUMO + 1 = 101.

Figure 3
Molecular electronic excitation process: HOMO-LUMO orbital diagram.

To convert from Cartesian coordinates to the SMILES format, we used Open Babel,56 available for Python. The process for building the QM-symex-modif has the following steps:

(i) convert each molecule from Cartesian format to SMILES with Open Babel;

(ii) extract from each QM-symex entry the properties used in the ML models of this work and some others and;

(iii) store the collected data entries in CSV format, which is more convenient than the original separate files for each molecule for use in ML models. For each molecule, save the SMILES representation and the values of the extracted molecular electronic properties in separate entries of a CSV file.

The QM-symex-modif dataset created in this way is in CSV format. It is structured with the following columns: SMILES representation of the molecule, HOMO numerical value, strongest transition number in the 1-10 interval, orbital symmetry, excitation energy, wavelength, oscillator strength, spin multiplicity, and main transition orbitals (MOi and MOf). This process is depicted in the workflow shown in Figure 2. The total number of molecules (entries) in the QM-symex-modif dataset is 172,736.

The ML algorithms in this work were implemented using the Scikit-learn library (Scikit-learn-1.4.0)35 in Python. Several molecular fingerprints were obtained using RDKit, among them are the MACCS50,57 keys with the GetMACCSKeysFingerprint function and the Morgan fingerprints with the GetMorganFingerprintAsBitVect function, also known as extended-connectivity fingerprints (ECFP).58 By default, MACCS fingerprints use a 167-bit vector. Detailed descriptions of these fingerprints can be found in the Supplementary Information (SI) section. The Morgan fingerprinting algorithm identifies and encodes circular substructures in the molecule, using them to set bits in a user-specified fingerprint length. After identifying all structures, the fingerprint may be compressed to achieve an appropriate bit density.49 A Morgan fingerprint function generated by RDKit is a 2048-bit vector by default, but smaller values can be chosen. Each bit of the generated vector will be used as an entry descriptor for the ML models. Therefore, to avoid making the input vector too long and demanding high computational power, a 512-bit vector was chosen for the GetMorganFingerprintAsBitVect function. Additionally, 54 other descriptors were selected and are listed in Table S1 (SI section). All descriptors we found in RDKit that could be employed with a SMILES format were used.

The entire descriptor generation process is performed before any ML algorithm is trained. In this process, data is divided into a set of molecular descriptors for each molecule collected in a X matrix, which are generated with the SMILES entries in RDKit as described above. The target properties for each molecule are collected in a yn vector, one for each property n. In our case, each yn contains the excitation energies, oscillator strengths, MOi, MOf, and the HOMOs numbers. Data rows that do not contain all target properties or have values abnormally high and dispersed from the rest of the data are eliminated. To finalize the data cleaning and the preprocessing of the new dataset, the X matrix of the descriptors is normalized using the StandardScaler tool from Scikit-learn. The training set includes the same descriptors for each SMILES representation (i.e., the same input properties for each molecule obtained from RDKit). Therefore, separate vectors were created for each of the selected target properties of QM-symex-modif, namely, OS (y1), EEs (y2), MOi (y3), MOf (y4), HOMOs (y5), and the X matrix having 733 columns for each molecule. In other words, each row of X has 733 descriptors (512 Morgan; 167 MACCs; 54 chemical RDKit descriptors, each listed in Table S1 (SI section)). Figure 2b shows the workflow for storing each molecule and its properties, a single entry in the X matrix of the descriptors. All this data is collected in the X matrix, and the yn vectors are erased when the ML algorithms finish. This occurs because these sets are assembled within the ML algorithm in Python, which, when finished, reset all variables, matrices, and vectors.

A preliminary screening of the ML models was carried out using the Lazy Predict59 tool from Scikit-learn. This tool is based on the quantification of uncertainties employing well-established metrics. We set Lazy Predict to test 41 ML algorithms for each target property (OS, EE, MOi, MOf, HOMO); because of library-related errors, only 39 were tested for the HOMO property. The results of these trials are collected in Tables S2 to S6 (SI section). Due to computational limitations, it was impractical to train all models; for this reason, we chose the three models that most frequently (the first three in the list are not always the same models) appeared among the top ten results for all five target properties. The selected three algorithms used in this work were the extra trees,35 random forest (RF),60 and extreme gradient boosting (XGB).61

The Lazy Predict tool pre-screened traditional ML algorithms such as linear regression, decision trees, and random forest. Therefore, for comparison purposes, we also used an artificial neural network (ANN) model, which bears a different architecture from those used.62 The ANN model consists of interconnected units called artificial neurons, which process information across multiple layers. In our case, we selected a feedforward neural network of the multilayer perceptron (MLP) type because it is well-suited for handling complex relationships between input features and target properties.63 Each target property was investigated using the four ML algorithms.

In total, 20 models (4 for each target property) were trained using the molecules in the QM-symex-modif dataset, to predict the five target properties. After the data preprocessing and transformation stage described above, the dataset did not retain its original size, as some molecules could not be interpreted by RDKit or, in some cases, lacked some target properties in the original QM-symex dataset. From the processed QM-symex-modif dataset, the extra trees, random forest, and XGB algorithms were trained with a 70% data split for training and 30% for testing. The test set size was chosen to ensure enough data for testing and to maintain the quality of the results.

For the ANN model, a split of 70% was used for training and 30% for testing. Within the training set, 30% of the data was reserved for validation, thus creating a distinct validation set not used during training, ensuring that the validation process does not directly interfere with the fit of the model. This validation set was essential for fine-tuning the hyperparameters of the ANN and monitoring the performance of the model during training to avoid overfitting (i.e., an algorithm fits very well the training data but gives poor results in the test set).

The Scikit-learn GridSearchCV tool was used for the extra trees, random forest, and XGB models to determine the optimal hyperparameters for each algorithm. The GridSearchCV tool optimized the hyperparameters for each model based on evaluation metrics. The hyperparameters for the ANN were tested without automatic tools, testing the hyperparameters little by little and manually. Following the identification of the best algorithms, cross-validation with three folds (i.e., three different splits within the training set) was applied, and we evaluated the models using the following metrics: MAE (mean absolute error), MSE (mean squared error), RMSE (root mean squared error), and R2 (coefficient of determination), defined in SI section. A 3-fold cross-validation was chosen over a 10-fold (typically present in GridSearchCV tests) to balance computational efficiency with model robustness, especially given the size and complexity of the dataset. While a typical 10-fold cross-validation can offer a more thorough evaluation, the increased computational cost and training time involved are excessive relative to the expected gains in performance. These error metrics are used because similar values across the training, validation, and test sets indicate good model generalization and the lack of overfitting.64 Additionally, the Shapley additive explanations (SHAP)65 technique, originated in game theory, was used to assess the impact of each descriptor on the final outcome, thus providing explainability for the results. Figure 2c shows the workflow for evaluating the proposed ML models.

Results and Discussion

QM-symex-modif dataset

From the original dataset with about 173,000 molecules with different files in Cartesian format, a single file for the new QM symex-modif dataset contains 172,736 SMILES strings and 11 properties (1 SMILES entry corresponds to a set of Cartesian coordinates of a single molecule). Among these properties are the values of OS, EE, transition orbitals (MOi, MOf), and HOMO numbers for each molecule. The other properties were included for potential users of the QM symex modif dataset. The computer code developed to create this new molecular dataset and the generated QM symex modif dataset can be found in our GitHub repository.66

Figure 4 illustrates the distribution of the target properties collected in the QM-symex-modif dataset. In the new dataset, only 13 and 8% of the molecules display oscillator strengths greater than 0.5 and 1.0, respectively, while approximately half of the molecules have a maximum oscillator strength below 0.15 (Figure 4a). Approximately 93% of the molecules have excitation energies less than 5 eV (Figure 4b). The distribution of the MOi (Figure 4c), MOf (Figure 4d), and HOMO (Figure 5) values show that most orbitals have values in the range of 150-450.

Figure 4
Distribution of values of different molecular properties in the QM-symex-modif dataset. (a) Oscillator strength values; (b) excited state energy, in eV; (c) initial molecular orbital (MOi) values and (d) final molecular orbital (MOf) values.

Figure 5
Distribution of HOMO numbers in the QM-symex-modif dataset.

Error metrics

Tables S7 to S10 (SI section) display the test set, and for the four algorithms, the MAE, MSE, RMSE, and R2 error metrics for each target property. Overall, the results are outstanding and similar for all models. In our case, the error metrics for the test set show that random forest yields the best results; the exception is the HOMO which is better predicted by the ANN. Therefore, from now on, we will discuss the results of the RF and ANN models.

During the training step, to verify that the RF and ANN models are not overfitting, as mentioned above, we carried out for the four target properties (EE, OS, MOi and MOf) cross-validation with 3 folds, and for the ANN model, the metric of the validation set was evaluated. For these four properties, the error metrics are low and show very similar metrics for each fold in the RF model, an indication that the model is not overfitting the training data (Tables S11 to S14, SI section). The results for the 20 (4 for each target property) models initially investigated are collected in our GitHub repository.66

For the test set, Figures 6 and 7 plot the original values from the QM-symex-modif dataset and the predicted values using the RF model for the OS, EE, MOi e MOf properties, and ANN for the HOMO numbers. The corresponding error metrics are in Tables S7 to S10 (SI section).

Figure 6
Original values from the QM-symex-modif dataset and the predicted values using the RF model. (a) Oscillator strengths; (b) excited state energy, in eV; (c) initial molecular orbital (MOi) values and (d) final molecular orbital (MOf) values.

Figure 7
Original HOMO numbers from the QM-symex-modif dataset and the predicted values employing the ANN model.

We now discuss the error metrics for the test set and the RF for the first four properties. For the OS (Figure 6a), we obtained the following metrics (values between parentheses): MAE (0.035); MSE (0.007); RMSE (0.085); and R2 (0.94). For EE, the error metrics for the test set of our RF model (Figure 6b) were: MAE (0.09 eV); MSE (0.02 eV); RMSE (0.16 eV); and R2 (0.97 eV). For the transition MOi and MOf (Figures 6c and 6d), the metrics were respectively: MAE (1.24); MSE (3.76); RMSE (1.94); R2 (0.99) and MAE (0.62); MSE (1.22); RMSE (1.11); R2 (0.99). Now for the HOMO numbers of the test set (Figure 7) and the ANN model, the error metrics are very good: MAE (0.014); MSE (0.018); RMSE (0.14), and R2 (0.99).

Comparison with the literature

Some previous works used similar ML techniques, so we compare error metrics from their model with our RF approach. Tan et al.38 employed 13,594 molecules from the thermally activated delayed fluorescence (TADF) dataset38 in a neural network, reporting the following error metrics for their predictions of absorption OS: R2 (0.75), RMSE (0.0847), and MAE (0.0575). They used a much smaller dataset and, as input, a vector of nuclear charges and the interatomic distance matrix in their machine learning model. Kang et al.,3 employing the original QM symex dataset, reported an RMSE of 0.084 for their RF model, while our work produced an RMSE of 0.085. Our work uses a 733-dimensional input vector, while Kang et al.3 used a 4301-dimensional input vector and a database from PubChem35 2.9 times larger than ours for oscillator strengths (see Table S15 in SI section).

Verma et al.19 employ a directed message passing neural network (MPNN) machine learning model trained to calibrate the results of excited state energies. Their MAE value was 0.14 eV for a dataset of only 100,000 molecules from the original QM-symex compared to ours (0.09 eV) for the larger dataset of 172,736 molecules including those 100,000.67 Kang et al.3 reported RMSE values for a dataset of 500,000 molecules sourced from PubChem35 using as input a vector with 4,301 RDKit chemical descriptors (Morgan fingerprint 4,096 bits + 166 MACCS keys + 39 descriptors). They found an RMSE value of 0.43 eV compared with ours of 0.16 eV, roughly proportional to the size of the data set. Kim et al.,36 with a dataset of only 42,000 molecules, also using RDKit chemical descriptors, have an MAE value of 0.14 eV, slightly higher than ours that employs a dataset more than four times larger. This data can be seen in Table S16 (SI section). The comparison was made only for the OS and EE properties because, as far as we know, we did not find works dealing with the ordering (numbering) of the molecular orbitals, only with their energies.

Visual meaning analysis

We now examine in our QM-symex-modif dataset the effect of specific molecular descriptors on the target properties. For this purpose, we used the RF feature_importances tool in Scikit-learn to identify attributes of greater importance using a visualization approach, which involves plotting graphs involving two to three attributes. Through an image analysis (i.e., visual meaning), we selected those plots where it was possible to observe some relationship between the attributes (in the remaining ones, randomness in the combination of data was observed).

In the QM-symex-modif dataset, we found that molecules with a molecular weight (ExactMolWt descriptor) exceeding 1500 atomic mass units (amu) have an almost zero probability of displaying an oscillator strength greater than 0.5 (Figure S1a, SI section). When considering the number of heavy atoms in a molecule (CalcNumHeavyAtoms descriptor - number of non-hydrogen atoms), molecules containing 60 or more heavy atoms have a nearly 100% chance of having oscillator strength values below 0.5. Those with more than 35 heavy atoms generally exhibit oscillator strength below 1.0 (Figure S1c). Regarding the number of rotatable bonds in a molecule (CalcNumRotatableBonds descriptor), there is a nearly 100% likelihood that molecules with 11 or more rotatable bonds will have an oscillator strength below 1.0. This is likely due to the increased conformational flexibility introduced by a higher number of rotatable bonds, which can lead to a less rigid structure. A more flexible molecule may experience greater vibrational and rotational relaxation, thereby reducing the likelihood of a strong electronic transition and resulting in a lower oscillator strength (Figure S1b). We also noted that when the number of these bonds is odd, the oscillator strength values tend to be lower than 0.9, with the maximum values slightly lower than those of the preceding and succeeding even-numbered bonds. By combining the determination of heavy atoms with molecular weight, it becomes apparent that molecules in the QM-symex-modif dataset with the highest oscillator strengths typically have a molecular weight below 1100 amu and fewer than 30 heavy atoms (non-hydrogen atoms) (Figure S1d). This type of comparison between two or more characteristics has not been done in previous works in the literature.

For the excitation energy of the most intense transition, we found an interesting correlation between two topological connectivity indices, chi0 and chi1. This kind of comparison of a pair of descriptors was done for several pairs; we discuss here those that have a visual meaning. The chi0 and chi1 descriptors correspond to the so-called Kier Hall connectivity indices,68 which reflect how atoms are interconnected within a molecule, forming a network or graph where atoms are the nodes and bonds are the edges. These indices use this topological information to quantify structural aspects of the molecule that may influence its chemical and biological properties. Molecules in the QM-symex-modif dataset with the chi0 and chi1 values below 20 units and 12 units, respectively, have the highest excitation energies for the largest oscillator strength (most intense transition) among the first ten singlet excited states (Figure S2a, SI section). Furthermore, molecules with fewer than 15 heteroatoms (atoms other than carbon and hydrogen - described by the heteroatoms descriptor) and a BCUT2D[1] value (a low value indicates that a molecule has a low molecular mass) near ten often display excitation energies above 6 eV (Figure S2b). This behavior can be explained by the fact that a limited number of heteroatoms minimizes disruptive influences on the carbon-hydrogen bonding network, thus allowing for higher energy states.69 For the molecular orbital target properties, no visual correlations were found.

SHAP analysis

We applied the SHAP65 analysis to the RF model to rationalize the results by examining 300 molecules selected randomly. Several tests were necessary to choose this number of molecules for SHAP analysis due to the limited available computational resources.

For the oscillator strength target property, SHAP plots are depicted in Figures S3 and S4 (SI section), and an illustrative SHAP plot can be seen in Figure 8. The MACCS_165 fragment descriptor has the highest absolute impact on the model (0.09), thus indicating that the presence of any cyclic structure within a molecule, regardless of ring size or specific type, has an appreciable contribution for optical oscillator strengths in the RF model. These structures, particularly the aromatic ones, favor the delocalization of π-electrons, thereby increasing the likelihood of electronic transitions, which can enhance the oscillator strength. The second most impactful descriptor was MACCS_99 (0.07), which indicates the presence of a double bond between two carbon atoms, i.e., an ethylene group (C=C). Double bonds in molecules, such as an ethylene group, can promote electronic conjugation along the carbon chain, leading to a more delocalized charge distribution. This delocalization, as just said, can increase the likelihood of electronic transitions that enhance energy absorption and, consequently, the oscillator strength value. The third most impactful descriptor was CalcNumRings (0.06), the number of rings in a molecule. Molecules containing rings tend to have more rigid and less flexible structures, which favors electron delocalization, thus increasing the oscillator strength values.

Figure 8
Beeswarm plot of the RF model for the oscillator strengths.

The SHAP plots for the excitation energies are shown in Figures S5 and S6 (SI section). When we analyze the excitation energies modeled by the RF model, we found that the molecular descriptor HeavyAtomMolWt, representing the sum of the atomic weights of the heavy (non-hydrogenic) atoms in a molecule, had the most significant impact, with an absolute effect of 0.19. Due to the presence of heavy atoms, molecules with higher molecular weights tend to have more complex structures and a more significant number of modes for energy absorption and transfer, thereby influencing excitation energies and other critical molecular properties. The second most contributing descriptor was the molecular fragment Morgan_273, depicted in Figure 9a. If this fragment is part of a conjugated or aromatic system, it can significantly affect the excitation energy due to electron delocalization. The third most impactful descriptor for the excitation energies was CalcNumAliphaticCarbocycles, which counts the number of aliphatic carbocyclic rings present in a molecule. The presence of these rings can modify the frontier orbital energies, directly affecting the excitation energy.

Figure 9
Morgan molecular fragments. (a) Fragment Morgan_273: the central atom in the environment is represented in blue with one double bond and two single bonds. (b) Fragment Morgan_182: the central atom in the environment is represented in blue with one double bond and one single bond.

The SHAP plots for the MOi and MOf orbitals of the 300 molecules (Figures S7 to S10, SI section) for the RF models showed that the input attribute with the largest impact in this case is ExactMolWt (68.37 for MOi and 68.69 for MOf), which gives the exact molecular weight of a molecule. Variations in molecular weight led to differences in size and structure, which can directly influence the energy levels and separation between molecular orbitals, such as the HOMO and LUMO. The absolute importance of this molecular weight vastly outweighs almost all other descriptors, making it the predominant factor used in the RF regression models.

For the HOMO numbers investigated with the ANN model, SHAP plots are depicted in Figures S11 and S12 (SI section). The SHAP absolute values indicate that the input attribute (descriptor), among the 733 investigated, with the largest impact on the predicted values is ExactMolWt (32.4), which gives the exact molecular weight of a molecule. Variations in molecular weight lead to differences in size and structure, which can directly influence the result. Figure S13 (SI section) provides a compelling graphical analysis, revealing that the HOMO number closely correlates with the molecular weight divided by four. This relationship underscores the significant role of the HOMO number in the performance of the model. The alignment between predicted and observed results further highlights the accuracy and reliability of the model. The HeavyAtomMolWt descriptor was the second attribute with the more significant impact on the HOMO (27.51); it gives the molecular weight considering only heavy atoms excluding the hydrogens. This value can also be directly related to the total molecular weight. The third most significant impact was CalcLabuteASA (9.27), an attribute that returns the solvent-accessible surface area calculated by Labute’s method.70

Transition orbital prediction

By combining the HOMO, MOi and MOf numbers, we could predict the x and y values for the transition orbitals of the most intense absorption peak, namely, MOi = HOMO - x and MOf = LUMO + y. Therefore, using only the SMILES of a molecule, our models predicted for the most intense absorption peak the (output) values of HOMO, LUMO, EE, OS, MOi, MOf, x, and y, illustrated in Figure 10.

Figure 10
Example of a complete forecast using the top 5 models.

Conclusions

In this work, we investigated the prediction of critical photophysical properties of molecules, namely, oscillator strengths, excitation energies, and transition orbitals of the most intense singlet transition using different machine learning models. For this purpose, we modified the QM symex dataset, which contains approximately 173,000 molecules, each stored in a separate file, with detailed information on their electronic transitions. The new dataset has the following information: for each molecule in SMILES format, the orbital symmetry, the excited state number (the most intense absorption transition among the first 10 singlet), excitation energies, oscillator strengths, main transition orbitals, wavelength, spin, and the HOMO numerical value.

For each molecule of the 173,000 set, we used the SMILES format in the new dataset to generate with the RDKit tool 733 (512 Morgan + 167 MACCs + 54 chemical descriptors) descriptors and fingerprints to use in different machine learning models. By computing different error metrics, we selected three machine learning algorithms (extra trees, RF e XGB), and a neural network, to predict the five photophysical properties for a test set with 45,056 molecules extracted from the 173,000 set. The computed error metrics indicate that our results are more or similarly accurate compared to other studies using datasets of varying sizes ranging from 13,594 to 500,000 molecules. The best prediction of the excited state properties (oscillator strengths, excitation energies, and transition orbitals) was achieved with the RF algorithm. In contrast, for the HOMOs, the best model was an ANN.

The chemical descriptors generated with the RDKit tool combined with the SHAP analysis allowed us to rationalize the results. In particular, we found that the molecular weight is of appreciable importance in determining the value of the HOMOs and the transition molecular orbitals in the developed models. For oscillator strengths and excitation energies of the most intense transition, the SHAP analysis showed that Morgan fingerprints and MACCS are as important or better than other descriptors, such as the heavy atom weight.

Using only a molecule in SMILES format, we showed in this work that it was possible to accurately predict the five photophysical properties with the constructed models employing the developed new QM-symex-modif dataset. This was only possible because the algorithms used RDKit to automatically create the descriptors used as inputs for the models.

Supplementary Information

Supplementary information is available free of charge at https://jbcs.sbq.org.br as a PDF file.

Data Availability Statement

The source code for this work, the parameters for the ML models, the input files, the SHAP values, and the output examples can be found in our Github repository, available at: https://github.com/Quimica-Teorica-IME/Ml-pred-of-electronic-excited-state-properties-of-molecules.

Acknowledgments

I. B. thanks the Brazilian agencies Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, grant No. 304148/2018-0 and 409447/2018-8) and Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ, grant No. E-26/204.294/2024) for funding this research. R. C. S. thanks Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for a PhD scholarship. Support for this research also came from National Institute of Science and Technology on Molecular Sciences (INCT-CiMol) grant CNPq 406804/2022-2, and Nano Network grant FAPERJ E-26/200.008/2020. We thank Dr Nathália M. P. Rosa for useful discussions.

References

  • 1 Dral, P. O.; Barbatti, M.; Nat. Rev. Chem. 2021, 5, 388. [Crossref]
    » Crossref
  • 2 Yang, X.; Zhang, J.; Yoshizoe, K.; Terayama, K.; Tsuda, K.; Sci. Technol. Adv. Mater. 2017, 18, 972. [Crossref]
    » Crossref
  • 3 Kang, B.; Seok, C.; Lee, J.; J. Chem. Inf. Model. 2020, 60, 5984. [Crossref]
    » Crossref
  • 4 Kirkpatrick, P.; Ellis, C.; Nature 2004, 432, 823. [Crossref]
    » Crossref
  • 5 Lilienfeld, O. A.; Müller, K. R.; Tkatchenko, A.; Nat. Rev. Chem. 2020, 4, 347. [Crossref]
    » Crossref
  • 6 Kim, E.; Lee, Y.; Lee, S.; Park, S. B.; Acc. Chem. Res. 2015, 48, 538. [Crossref]
    » Crossref
  • 7 Valle, J. C.; Catalán, J.; Phys. Chem. Chem. Phys. 2019, 21, 10061. [Crossref]
    » Crossref
  • 8 Braun, G.; Borges, I.; Aquino, A. J. A.; Lischka, H.; Plasser, F.; do Monte, S. A.; Ventura, E.; Mukherjee, S.; Barbatti, M.; J. Chem. Phys. 2022, 157, 154305. [Crossref]
    » Crossref
  • 9 Lakowicz, J. R.; Principles of Fluorescence Spectroscopy; Springer: Boston, MA, 2006. [Crossref]
    » Crossref
  • 10 Valeur, B. In Topics in Fluorescence Spectroscopy, vol. 4; Lakowicz, J. R., eds.; Springer: Boston, MA, 2002, p. 21. [Crossref]
    » Crossref
  • 11 Gajalakshmi, D.; Tamilmani, V.; Jaccob, M.; Theor. Chem. Acc. 2021, 140, 81. [Crossref]
    » Crossref
  • 12 Huix-Rotllant, M.; Ferré, N.; Barbatti, M. In Quantum Chemistry and Dynamics of Excited States: Methods and Applications; González, L.; Lindh, R., eds.; Wiley: New York, USA, 2020, ch. 2. [Crossref]
    » Crossref
  • 13 Song, H. O.; Lee, B.; Bhusal, R. P.; Park, B.; Yu, K.; Chong, C. K.; Cho, P.; Kim, S. Y.; Kim, H. S.; Park, H.; PLoS One 2012, 7, e48459. [Crossref]
    » Crossref
  • 14 Bai, Y.; Vogt-Maranto, L.; Tuckerman, M. E.; Glover, W. J.; Nat. Commun. 2022, 13, 7044. [Crossref]
    » Crossref
  • 15 Cignoni, E.; Suman, D.; Nigam, J.; Cupellini, L.; Mennucci, B.; Ceriotti, M.; ACS Cent. Sci. 2024, 10, 637. [Crossref]
    » Crossref
  • 16 Nayak, P. K.; Perez, C. M.; Liu, D.; Prezhdo, O. V.; Ghosh, D.; Chem. Mater. 2024, 36, 3875. [Crossref]
    » Crossref
  • 17 Srsen, S.; von Lilienfeld, O. A.; Slavicek, P.; Phys. Chem. Chem. Phys. 2024, 26, 4306. [Crossref]
    » Crossref
  • 18 Terrones, G. G.; Duan, C.; Nandy, A.; Kulik, H. J.; Chem. Sci. 2023, 14, 1419. [Crossref]
    » Crossref
  • 19 Verma, S.; Rivera, M.; Scanlon, D. O.; Walsh, A.; J. Chem. Phys. 2022, 156, 134116. [Crossref]
    » Crossref
  • 20 Vinod, V.; Maity, S.; Zaspel, P.; Kleinekathoefer, U.; J. Chem. Theory Comput. 2023, 19, 7658. [Crossref]
    » Crossref
  • 21 Fliszkiewicz, B.; Chem. Data Collect. 2022, 37, 100810. [Crossref]
    » Crossref
  • 22 Mai, J.; Lu, T.; Xu, P.; Lian, Z.; Li, M.; Lu, W.; Dyes Pigm. 2022, 206, 110647. [Crossref]
    » Crossref
  • 23 Sun, D.; Zhou, C.; Hu, J.; Li, L.; Ye, H.; Food Chem. 2023, 408, 135166. [Crossref]
    » Crossref
  • 24 Zhang, K.; Wang, Z.; Liu, H.; Perea-Lopez, N.; Ranasinghe, J. C.; Bepete, G.; Minns, A. M.; Rossi, R. M.; Lindner, S. E.; Huang, S. X.; Terrones, M.; Huang, S.; ACS Photonics 2022, 9, 2963. [Crossref]
    » Crossref
  • 25 Senanayake, R. D.; Yao, X.; Froehlich, C. E.; Cahill, M. S.; Sheldon, T. R.; McIntire, M.; Haynes, C. L.; Hernandez, R.; J. Chem. Inf. Model. 2022, 62, 5918. [Crossref]
    » Crossref
  • 26 Stanev, V.; Maehashi, R.; Ohta, Y.; Takeuchi, I.; Artif. Intell. Chem. 2023, 1, 100002. [Crossref]
    » Crossref
  • 27 Storm, F. E.; Folkmann, L. M.; Hansen, T.; Mikkelsen, V. K.; J. Mol. Model. 2022, 28, 313. [Crossref]
    » Crossref
  • 28 Welborn, M.; Cheng, L.; Miller III, T. F.; J. Chem. Theory Comput. 2018, 14, 4772. [Crossref]
    » Crossref
  • 29 Woon, K. L.; Chong, Z. X.; Ariffin, A.; Chan, C. S.; J. Mol. Graphics Modell. 2021, 105, 107891. [Crossref]
    » Crossref
  • 30 Wang, F.; Vasilyev, V.; Artif. Intell. Chem. 2023, 1, 100023. [Crossref]
    » Crossref
  • 31 Duarte, J. C.; da Rocha, R. D.; Borges, I.; Phys. Chem. Chem. Phys. 2023, 25, 6877. [Crossref]
    » Crossref
  • 32 de Castro, G. M.; Duarte, J. C.; Borges, I.; J. Org. Chem. 2023, 88, 9791. [Crossref]
    » Crossref
  • 33 de Castro, G. M.; Borges, I.; J. Comput. Chem. 2023, 44, 2256. [Crossref]
    » Crossref
  • 34 Máximo-Canadas, M.; Duarte, J. C.; Nichele, J.; Alves, L.; Pereira, L. O.; Ramos, R.; Borges, I.; ChemRxiv 2024 [Crossref]
    » Crossref
  • 35 Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E.; J. Mach. Learn. Res. 2011, 12, 2825. [Link] accessed in February 2025
    » Link
  • 36 Liang, J.; Ye, S.; Dai, T.; Zha, Z.; Gao, Y.; Zhu, X.; Sci. Data 2020, 7, 400. [Crossref]
    » Crossref
  • 37 Kim, J. H.; Kim, H.; Kim, W. Y.; Bull. Korean Chem. Soc. 2022, 43, 645. [Crossref]
    » Crossref
  • 38 Tan, Z.; Li, Y.; Zhang, Z.; Penfold, T.; Shi, W.; Yang, S.; Zhang, W.; New J. Chem. 2023, 47, 9550. [Link] accessed in February 2025
    » Link
  • 39 Tan, Z.; Li, Y.; Wu, X.; Zhang, Z.; Shi, W.; Yang, S.; Zhang, W.; RSC Adv. 2023, 13, 1031. [Crossref]
    » Crossref
  • 40 Zia, K.; Khan, S. A.; Ashraf, S.; Nur-e-Alam, M.; Ahmed, S.; Ul-Haq, Z.; J. Mol. Struct. 2021, 1231, 129953. [Crossref]
    » Crossref
  • 41 Zhang, L.; Zhang, S.; Owens, A.; Yurchenko, S. N.; Dral, P. O.; Sci. Data 2022, 9, 84. [Crossref]
    » Crossref
  • 42 Johnson, M. S.; Dong, X.; Dana, A. G.; Chung, Y.; Farina Jr., D.; Gillis, R. J.; Liu, M.; Yee, N. W.; Blondal, K.; Mazeau, E.; Grambow, C. A.; Payne, A. M.; Spiekermann, K. A.; Pang, H. W.; Goldsmith, C. F.; West, R. H.; Green, W. H.; J. Chem. Inf. Model. 2022, 62, 4906. [Crossref]
    » Crossref
  • 43 Chatr-Aryamontri, A.; Ceol, A.; Palazzi, L. M.; Nardelli, G.; Schneider, M. V.; Castagnoli, L.; Cesareni, G.; Nucleic Acids Res. 2007, 35, D572. [Crossref]
    » Crossref
  • 44 Zdrazil, B.; Felix, E.; Hunter, F.; Manners, E. J.; Blackshaw, J.; Corbett, S.; de Veij, M.; Ioannidis, H.; Lopez, D. M.; Mosquera, J. F.; Magarinos, M. P.; Bosc, N.; Arcila, R.; Kizilören, T.; Gaulton, A.; Bento, A. P.; Adasme, M. F.; Monecke, P.; Landrum, G. A.; Leach, A. R.; Nucleic Acids Res. 2024, 52, D1180. [Crossref]
    » Crossref
  • 45 Oughtred, R.; Rust, J.; Chang, C.; Breitkreutz, B. J.; Stark, C.; Willems, A.; Boucher, L.; Leung, G.; Kolas, N.; Zhang, F.; Dolma, S.; Coulombe-Huntington, J.; Chatr-aryamontri, A.; Dolinski, K.; Tyers, M.; Protein Sci. 2021, 30, 187. [Crossref]
    » Crossref
  • 46 Irwin, J. J.; Tang, K. G.; Young, J.; Dandarchuluun, C.; Wong, B. R.; Khurelbaatar, M.; Moroz, Y. S.; Mayfield, J.; Sayle, R. A.; J. Chem. Inf. Model. 2020, 60, 6065. [Crossref]
    » Crossref
  • 47 Himmetoglu, B.; J. Chem. Phys. 2016, 145, 134101. [Crossref]
    » Crossref
  • 48 Posenitskiy, E.; Chilkuri, V. G.; Ammar, A.; Hapka, M.; Pernal, K.; Shinde, R.; Borda, E. J. L.; Filippi, C.; Nakano, K.; Kohulák, O.; Sorella, S.; Castro, P. O.; Jalby, W.; Ríos, P. L.; Alavi, A.; Scemama, A.; J. Chem. Phys. 2023, 158, 174801. [Crossref]
    » Crossref
  • 49 Weininger, D.; Weininger, A.; Weininger, J. L.; J. Chem. Inf. Comput. Sci. 1989, 29, 97. [Crossref]
    » Crossref
  • 50 Landrum, G.; RDKit Documentation, 2019. [Link] accessed in February 2025
    » Link
  • 51 Landrum, G.; MACCSkeys with Python code, version 2023.03; London, England, 2024. [Link] accessed in February 2025
    » Link
  • 52 van Rossum, G.; Python Programming Language, version 3.8; Python Software Foundation, Wilmington, DE, USA, 2024. [Link] accessed in February 2025
    » Link
  • 53 Gosling, J.; Joy, B.; Java Programming Language, version 23; Oracle, Austin, TX, USA, 2024. [Link] accessed in February 2025
    » Link
  • 54 Hejlsberg, A.; C# Programming Language, version 8.0; Microsoft Corporation, Redmond, USA, 2024. [Link] accessed in February 2025
    » Link
  • 55 Munshi, J.; Chen, W.; Chien, T.; Balasubramanian, G.; J. Chem. Inf. Model. 2021, 61, 134. [Crossref]
    » Crossref
  • 56 Open Babel, version 3.0; The Open Babel Team, Boston, MA, USA, 2024. [Link] accessed in February 2025
    » Link
  • 57 Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G.; J. Chem. Inf. Comput. Sci. 2002, 42, 1273. [Crossref]
    » Crossref
  • 58 Rogers, D.; Hahn, M.; J. Chem. Inf. Model. 2010, 50, 742. [Crossref]
    » Crossref
  • 59 Pandala, S. R.; Lazy Predict, version 2023; Shankar Rao Pandala, Karnataka, India, 2024. [Link] accessed in February 2025
    » Link
  • 60 Breiman, L.; Mach. Learn. 2001, 45, 5. [Crossref]
    » Crossref
  • 61 Lundberg, S. M.; Lee, S. I.; SHAP: SHapley Additive exPlanations, version 0.46.0; GitHub, USA, 2025. [Link] accessed in February 2025
    » Link
  • 62 Walczak, S.; Advanced Methodologies and Technologies in Artificial Intelligence, Computer Simulation, and Human-Computer Interaction; IGI Global Scientific Publishing: Pennsylvania, USA, 2019, p. 40. [Crossref]
    » Crossref
  • 63 Taud, H.; Mas, J. F. In Geomatic Approaches for Modeling Land Change Scenarios; Olmedo, M. C.; Paegelow, M.; Mas, J. F.; Escobar, F., eds.; Springer: Cham, 2018, p. 451. [Crossref]
    » Crossref
  • 64 Carvalho, D. V.; Pereira, E. M.; Cardoso, J. S.; Electronics 2019, 8, 832. [Crossref]
    » Crossref
  • 65 Lundberg, S. M.; Lee, S. I.; 31st Conference on Neural Information Processing Systems (NIPS 2017); Long Beach, CA, USA, 2017. [Link] accessed in February 2025
    » Link
  • 66 GitHub repository, https://github.com/Quimica-Teorica-IME/Ml-pred-of-electronic-excited-state-properties-of-molecules, accessed in February 2025.
    » https://github.com/Quimica-Teorica-IME/Ml-pred-of-electronic-excited-state-properties-of-molecules
  • 67 Grimme, S.; Bannwarth, C.; J. Chem. Phys. 2016, 145, 054103. [Crossref]
    » Crossref
  • 68 Hall, L. H.; Kier, L. B.; J. Mol. Graphics Modell. 2001, 1, 4. [Crossref]
    » Crossref
  • 69 Bhatta, R. S.; Tsige, M.; Polymer 2015, 56, 293. [Crossref]
    » Crossref
  • 70 Labute, P.; J. Mol. Graphics Modell. 2000, 18, 464. [Crossref]
    » Crossref

Edited by

  • Editor handled this article:
    Paula Homem de Mello (Associate)

Publication Dates

  • Publication in this collection
    21 Mar 2025
  • Date of issue
    2025

History

  • Received
    21 Oct 2024
  • Accepted
    07 Mar 2025
location_on
Sociedade Brasileira de Química Instituto de Química - UNICAMP, Caixa Postal 6154, 13083-970 Campinas SP - Brazil, Tel./FAX.: +55 19 3521-3151 - São Paulo - SP - Brazil
E-mail: office@jbcs.sbq.org.br
rss_feed Stay informed of issues for this journal through your RSS reader
Accessibility / Report Error