Open-access How Could Perform a Predictive Model Created Using the Heterogeneous Chemical Space of Compounds against Trypanosoma cruzi?

Abstract

The limited treatment options for Chagas disease (CD), coupled with the side effects of approved drugs, highlight the urgent need to identify novel therapeutic candidates for this neglected disease. To address this challenge, we developed a strategy to assess whether existing data on this topic could support a comprehensive approach for predicting the antitrypanosomal properties of natural and synthetic compounds against Trypanosoma cruzi. The chemical space constructed in this study comprised various classes of compounds, which were subjected to machine learning-based screening to evaluate the robustness of different methodologies for handling heterogeneous data. The optimized models accurately predicted the antitrypanosomal activity of multiple compounds in the external test set, achieving correlation coefficient of the training set (R2) = 0.995, robustness (Q2) = 0.935, root mean squared error (RMSE) = 0.366, percentage split (P2) = 980, and correlation with predictions on the external test set (P2(test set)) = 0.888. Additionally, twenty-five compounds were tested in vitro against T. cruzi trypomastigotes, and the model successfully predicted their logarithm of half maximal inhibitory concentration (pIC50) values. This study proposes an approach for developing in house methodologies to identify promising candidates for biological assays by leveraging machine learning within a heterogeneous chemical space.

Keywords:
Trypanosoma cruzi; QSAR; trypanocidal activity; Random Forest; machine learning techniques


Introduction

Chagas disease (CD), caused by the protozoan Trypanosoma cruzi, is mainly transmitted by the infected hematophagous triatomine vector. High incidence of CD is often associated with poor sites and mostly endemic to tropical areas, such as Latin America.1 Currently, people have been exposed to many others routes of infection such as oral transmission by ingestion of contaminated food, vertical transmission, blood transfusion, or organ transplantation.2 Many cases of infection are already identified in non-endemic areas such as Europe, North America, Japan and Australia, due to human migration.3 Additionally, it is estimated that about 5.7 million people are infected by T. cruzi worldwide, resulting in at least 7,000 deaths per year.4

There are two main phases in CD: (i) acute phase, in which the absence or no specificity of symptoms (such as fever and malaise) appear from one to two weeks after infection; and (ii) chronic phase, admitting determined (in which individuals develop cardiac, digestive or both issues), or undetermined symptoms (keeping asymptomatic over time or even a lifetime). The treatment in the first phase results in a high chance of cure. However, if not properly treated, the condition may advance to the chronic phase.5

The treatment of CD includes drugs such as benznidazole and nifurtimox, and the first is generally more efficacious. However, dermatological, gastrointestinal and neurological effects are reported.2 Since these drugs show effectiveness during the acute phase of CD, the challenge is the correct identification in the early stages of infection to initiate the treatment.6 Another aggravating problem relies on the resistance developed by different strains of T. cruzi, highlighting a lack of efficient drugs against CD.6 Furthermore, in Brazil, only benznidazole is available for the clinical treatment of patients and is the only one in the market in several countries. Since the 80’s, nifurtimox was first commercially discontinued in Brazil and later in other South American countries due to numerous side effects.7

The life cycle of T. cruzi comprises three forms of the parasite: epimastigote, amastigote, and trypomastigote, in which the first two are reproductive and the last, the infective. Trypomastigotes reach cells and tissues, differentiating to amastigotes for multiplication. Once they multiply, they return to trypomastigote, causing host cell lysis to reach the extracellular space, resulting in an increasing number of parasites available to enter into healthy organs and tissues, spreading the infection.8,9 Thus, this clinical form becomes a promising target for novel potential drugs, and the focus of our research group and this work.

In the search for effective alternatives against T. cruzi, several reports investigated the antitrypanosomal effect of a wide range of compounds of several classes. These initiatives resulted in a vast and heterogeneous chemical space of natural products, semi-synthetic, and synthetic compounds with antitrypanosomal potential. In addition, the use of machine learning and artificial intelligence has been largely used to support and to develop new alternatives and the discovery of new compounds against neglected diseases.10 In this context, we compiled literature data to investigate machine learning techniques for developing a robust predictive model capable of estimating antitrypanosomal activity. This strategy led to the construction of a quantitative structure-activity relationship (QSAR) model within a heterogeneous chemical space, utilizing molecules reported in the scientific literature. Several parameters were considered during dataset construction, optimization, and model validation, including feature selection, outlier treatment, resampling strategies, algorithm tuning, scrambling, goodness-of-fit assessment, robustness evaluation, predictive power estimation, correlation coefficient analysis, and root mean squared error (RMSE) calculations for both training and test sets. To evaluate the performance of this methodology, several compounds were tested in vitro against T. cruzi trypomastigotes. The QSAR model developed herein accurately predicted the biological properties of compounds with high robustness and predictive power, using the infective form of T. cruzi as the study target. It is essential to emphasize that this comprehensive technique, capable of predicting the activity of diverse compound classes against CD, contributes significantly to the drug discovery process aiming to improving the health conditions of infected individuals.

Experimental

Heterogeneous datasets

The dataset was elaborated with several classes of compounds tested against trypomastigotes of T. cruzi. The prediction model was composed of a training set and an external test set for evaluating improvements. Compounds that populated the training set reached 1000 molecules tested for the antitrypanosomal activity. In addition, an in house external test set was created with different classes of compounds with 10 unique instances with the sole purpose of monitoring the results of optimizations. The training dataset was checked against duplicates using Jaccard Index of similarity based on their molecular descriptor values. The value of Jaccard Index (JI) = 1 (100% of similarity) was accepted for excluding repeated instances with same attributes values.

Y-axis normalization (the axis of biological properties)

The values of biological activities obtained as half maximal inhibitory concentration (IC50) comprehended different scales, such as µg mL-1, mM, µM, M and pIC50. The values were normalized by converting them to the logarithmic scale based on the negative logarithm of the molar concentration (-log[M] = pIC50), normalizing values of antitrypanosomal activity for each instance. A histogram of pIC50 values is available at Figure S1 (Supplementary Information (SI) section). This normalization step of the Y-axis improves the predictive power of the biological activity of novel instances.11,12

General structure normalization half maximal inhibitory concentration

Individual “.mol” files of each compound were obtained from Chemspider,13 Pubchem14 or by converting the International Union of Pure and Applied Chemistry (IUPAC) name from the source paper into a chemical software. Each structure was curated and named according to its classification and number from the source paper. The “.mol” file (2D) created for each instance was submitted to the software Avogadro15 for structure checking and to build a rough 3D geometry.16 The 3D structure was exported as a MOPAC file17 (calculation: geometry optimization; method: PM7; format: cartesian as “papernumber-author-compoundclass-compoundnumber.mop”). Each structure was minimized with the software MOPAC2016 using the semi-empirical method PM7.18 This minimization resulted in a singular “.out” file, which was converted into “.mol” file with 3D coordinates. The final minimized structure was submitted to the software PaDEL-Descriptor version 2.2119 for the calculation of 2D and 3D molecular descriptors.20

Molecular descriptors calculation

The set of molecular descriptors was calculated for each compound using the software PaDEL-Descriptor version 2.21.20 The results included all the 2D and 3D descriptors. To the PaDEL standardize section, only the standardize nitro groups was selected. In total, 1875 descriptors were calculated (1444 2D, and 431 3D) for each minimized structure.

Feature selection

The 1875 molecular descriptors, calculated individually for each instance, were organized in a global dataset (descriptors values as attributes) associated with their respective Y-axis (biological properties as variables dependent of X) in a matrix instances vs. attributes type. This dataset was submitted to the Machine Learning Tool Weka-3.8.1,21,22 in a “.csv” comma delimited file format. In the preprocess tab, the dataset was loaded and the biological activity was selected as the class attribute. Filters were applied to the entire matrix to select features in which the variance significantly explains the instances according to their biological properties. Three supervised methods were adopted for features selection with the CfsSubsetEval (-P 1 -E 1) evaluator using (i) Best First: -D1 -N5; (ii) Greedy Stepwise: -T -1.7976931348623157E308 -N 1 -num-slots 1; and (iii) Genetic Search: -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1. After feature selection, the evaluator ReliefFAttributeEval (-M -1 -D 1 -K 10) was applied to the final group of attributes with Ranker (-T -1.7976931348623157E308 -N -1) as search engine. The Ranker algorithm sorts attributes by their individual evaluations in descending order (e.g., most important attributes first). The first 20 attributes were kept in each final training dataset along with the class attribute (pIC50).

Selection of classifier models

The selection of classifier algorithms comprehended the available Machine Learning Techniques in Weka 3.8.1 with capabilities for numeric class and attributes. The algorithm selection step was based on the values of Q2 (robustness) obtained with the training set originally with 20 best attributes ranked. Two raw datasets ((a) 20 attributes selected by Best First and Greedy Stepwise, and (b) 20 attributes selected by Genetic Search) with their 20 ranked attributes were individually submitted to several classification methods: (i) functions: Gaussian processes (with Polykernel, Puk, and RBFKernel), LeastMedSq, LinearRegression, MultilayerPerceptron, RFBNetwork, RBFRegression, SimpleLinearRegression, and SMOReg; (ii) lazy: IBk, KStar, and LWL; (iii) rules: DecisionTable, M5Rules, and ZeroR; (iv) trees: DecisionStump, M5P, RandomForest, RandomTree, and RepTree. Each algorithm resulted in one value of robustness (Q2) based on 10-fold cross-validation for each subset (a and b). A cut-off value was accepted as Q2 > 0.55 for selection of algorithms.

Reducing the model complexity

Attributes were internally removed from the dataset in the software Weka-3.8.1 to reduce the complexity and to evaluate the resulting robustness (Q2) from 20th to the 10th, respecting the ranking order. Each reduced subset was tested for robustness (Q2) with 10-fold cross-validation. The better cost-benefit (lower number of attributes with higher robustness and moderate complexity) was selected for optimizations.23 The number of attributes accepted was selected by comparing robustness values resulted from the selected classifier models and RMSE.

External test sets

An in house dataset was created only for monitoring and evaluation of the classifier and training set improvements. A series of 10 unique compounds had their structures normalized, attributes selected and Y-axis added according to the method. The external test set included a fenarimol derivative;24 an aryl thiosemicarbazone;25 a lactone natural product enhydrin;26 a phenylpyrimidine-5-carboxamide derivative;27 a methanesulfonamide derivative;27 a phenolic natural product named penta-O-galloyl-²-D-glucose;28 eupomatenoid 5;29 polygodial;30 jacaranone;31 and a quinoxalidine-oxadiazole derivative.32 The external test set was randomly chosen and populated with values of molecular descriptors for these compounds and their original values of pIC50. The values obtained for their classifications were only used for monitoring each step in our investigation.

Resampling

The dataset was optimized using different strategies. The external test set was used from the first to the last optimization monitoring improvements of internal parameters (lower prediction errors and higher correlation coefficients with the Y-axis). Strategies: exclusion of outliers (identified in the Targeted Projection Pursuit (TPP) version 1.0.6 Projection Plot of Weka); Class Balancer (as a supervised filter for balancing classes of instances); SMOTE I and II (resamples the dataset by applying the Synthetic Minority Oversampling Technique); and replicating of instances (add replicates of each instance in the original and balanced dataset).

Random Forest tuning

Algorithm tuning was applied to Random Forest for reaching higher values of goodness-of-fit, robustness, and predictive power. Random Forest parameters: number of iterations = 200; maxDepth = 0; seed = 6; number of features = 0; BreakTiesRandomly = true; other parameters = default.

Overfitting detection

The scrambling technique was used for testing the model against overfitting. This validation step was performed by destroying the structure-activity relationship of each instance by randomly changing their values of pIC50. The activity values were calculated according to the best model and the method was tested for adjusting the outputs to the new situation.33 The final scrambled dataset was then tested with the optimized method for the evaluation of goodness-of-fit, robustness and predictive power, aiming to obtain values lower than 0.5, the maximum accepted for validating prediction models.34-36 The final dataset is available at the open science repository.37

Antitrypanosomal assay

The test set with the values of antitrypanosomal activity of 25 compounds was kindly provided from Instituto Butantan, in which the results of biological activity against trypomastigotes of T. cruzi followed the methodology described ahead. Trypomastigotes of T. cruzi (Y strain) were maintained in Rhesus monkey kidney cells (LLC MK2 ATCC CCL 7), cultivated in Roswell Park Memorial Institute (RPMI)-1640 medium supplemented with 2% fetal bovine serum (FBS) at 37 °C in a 5% CO2-humidified incubator. To determine the 50% inhibitory concentration (IC50) against T. cruzi, trypomastigotes were obtained from LLC-MK2 cultures previously infected, seeded at 1 × 106 cells per well in 96-well plates, and incubated with serial dilutions of tested compounds (150 1.71 µM), during 24 h at 37 °C in a 5% CO2-humidified incubator. The trypomastigote viability was determined by the resazurin assay (0.011% in phosphate buffered saline (PBS)). The optical density was determined in FilterMax F5 (Molecular Devices) at 570 nm.38

Results and Discussion

Currently, global prediction models are possible due to mainly four factors: (i) availability of data; (ii) great number of tools for calculating molecular descriptors (iii) sophisticated machine learning methods available for different purposes; and (iv) processing power that made possible the use of these methods in low and high dimensional datasets.39 In addition, artificial intelligence innovations is revealing a high priority in drug design through the enhancement of machine learning approaches by the collection of chemical and pharmacological data available.40

The great challenge of creating prediction models is to estimate properties of new molecules in which algorithms need to find patterns in heterogeneous datasets to create outputs as decision-makers.41 Heterogeneous methods are naturally dense, reducing their complexity is an essential step for respecting parsimony.23,42 The investigation made herein presented a reduced number of attributes and simple resampling, which resulted in appropriate values of internal and external validations.

An extensive literature review provided 1000 compounds tested against trypomastigotes of T. cruzi and their antitrypanosomal activity. The obtained antitrypanosomal activity comprehended values in µg mL-1, mM, µM, M and pIC50 that were converted to pIC50, affording a uniform scale for the Y-axis.11,12 All structures were obtained from their respective paper converted as “.mol” files curated in the software Avogadro 1.2.011 and converted to 3D structures using MOPAC.43 The 3D structures were minimized using the algorithm PM744 and submitted to the software PaDEL 2.220 for molecular descriptors calculation. The resulting matrix containing 1000 instances and 1876 attributes (including the Y-axis as class attribute) was submitted to feature selection methods in the software Weka 3.8.1.41

Datasets were submitted to a global evaluation of machine learning techniques to select methods for improvements with four classes of algorithms: Functions (10 algorithms), Lazy (3 algorithms), Rules (3 algorithms), and Trees (5 algorithms) based on their robustness performances (Q2-leave-many-out 10-fold cross-validation).

The dataset was populated with amines, amides, amidines and analogues (7% of instances); several classes of compounds in clinical use (7% of instances); fenarimols derivatives (12%); lapachone derivatives (4%); several classes of natural products (32%); pyrone and pyranone derivatives (3%); quinonoids derivatives (6%); thiazoles, thiazolidines, and thiazolidiones derivatives (12%); thiosemicarbazones (5%); and triazoles (6%). Other classes of compounds comprehended around 6% of the instances. These classes were organized for illustrative purposes in Figure 1. The chemical space of the method is available in the SI section (Table S1).

Figure 1
Simplified chemical space derived from the literature, associated with compounds exhibiting antitrypanosomal activity.

After the attribute selection step, two identical datasets were created using different selection models, according to Figure 2 (dataset a, with attributes selected by Best First and Greedy Stepwise, which resulted in the same selected attributes; and dataset b, with attributes from Genetic Search methods). For more information about the selected attributes see SI section (Tables S2-S3). The only function-based method that reached appropriate values of Q2 was Gaussian processes-Puk. Lazy methods represented by IBK, KStar and LWL, were not robust enough to build the method, in which only KStar reached the minimum Q2 (Q2 > 0.55) considered adequate. Rule-based algorithms did not show good results with the numeric class, and ZeroR presented the lower robustness of all algorithms. The Random Forest algorithm reached the highest robustness value for the model. In addition, the subset a, with attributes from Best First and Greedy Stepwise (which resulted in identical descriptors selected) presented higher robustness than the subset b with attributes selected by Genetic Search. The algorithms with appropriate values of Q2 were chosen respecting the cut-off admitted (Q2 > 0.55) (Figure 2).

Figure 2
Robustness (Q2) of each classifier model evaluated with Weka 3.8.1 obtained with the heterogeneous dataset. These Machine Learning techniques were chosen for the evaluation of their robustness due to their capability of dealing with numeric classes, which is associated with the antitrypanosomal activity scaled as the negative logarithm of the molar concentration (Y-axis) of each instance. Methods: Functions, Lazy, Rules and Trees algorithms. The dashed vertical line represents the cut-off value of robustness (Q2 > 0.55) considered for algorithms selection.

Three methods were selected: (i) Gaussian processes-Puk (Q2 = 0.616/0.543 for Best First and Greedy Stepwise/Genetic Search, respectively; algorithm type: Functions); (ii) KStar (Q2 = 0.561/0.496; algorithm type: Lazy); and (iii) Random Forest (Q2 = 0.634/0.611; algorithm type: Trees).

The complexity of the model was explored and attributes were excluded one at a time from the 20th to the 10th, respecting the ranking order. This analysis was performed for reducing complexity and maintaining robustness as a cost-benefit procedure. The new datasets were tested against robustness based on the results of 10-fold cross-validation, in which the smallest number of attributes that sustained appropriated values of Q2 was selected. This strategy respects the parsimony principle reaching a balance between the number of attributes and Q2 with only the necessary information for modeling.42 A subset of 12 attributes from Best First and Greedy Stepwise, and 13 attributes from Genetic Search provided the best cost-benefit number of attributes vs. Q2 using Random Forest (Figure 3). The subset of attributes selected by Genetic Search did not overwhelm the robustness reached with Best First and Greedy Stepwise. The Q2 obtained with Random Forest to the subset a with 12 attributes exhibited higher robustness than the Q2 obtained with 20 attributes (Q2(20 attributes) = 0.618; Q2(12 attributes) = 0.632) and presented smaller root mean squared error (RMSE(20 attributes) = 0.829; RMSE(12 attributes) = 0.8284). The RMSE is associated with the precision of the prediction model, where a small RMSE is appropriate for an elevated correlation between actual and predicted values.20 These results suggest that equivalent robustness has been achieved with a less complex model.

Figure 3
Reducing the complexity of each subset by removing one attribute at a time and testing their robustness. The 20 attributes firstly selected by Best First and Greedy Stepwise (BF-GS), and Genetic Search (GS) were evaluated with the three selected algorithms Gaussian Processes-Puk, KStar, and Random Forest for the model. The heatmap was calculated in the software Gitools 2.3.145 and the values of robustness were clustered by Q2 (columns) using Euclidean distance. *Number of attributes selected by BF-GS (12 attributes for the subset a). **Number of attributes selected by GS (13 attributes from the subset b. Both with Q2 > 0.60.

The subset b with 20 attributes selected by Genetic Search reached a Q2(20 attributes) = 0.610 with Random Forest, and the most appropriate Q2 was achieved with 13 attributes (Q2(13 attributes) = 0.602). A RMSE(20 attributes) = 0.8631 was slightly better than the RMSE(13 attributes) = 0.868 for this subset. The final list of attributes selected by Best First and Greedy Stepwise presented only one 3D descriptor against three of Genetic Search. This suggests that the Best First and Greedy Stepwise selected fewer complex attributes and thus, resulted in a simpler method. The dataset was searched for outliers based on the Targeted Projection Pursuit (TPP) v. 1.0.6, which allows exploring high dimensional data, seeing the effect of classification and clustering.46 Four outliers were identified and, replaced by new instances and the distribution was observed. The new dataset was submitted to TPP and re-checked. The final dataset presented a more homogeneous distribution of instances and biological activities (Figure 4).

Figure 4
Projection plot based on descriptors (axis) and biological values of activity (circles) obtained with the Targeted Projection Pursuit tool in Weka 3.8.1. (a) Projection including outliers. (b) The chemical space projected with outliers replaced by new instances resulting in a more homogeneous distribution.

Figure 5
Results of internal (R2, Q, and P2) and external validations (P2 test set) for each resampling method applied to the original dataset (with outliers removed). *Dashed line represents the minimum value acceptable for prediction models for R2, Q2, and P2, which is 0.6. **Dashed line represents the maximum value acceptable for scrambling technique for R2, Q2, and P2, which is 0.5. Scrambling and tuning were applied only to Random Forest algorithm.

The results of internal and external validation reached with the dataset without outliers were higher for Gaussian Processes and Puk and Kstar, and equivalent for Random Forest. Hence, the dataset was submitted to improvements and resampling, as well as tuning procedures.

An external test set containing 10 unique compounds was randomly created for monitoring improvements reached by resampling and algorithm tuning. The test set included molecules with values of fragC from 13.04 to 762.26 (fragC is a descriptor that calculates the fragments complexity of a molecular system based on its structure, considering rings, rings systems, side chains, functional groups, totaling twelve functions).47 Resampling strategies, such as Synthetic Minority Oversampling Technique (SMOTE I, II, and replication), and class balancer were explored using the optimized dataset. Results were evaluated based on goodness-of-fit (R2: given by the correlation coefficient of the training set), robustness (Q2: given by the 10-fold cross-validation results), percentage split (P2: given by the internal predictive power of the model-split from 65 to 99%), and correlation with predictions on the external test set (P2(test set)) (Figure 5). Validation values from 65 to 99% of instances (only used for our monitoring) are available in the SI section (Table S4).

Figure 5 shows the four parameters evaluated for each resampling technique and their respective values of goodness-of-fit, robustness, internal, and external predictive power. These resampling strategies were performed independently. Two classes were created by clustering instances of the original dataset with TPP in Weka 3.8.1. The result demonstrated that classes were balanced containing 532 (cluster 1) and 468 (cluster 2) instances each (proportion 1.13:1). Class balancer was applied and instances were reallocated totaling 500 instances in each class. Although improvements were achieved in some cases, the internal parameters (R2, Q2, and P2) declined. The resampling strategy, SMOTE, was applied to the same clusters created with TPP for synthetically adding instances to the dataset. Results were similar to those of class balancer. The new synthetic dataset after SMOTE I contained 1468 instances, in which 468 were created by replicating those present in cluster 2. A second resampling, SMOTE II, applied to the synthetic dataset created by SMOTE I reached good values of internal validation with Gaussian processes-Puk, KStar, and Random Forest (with 2000 instances, in which more 532 instances were synthetically created to the cluster 1) (Figure 5). Thus, a resampling process based on replicating instances was performed and higher values of internal and external validation were achieved. Random Forest exhibited better performance and was tuned. The final algorithm configuration accepted (number of interactions = 200; maxDepth = 0; seed = 6; number of features = 0; breaktiesrandomly = true) resulted in R2 = 0.995, Q2 = 0.935 - RMSE = 0.366, P2 = 0.980, and P2(test set) = 0.888. Scrambling values: R2 = 0.365, Q2 = 0.033 - RMSE = 7.374, P2 = 0.000, with no overfitting detected. Additionally, the difference between R2 and Q2 in the model is lower than 0.3, indicating that the model is not suspect of overfitting.48 After optimizations, prediction results on the external test set were accurate, estimating the antitrypanosomal potential of compounds in the test set correctly (Table 1). The interpretation of results is available at SI section (Table S5 and Figure S2).

Table 1
Actual and predicted values compared for the 10 instances randomly chosen for monitoring. Predictions values of pIC50 for the external test set were obtained with the subset a and the tuned Random Forest algorithm

Based on the validation values obtained for the model and accurate predictions, a new experimental test set was obtained containing twenty-five structurally diversified compounds and was submitted to the prediction model and in vitro antitrypanosomal assay. The same procedures for molecular minimization and descriptors calculations were performed for each structure resulting in a new test set. The molecular complexity of the experimental test set ranged from 12.05 to 724.1 (fragC values), with different classes of compounds. The chemical space of training set and test set is available in the SI section (Figure S3). Table 2 demonstrates the pIC50 (-log[M]) and IC50 (µM) obtained from the in vitro assay and their predicted values for antitrypanosomal activity.

Table 2
Results of the biological assay against trypomastigotes of T. cruzi and the prediction model, both measured in pIC50 and IC50

The selection of a machine learning technique capable of dealing with the chemical space against trypomastigotes of T. cruzi pointed to Random Forest as the most appropriate algorithm for a predictive model based on a heterogeneous chemical space. Random Forest is sensitive to the number and type of instances. In this sense, resampling methods dramatically affect its predictive power.49 Data modifications such as removing outliers, class balancer, and SMOTE are applied for this specific purpose,21 where synthetically replicating instances afforded optimal validation results for predictions.

In this regard, imbalanced data introduces a bias in the performance of these methods due to the class majority.50 The chemical space, explored and clustered using the TPP, showed that the dataset is balanced with regard to the number of instances and the range of biological activity (pIC50: cluster(replicated) 1 = 934, cluster 2(replicated) = 1066, status: balanced - proportion 1.13:1), suggesting that the data is appropriate for good predictive models.

Random Forest reached higher validation values for the heterogeneous chemical space due to its ability to increase the association of similar instances while decision trees are created. For Random Forest-based methods, each decision tree is randomly formed in training and partitioned for increasing homogeneity of instances resulting in similar groups of compounds in a tree. The predictive power of the method is benefited by the number of homogeneous trees in which the performance is achieved by the wisdom of crowds effect.39 In addition, Random Forest deals with high-dimensional data, multiples mechanisms of action, and can be used for classification and regression.51

The k-Nearest Neighbors (k-NN) algorithm, represented in this work by the algorithm KStar, even reaching good validation values, did not reach a considerable prediction score due to its inefficiency to extrapolate outputs from high-dimensional data.51 The predictions of KStar in all resampling steps resulted in values lower than P2 = 0.35 for the external test set, not satisfactory for the model. On the other hand, Gaussian processes is considered a powerful method for nonlinear regression with easy implementation, and does not require subjective determination of parameters. The algorithm can handle a large pool of descriptors for the selection of important ones, it is inherently resistant to overtraining, and estimates uncertainty in predictions.52 However, results of Gaussian Processes afforded values of R2, Q2 and P2 lower than those obtained for Random Forest, even before tuning (Figure 5).

The chemical space created a wide application domain for the method, containing from simple (catechol, fragC = 8.02) to complex structures (phenazine derivative, fracgC = 709.05) (Figure 6). The identification of the application of the method was performed by directly associating each tested instance with its nearest neighbors. The instances in the external test set were clustered with training instances using Hierarchical Clustering Analysis measured by Euclidean distance (max distance: 2.745; average distance: 1.373). Smaller distances were attributed to similar molecular descriptors and/or molecules. For example, the chemical space identified for the compound 1,2,3,4,6-penta-O-galloyl-²-D-glucose28 is supported by more than 100 representatives (natural and semi-synthetic) including chalcones, flavonoids, quinonoids, and gallic acid derivatives. In a threshold of 0.9 (Euclidean distance unity), quinonoids and flavonoids were the majority (75 instances). The pIC50 of 1,2,3,4,6-penta-O-galloyl-² D-glucose, the most complex structure on the external test set (fragC = 762.26) was accurately predicted: error(log) = -0.106 (pIC50(actual) = 4.176 (66.68 µM); pIC50(predicted) = 4.07 (85.11 µM)).

Figure 6
Level of molecular complexity based on the fragC descriptor results. These structures illustrate the instances in the dataset that created the chemical space for predictions of pIC50 values. The structures showed herein are components of the method, which presents catechol as the simplest molecular system (fragC = 8.02) whereas the most complex molecular system was a phenazine derivative (fragC = 709.05).

Based on the errors obtained by the output of the predictive model, two test instances, jacaranone25 and heterocyclic-69,24 were selected in a minimum threshold that covered at least 20 instances of the dataset, as the minimum number of compounds for a prediction model. It was possible to identify a homogeneous chemical space for heterocyclic-69, which clustered with 25 fenarimol derivatives in a threshold of 0.378. Jacaranone clustered with 26 instances in a heterogeneous chemical space containing fenarimols, compounds in clinical use, triazoles, triclosan derivatives and quinonoids, with a threshold of 0.487.

The molecular descriptors values for the cluster of heterocyclic-69 are very similar. On the other hand, the cluster associated with jacaranone presents molecules with discrepant descriptor values, resulting in a heterogeneous chemical space for predictions. The error obtained for heterocyclic-69 was minimal (error(log) = 0.264; -0.168 µM; pIC50(actual) = 6.433; pIC50(predicted) = 6.697), while the error for jacaranone was considerably high (error(log) = 0.517; -49.27 µM; pIC50(actual) = 4.150; pIC50(predicted) = 4.667). These results are first attributed to the chemical space identified for each tested instance, as discussed, and concerning their molecular descriptors and the possibility of the predictive model to use similar instances for an accurate output. In this sense, a heterogeneous chemical space within lower threshold values diminishes accuracy, as observed for jacaranone. This interpretation follows the basis of the similar property principle, e.g., similar molecules have similar properties. However, this statement is universally valid not only for similar molecules but also for compounds with similar descriptors. In this regard, training and test sets should share equivalent chemical space otherwise it will result in poor prediction values.39 Our interpretation on those statements suggests that an additional step could be used in order to investigate the outputs from this method: clustering novel instances with the entire dataset. An example of this interpretation is showed ahead on Figure 7.

Figure 7
Sub-clusters of heterocyclic-69 and jacaranone, based on their chemical space for predicting the pIC50. These sub-clusters were obtained with a threshold that covered at least 20 instances for each tested compound. The type of chemical space, homogeneous or heterogeneous and the prediction accuracy are demonstrated. Clustering proceedings were performed in the software MEV 4.9.053 using Hierarchical Cluster Analysis measured by Euclidean distance.

The next step in our work was to use a simple parameter for analyzing predictions from the method. Thus, the results of the predictive model were interpreted according to Cos et al.,54 (with modifications), which discusses that a compound must exhibit an IC50 below 25 µM to be considered relevant against T. cruzi. In this sense, outputs around 0 to 50 µM were considered good candidates for experimental analysis trying to focus on this range of activity.54

It is important to highlight that the prediction method proposed herein do not aim to provide an exact value of the antitrypanosomal activity, but a range of values considered for interpretations. Consequently, it indicates a range of biological activities quantified by the method and interpreted by the cut-off established, which a compound corresponds, due to its molecular features. The interpretation of the outputs following this statement reached an accuracy of 88% on predictions, showing that the predictions of atorvastatin (µM(actual) = 60.1 µM(predicted) = 49.09, false negative), butylated hydroxytoluene (µM(actual) = 36.4, µM(predicted) = 124.45, false positive) and cloxazolam (µM(actual) = 81.6, µM(predicted) = 11.43, false negative) were considered incorrect according with our interpretation. Other instances in the external dataset had their range of predictions considered true positives. The interpretation of the accuracy (created only for in house purposes) is available in the SI section (Table S5).

Concerning QSAR methodologies for T. cruzi, homogeneous chemical spaces are reported continuously (e.g., same class of compounds). The use of several techniques is associated with coumarins,55 flavonoids,35,56 quinolones,57 ²-aminocarbonyl derivatives,58 diphenyl naphthoquinones,59 and arylthiazolylhydrazones.60 These reports individually represent a small part of the chemical space. The use of machine learning by them is restricted to some cases, as in a previous work of our group,56 on which artificial neural networks were created and improvements were obtained for prediction models using flavonoids against T. cruzi. In another case, more than 500 compounds of a restricted chemical space were used for the development of a simple QSAR method for N-oxide containing heterocycles derivatives, where the authors discussed critical electronic properties of compounds against T. cruzi without using machine learning or the development of prediction methods.61 Another study represents a model created with a homogeneous chemical space of sesquiterpene lactones derivatives with 1635 compounds against trypomastigotes of T. cruzi. The authors created the prediction model based on Random Forest, reaching an accuracy from 65.1 to 84.3% in cross-validation and external validation.7 These results corroborate the use of Random Forest for enhancing the performance of prediction models, and illustrate that, even in a heterogeneous dataset with a complex and heterogeneous chemical space proposed herein, the predictions afforded good accuracy, in some cases higher than those models with one class of compounds mentioned before (homogeneous QSAR).

In this sense, this work intended to employ free and user-friendly software, and literature data for exploring machine learning techniques to understand how a method created by gathering information from scientific literature could perform. The method was developed and results of experimental (in vitro) and predicted biological activities were obtained and compared. Several improvements, such as dataset resampling, and algorithms selection were systematically investigated, resulting in 88% of accuracy for outputs interpretation. The strategies and approaches applied herein can shed light to several classes of biological activities that could be focus of novel methodologies for building predictive models, and to estimate or classify compounds based on their chemical space and biological activity.

Conclusions

Random Forest was selected, among other machine learning techniques, as the most appropriate tool for this study, resulting in a high correlation between experimental and predicted values for the antitrypanosomal activity. The strategies applied were substantial for reaching adequate values of goodness-of-fit, robustness and predictive power for predictions with no overfitting detected. The method exhibited an accuracy of 88% for the prediction of several classes of compounds, identifying promising candidates for experimental assays. It is important to highlight that this approach illustrates how novel methods can be built from scientific data in order to select candidates for biological screenings. Our general conclusion is that heterogeneous chemical spaces could provide promising tools for biological assays of novel compounds. Further, additional investigations are underway in order to improve and expand this methodology.

Supplementary Information

Supplementary information (the dataset and method construction) is available free of charge at http://jbcs.sbq.org.br as PDF file.

Data Availability Statement

The final dataset is available at https://doi.org/10.7910/DVN/WCV1FZ.37

Acknowledgments

The authors acknowledge the financial support provided by the Brazilian granting agencies: São Paulo Research Foundation (FAPESP-2023/12447-1, 2021/02789-7, 2018/07885-1, 2018/10279-6, and 2016/19269-8), CAPES, and CNPq. We also thank CNPq scientific research award to JHGL, DACP and AGT.

References

  • 1 Cruz, D. S.; Damasceno, R. F.; Leite, S. F.; Cardoso, M. D.; Almeida, D. N. M.; de Souza, A. B.; de Jesus Santos, A. C.; Veira, T. M.; Ribeiro, A. L. P.; de Oliveira, L. C.; Sabino, E. C.; Haikal, D. S. A.; Ferreira, A. M.; Molina, I.; IJID Regions 2024, 12, 100400. [Crossref]
    » Crossref
  • 2 Bern, C.; N. Engl. J. Med. 2015, 373, 456. [Crossref] [PubMed]
    » Crossref» PubMed
  • 3 Moraes, C. B.; Giardini, M. A.; Kim, H.; Franco, C. H.; Araujo Junior, A. M.; Schenkman, S.; Chatelain, E.; Freitas-Junior, L. H.; Sci. Rep. 2015, 4, 4703. [Crossref]
    » Crossref
  • 4 Zhang, X.-Q.; Spadafora, C.; Pineda, L. M.; Ng, M. G.; Sun, J.-H.; Wang, W.; Wang, C.-Y.; Gu, Y.-C.; Shao, C.-L.; Sci. Rep. 2017, 7, 11822. [Crossref]
    » Crossref
  • 5 Teixeira, A. R. L.; Nitz, N.; Guimaro, M. C.; Gomes, C.; Santos-Buch, C. A.; Postgrad. Med. J. 2006, 82, 788. [Crossref]
    » Crossref
  • 6 Meymandi, S. K.; Hernandez, S.; Forsyth, C. J.; Trends Parasitol. 2017, 33, 828. [Crossref]
    » Crossref
  • 7 Sobrinho, J. L. S.; Medeiros, F. P. M.; la Roca, M. F.; da Silva, K. E. R.; Lima, L. N. A.; Rolim Neto, P. J.; J. Trop. Pathol. 2006, 36, 103. [Crossref]
    » Crossref
  • 8 Acevedo, C. H.; Scotti, L.; Scotti, M. T.; ChemMedChem 2018, 13, 634. [Crossref]
    » Crossref
  • 9 Jardim, G. A. M.; Reis, W. J.; Ribeiro, M. F.; Ottoni, F. M.; Alves, R. J.; Silva, T. L.; Goulart, M. O. F.; Braga, A. L.; Menna Barreto, R. F. S.; Salomão, K.; de Castro, S. L.; da Silva Jr., E. N.; RSC Adv. 2015, 5, 78047. [Crossref]
    » Crossref
  • 10 Winkler, D. A.; Front. Chem. 2021, 9, 614073. [Crossref]
    » Crossref
  • 11 Khalheim, O. M.; Anal. Chim. Acta 1985, 177, 71. [Crossref]
    » Crossref
  • 12 van den Berg, R. A.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Smilde, A. K.; van der Werf, M. J.; BMC Genomics 2006, 7, 142. [Crossref]
    » Crossref
  • 13 Chemspider, https://chemspider.com, accessed in April 2025.
    » https://chemspider.com
  • 14 Pubchem, https://pubchem.ncbi.nlm.nih.gov, accessed in April 2025.
    » https://pubchem.ncbi.nlm.nih.gov
  • 15 Avogadro, https://avogadro.cc, accessed in April 2025.
    » https://avogadro.cc
  • 16 Hanwell, M. D.; Curtis, D. E.; Lonie, D. C.; Vandermeerschd, T.; Zurek, E.; Hutchison, G. R.; J. Cheminform. 2012, 4, 17. [Crossref]
    » Crossref
  • 17 MOPAC, http://openmopac.net, accessed in April 2025.
    » http://openmopac.net
  • 18 Stewart, J. J. P.; MOPAC2016 Stewart Computational Chemistry, vol. 1; Colorado Springs: Colorado, US, 2016.
  • 19 PaDEL-Descriptor, http://www.yapcwsoft.com/dd/padeldescriptor, accessed in April 2025.
    » http://www.yapcwsoft.com/dd/padeldescriptor
  • 20 Yap, C. W.; J. Comput. Chem. 2011, 32, 1466. [Crossref]
    » Crossref
  • 21 Witten, I. H.; Frank, E.; Hall, M. A.; Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: London, UK, 2011.
  • 22 The Weka Workbench, https://www.cs.waikato.ac.nz/ml/weka, accessed in April 2025.
    » https://www.cs.waikato.ac.nz/ml/weka
  • 23 Aptula, A. O.; Jeliazkova, N. G.; Schultz, T. W.; Cronin, M. T. D.; QSAR Comb. Sci. 2005, 24, 385. [Crossref]
    » Crossref
  • 24 Corrêa, D. S.; Tempone, A. G.;Reimão, J. Q.; Taniwaki, N. N.; Romoff, P.; Fávero, O. A.; Sartorelli, P.; Mecchi, M. C.; Lago, J. H. G.; Parasitol. Res. 2011, 109, 231. [Crossref]
    » Crossref
  • 25 Romeiro, N. C.; Aguirre, G.; Hernández, P.; González, M.; Cerecetto, H.; Aldana, I.; Pérez-Silanes, S.; Monge, A.; Barreiro, E. J.; Lima, L. M.; Bioorg. Med. Chem. 2009, 17, 641. [Crossref]
    » Crossref
  • 26 Frank, F. M.; Ulloa, J.; Cazorla, S. I.; Maravilla, G.; Malchiodi, E. L.; Grau, A.; Martino, V.; Catalán, C.; Muschietti, L. V.; J. Evidence-Based Complementary Altern. Med. 2013, 2013, [Crossref]
    » Crossref
  • 27 Botero, A.; Keatley, S.; Peacock, C.; Thompson, R. C. A.; Int. J. Parasitol. Drugs Drug Resist. 2017, 7, ID 627898. [Crossref]
    » Crossref
  • 28 dos Santos, R. T.; Hiramoto, L. L.; Lago, J. H. G.; Sartorelli, P.; Tempone, A. G.; Pinto, E. G.; Lorenzi, H.; Quim. Nova 2012, 35, 2229. [Crossref]
    » Crossref
  • 29 Pelizzaro-Rocha, K. J.; Veiga-Santos, P.; Lazarin-Bidóia, D.; Ueda-Nakamura, T.; Dias Filho, B. P.; Ximenes, V. F.; Silva, S. O.; Nakamura, C. V.; Microbes Infect. 2011, 13, 1018. [Crossref]
    » Crossref
  • 30 Corrêa, D. S.; Tempone, A. G.; Reimão, J. Q.; Taniwaki, N. N.; Romoff, P.; Fávero, O. A.; Sartorelli, P.; Mecchi, M. C.; Lago, J. H. G.; Parasitol. Res. 2011, 109, 231. [Crossref]
    » Crossref
  • 31 Morais, T. R.; Romoff, P.; Fávero, O. A.; Reimão, J. Q.; Lourenço, W. C.; Tempone, A. G.; Hristov, A. D.; Di Santi, S. M.; Lago, J. H. G.; Sartorelli, P.; Ferreira, M. J. P.; Parasitol. Res. 2012, 110, 95. [Crossref]
    » Crossref
  • 32 Patel, N. B.; Patel, J. N.; Purohit, A. C.; Patel, V. M.; Rajani, D. P.; Moo-Puc, R.; Lopez-Cedillo, J. C.; Nogueda-Torres, B.; Rivera, G.; Int. J. Antimicrob. Agents 2017, 50, 413. [Crossref]
    » Crossref
  • 33 Rücker, C.; Rücker, G.; Meringer, M.; J. Chem. Inf. Model. 2007, 47, 2345. [Crossref]
    » Crossref
  • 34 Rasulev, B. F.; Abdullaev, N. D.; Syrov, V. N.; Leszczynski, J.; QSAR Comb. Sci. 2005, 24, 1056. [Crossref]
    » Crossref
  • 35 Baldim, J. L.; Alcântara, B. G. V. de; Domingos, O. S.; Soares, M. G.; Caldas, I. S.; Novaes, R. D.; Oliveira, T. B.; Lago, J. H. G.; Chagas-Paula, D. A.; Oxid. Med. Cell. Longevity 2017, 2017, ID 3789856. [Crossref]
    » Crossref
  • 36 Golbraikh, A.; Tropsha, A.; Mol. Diversity 2000, 5, 231. [Crossref]
    » Crossref
  • 37 Dataset for Baldim, https://doi.org/10.7910/DVN/WCV1FZ, accessed in April 2025.
    » https://doi.org/10.7910/DVN/WCV1FZ
  • 38 Mikus, J.; Harkenthal, M.; Steverding, D.; J. Planta Med. 2000, 66, 66. [Crossref]
    » Crossref
  • 39 Mitchell, J. B. O.; Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 468. [Crossref]
    » Crossref
  • 40 Dara, S.; Dhamercherla, S.; Jadav, S. S.; Babu, C. M.; Ahsan, M. J.; Machine Learning in Drug Discovery: A Review; Springer: Netherlands, 2022.
  • 41 Dinov, I. D.; Gigascience 2016, 5, 1. [Crossref]
    » Crossref
  • 42 Hawkins, D. M.; J. Chem. Inf. Comput. Sci. 2004, 44, 1. [Crossref]
    » Crossref
  • 43 MOPAC2016Â, http://openmopac.net/home.html, accessed in April 2025.
    » http://openmopac.net/home.html
  • 44 Stewart, J. J. P.; J. Mol. Model. 2013, 19, 1. [Crossref]
    » Crossref
  • 45 Perez-Llamas, C.; Lopez-Bigas, N.; PLoS One 2011, 6, e19541 [Crossref]; Gitools, http://www.gitools.org, accessed in April 2025.
    » Crossref» http://www.gitools.org
  • 46 Faith, J.; 11th International Conference Information Visualization (IV ‘07); Zurich, Switzerland, 2007, 286. [Crossref]
    » Crossref
  • 47 Nilakantan, R.; Nunn, D. S.; Greenblatt, L.; Walker, G.; Haraki, K.; Mobilio, D.; J. Chem. Inf. Model. 2006, 46, 1069. [Crossref]
    » Crossref
  • 48 Veerasamy, R.; Rajak, H.; Jain, A.; Sivadasan, S.; Varghese, C. P.; Agrawal, R. K.; Int. J. Drug Des. Disocovery 2011, 2, 511. [Crossref]
    » Crossref
  • 49 Özçift, A.; Comput. Biol. Med. 2011, 41, 265. [Crossref]
    » Crossref
  • 50 Jain, S.; Kotsampasakou, E.; Ecker, G. F.; J. Comput. Aided. Mol. Des. 2018, 32, 583. [Crossref]
    » Crossref
  • 51 Svetnik, V.; Liaw, A.; Tong, C.; Christopher Culberson, J.; Sheridan, R. P.; Feuston, B. P.; J. Chem. Inf. Comput. Sci. 2003, 43, 1947. [Crossref]
    » Crossref
  • 52 Obrezanova, O.; Segall, M. D.; J. Chem. Inf. Model. 2010, 50, 1053. [Crossref]
    » Crossref
  • 53 WebMeV, https://webmev.tm4.org, accessed in April 2025.
    » https://webmev.tm4.org
  • 54 Cos, P.; Vlietinck, A. J.; Berghe, D. Vanden; Maes, L.; J. Ethnopharmacol. 2006, 106, 290. [Crossref]
    » Crossref
  • 55 Menezes, I. R. A.; Lopes, J. C. D.; Montanari, C. A.; Oliva, G.; Pavão, F.; Castilho, M. S.; Vieira, P. C.; Pupo, M. T.; J. Comput. Aided. Mol. Des. 2003, 17, 277. [Crossref]
    » Crossref
  • 56 Tasdemir, D.; Kaiser, M.; Brun, R.; Yardley, V.; Schmidt, T. J.; Tosun, F.; Rüedi, P.; Antimicrob. Agents Chemother. 2006, 50, 1352. [Crossref]
    » Crossref
  • 57 Nefertiti, A. S. G.; Batista, M. M.; da Silva, P. B.; Batista, D. G. J.; da Silva, C. F.; Peres, R. B.; Torres-Santos, E. C.; Cunha-Junior, E. F.; Holt, E.; Boykin, D. W.; Wenzler, T.; Soeiro, M. N. C.; Antimicrob. Agents Chemother. 2018, 62, 1. [Crossref]
    » Crossref
  • 58 da Silva, C. H. T. P.; Bernardes, L. S. C.; da Silva, V. B.; Zani, C. L.; Carvalho, I.; J. Biomol. Struct. Dyn. 2012, 29, 1206. [Crossref]
    » Crossref
  • 59 Ogindo, C. O.; Khraiwesh, M. H.; George, M.; Brandy, Y.; Brandy, N.; Gugssa, A.; Ashraf, M.; Abbas, M.; Southerland, W. M.; Lee, C. M.; Bakare, O.; Fang, Y.; Bioorg. Med. Chem. 2016, 24, 3849. [Crossref]
    » Crossref
  • 60 Noguera, G. J.; Fabian, L. E.; Lombardo, E.; Finkielsztein, L.; Eur. J. Pharm. Sci. 2015, 78, 190. [Crossref]
    » Crossref
  • 61 Cerecetto, H.; González, M.; Mini Rev. Med. Chem. 2008, 8, 1355. [Crossref]
    » Crossref

Edited by

  • Editor handled this article:
    Maurício Coutinho-Neto (Guest)

Publication Dates

  • Publication in this collection
    19 May 2025
  • Date of issue
    2025

History

  • Received
    29 Oct 2024
  • Accepted
    24 Apr 2025
location_on
Sociedade Brasileira de Química Instituto de Química - UNICAMP, Caixa Postal 6154, 13083-970 Campinas SP - Brazil, Tel./FAX.: +55 19 3521-3151 - São Paulo - SP - Brazil
E-mail: office@jbcs.sbq.org.br
rss_feed Acompanhe os números deste periódico no seu leitor de RSS
Reportar erro