Abstract
A better understanding of the problem of protein aggregation encouraged the development of computational tools to help study this phenomenon. Some studies indicate that short segments of amino acids found in amyloid precursor proteins may be involved in the formation of these aggregations. The creation of precise techniques for identifying regions that are prone to aggregation is one of the present issues facing the bioinformatics community. In our earlier work, we developed Magre-I using machine learning techniques. Using a consensus model for classification, this approach made it possible to predict aggregation regions based on the primary amino acid sequence of proteins. We now present an improved method, called Magre-II, which makes use of the three-dimensional (3D) structure of proteins and experimental annotations of aggregation by using a neighborhood sphere analysis model for training and predictions. We show that Magre-II has a high potential for predicting regions that are prone to protein aggregation by comparing it with other commonly used predictors. The obtained scores were as follows: Magre-II (78.2%), Aggrescan3d (76.7%), Waltz (67.4%), and Tango (73.1%) for alpha-synuclein.
Keywords:
aggregation region; proteins; machine learning; prediction
Introduction
Neurodegenerative diseases represent a complex challenge, even in the face of the advances achieved by countless studies.1,2 To date, there is a gap in knowledge about these pathologies and the underlying molecular mechanisms, particularly about the formation of aggregates.1,2 Studies dedicated to investigating aggregation processes and the molecular mechanisms associated with these diseases have explored a variety of forms of aggregation, which have generated opportunities for the development of drugs and therapies.3,4
The study of the molecular interactions involved in the formation of aggregates in mutant structures plays a crucial role in contributing to the understanding of the molecular mechanisms underlying various diseases.3,4 In this context, computer simulations have emerged as a suitable approach for investigating phenomena related to the aggregation of proteins associated with neurodegeneration.4-7 Through these simulations, it is possible to explore in detail the interactions between molecules and understand the structural transformations that lead to the formation of aggregates.4-7
Due to the relevance in understanding the mechanisms related to protein aggregation and its association with diseases, there has been a significant development of computational tools to help study and predict these phenomena.4-7 The underlying hypothesis is that certain characteristics, such as hydrophobicity, charge, size, relative surface area (RSA), among others, combined with an analysis of the three-dimensional (3D) structure, can be part of the approach related to the propensity of each residue of a protein to aggregate.4-7
Proteins perform important tasks such as catalysis of chemical reactions, transport, recognition, and signal transmission.8 Although they can adopt a huge variety of conformations, in most cases there is a preferential, lower-energy conformation known as the native structure. There is a class of proteins that, despite being soluble in certain tissues, can also be found in an insoluble, aggregated form known as amyloid.9 This class of proteins presents two extremely different stable conformations: (i) the native form and (ii) the amyloid form, the latter consisting mainly of β-sheet secondary structure.10 It is important to better understand the processes related to protein aggregation and to improve the knowledge about these proteins, their functionality, and their molecular mechanisms.
Given the importance of protein aggregation, several predictive bioinformatics tools have been developed in the past ten years (see Table 1). Some of these tools calculate propensity based on primary structure and empirical evaluation, while others apply machine learning techniques. Computational models that predict aggregation-prone regions can decrease the time spent on analysis. The machine learning approach combined with experimental data is an important alternative for the creation of these predictors, since it exploits trends that can be captured from the characteristics contained in protein databases. Several of the available algorithms apply machine learning techniques to protein features11 (see Table 1).
Main predictors related to protein aggregation and their methodological bases for evaluation
Some studies and methodologies11-43 that predict aggregation regions in proteins have been identified in the literature, as presented in Table 1, which outlines specific characteristics.11
In comparison, the well-known Aggrescan tool applies an aggregation propensity scale for natural amino acids, derived from in vivo experiments, based on the assumption that short, specific sequences can modulate protein aggregation.7 Its advantages include accuracy, accessibility, ease of use, and the need for only the primary structure; however, it does not work with complex structures, lacks contextual information, and depends on the individual propensity score. Our tool, Magre-II, has the following advantages: (i) it introduces an individual sphere analysis methodology that takes into account the characteristics of neighboring residues; (ii) it predicts aggregation-prone regions with performance similar to or better than the main predictors found in the literature, according to the tests carried out; (iii) it can be retrained with more proteins identified as amyloids, and this flexibility to update the model with new data contributes to its continued robustness and relevance; and (iv) it has a low computational cost, with a new model generated in 4 h on a personal computer. Its disadvantage is that it does not take into account the structural dynamics of proteins, a factor that could be a key point for future versions of Magre-II.
In our previous work,43 we developed a tool to predict aggregation propensity regions based on primary structure, called Magre-I.
This software was based on the support vector machine (SVM) algorithm and predicted aggregation regions using a sliding-window method with windows of 7, 9, and 11 residues from each protein sequence. Most predictors consider primary and secondary structures in their evaluation and make predictions based on calculated properties such as secondary structure propensity and the individual aggregation propensity of amino acids. Our purpose in this new version was the development of a predictor that considers the tertiary structure, i.e., 3D information of protein structure.
The 3D information was incorporated through the 3D structure, the definition of a distance cutoff between residues in the neighborhood sphere, the relative surface area of the residues within the neighborhood sphere (called here the sphere of residues), and the distance between residues. In summary, Magre-I was built on a training set derived from the primary structure of amyloid proteins, sliding windows, and classification predictors, while Magre-II was built on tertiary structure and residue spheres. Tests using Magre-II were in agreement with experimental data for some selected proteins associated with diseases such as Alzheimer’s disease, hereditary transthyretin amyloidosis, and diabetes.
Methodology
The flowchart depicted in Figure 1 presents the methodological steps developed and implemented in the Magre-II version. Briefly, we adopted a methodology based on machine learning, primary structure, tertiary structure, and solvent-accessible surface area to predict the aggregation probability of local regions.
Protein structures
In the first step, we selected 352 protein structures from the AmyPro database,44 including proteins with negative (no aggregate) and positive (aggregate) regions. AmyPro is a database that provides experimental annotations of identified amyloid regions. We also considered the Amyloid database.45 These two databases also report many protein structures (PDB IDs) from similar families of amyloid proteins, such as insulin, prion, tau, etc. The criterion adopted to select protein structures was to consider at least one sample of each protein structure family. The last extraction was on April 1st, 2023.
Calculation of residue’s sphere
To adopt a 3D approach, in step 2 we calculated the Euclidean distance (Dist) between the alpha carbon atoms (Cα) of the amino acid residues, where we considered each protein sequence residue as a central residue (Ri) and searched for peripheral residues (Re) for the composition of the sphere (see Figure 2). Residues within 10 Å of the central amino acid are called peripheral. The selected protein structure was split into individual spheres of residues based on the distance criterion, calculated as the Cα distance (Dist) according to equation 1. Using this approach, it was also possible to calculate the solvent-accessible areas for each residue analyzed.
(a) Schematic illustration of a sphere of residues based on 3D coordinates obtained from the PDB database. (b) Example: PDB ID 1F86 (transthyretin protein (TTR)).
Another feature used was the relative surface area (RSA), the ratio between the accessible surface area (ASA) and the maximum accessible surface area (MaxASA). As shown in equation 2, this metric provides information about the residue’s exposure to the solvent.46 The ASA calculation considered all peripheral residues and was obtained as the sum of all individual residue areas. MaxASA is the maximum possible solvent-accessible surface area for each residue, obtained from Gly-X-Gly tripeptides, where X is the residue of interest. FreeSASA was used to compute the RSA.47
The calculation of distance (Euclidean) was performed as indicated in equation 1, where ΔX, ΔY and ΔZ correspond to the Cα coordinate differences between each central and peripheral residues.
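The distance criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the Magre-II source code: the function names `ca_distance` and `residue_sphere`, the toy coordinates, and the choice to sort neighbors from nearest to farthest are assumptions made for the example. Only the Euclidean distance on Cα coordinates and the 10 Å cutoff come from the text.

```python
import math

CUTOFF_ANGSTROM = 10.0  # sphere radius stated in the text


def ca_distance(a, b):
    """Euclidean distance between two (x, y, z) C-alpha coordinates."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    return math.sqrt(dx * dx + dy * dy + dz * dz)


def residue_sphere(center_index, ca_coords, cutoff=CUTOFF_ANGSTROM):
    """Return (index, distance) pairs of peripheral residues within the cutoff.

    The central residue itself is excluded; sorting by distance is an
    assumption of this sketch.
    """
    center = ca_coords[center_index]
    neighbors = []
    for j, coord in enumerate(ca_coords):
        if j == center_index:
            continue
        d = ca_distance(center, coord)
        if d <= cutoff:
            neighbors.append((j, d))
    return sorted(neighbors, key=lambda pair: pair[1])


# Toy coordinates (hypothetical, not taken from a real PDB entry).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
sphere = residue_sphere(0, coords)
```

With these toy coordinates, the residue placed 12 Å away falls outside the sphere of residue 0, while the other two are kept as peripheral residues.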
Equation 2 was used to calculate the RSA, where: RSA is the relative surface area to solvent, ASA the accessible surface area and MaxASA the maximum accessible surface area.
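As a small worked illustration of the ratio defined above, the sketch below computes RSA from an ASA value and a MaxASA value, and sums individual residue areas for a sphere. The numeric values are illustrative placeholders, not MaxASA values from the cited scale, and the function names are hypothetical. In the article the areas themselves are computed with FreeSASA.

```python
def relative_surface_area(asa, max_asa):
    """RSA = ASA / MaxASA, both given in the same units (e.g. square angstroms)."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa


def sphere_asa(residue_areas):
    """Total accessible area of a sphere as the sum of its residue areas."""
    return sum(residue_areas)


# Illustrative numbers only (assumed for the example).
rsa = relative_surface_area(45.0, 180.0)
total_area = sphere_asa([45.0, 12.5, 80.0])
```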
After selecting the data and defining the residue spheres, as described above, the next step consisted of identifying the individual residue spheres, following the Euclidean distance criterion, using the coordinates obtained from the PDB database for each residue. The identification of the spheres takes into account all the residues present in the spherical neighborhood of the residue in question, regardless of which protein chain these neighbors belong to. If the selection meets the established criteria, the relevant physicochemical characteristics, such as distance, solvent RSA, hydrophobicity, charge, size, and propensity to aggregate, are calculated and considered as descriptors.
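The Cα coordinates used for the spheres can be read directly from the fixed-column ATOM records of a PDB file. The sketch below is one possible way to do this with the Python standard library; it is not the authors' parser. The helper `format_atom_line` exists only to build demonstration records, and stopping at the first ENDMDL record (so that only the first model of a multi-model file is used) is an assumption of this example. Atoms from all chains are kept, in line with the chain-independent neighborhood described above.

```python
def parse_ca_atoms(pdb_text):
    """Collect C-alpha atoms from ATOM records of the first model, all chains."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model is used
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append({
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


def format_atom_line(serial, name, res, chain, seq, x, y, z):
    """Hypothetical helper that writes one ATOM record in PDB column layout."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Demonstration records with made-up coordinates.
demo = "\n".join([
    format_atom_line(1, " N  ", "MET", "A", 1, 1.0, 2.0, 3.0),
    format_atom_line(2, " CA ", "MET", "A", 1, 1.5, 2.5, 3.5),
    format_atom_line(3, " CA ", "ASP", "B", 7, -4.25, 0.0, 10.125),
    "ENDMDL",
    format_atom_line(4, " CA ", "GLY", "A", 1, 9.0, 9.0, 9.0),
])
ca_atoms = parse_ca_atoms(demo)
```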
Definition of descriptors and set up data training
The choice of suitable descriptors is key to performance in the machine learning approach. Descriptors are the components that represent each instance in the data set. The available descriptors must be evaluated in different aspects, and combined, to find the best set for a good machine learning model.48
Definition of descriptors
Combinations of several descriptors were tested in order to select an optimal set that maximizes performance. We tested combinations involving the following characteristics for each residue of the sphere: (i) distance, normalized to the range from 0 to 10; (ii) relative surface area, normalized as the percentage contribution of the residue area to the sphere; (iii) hydrophobicity, classified as hydrophobic, neutral, or polar; (iv) size, classified in three qualitative classes: small, medium, and large; and (v) charge, defined as negative, neutral, or positive.49
Dataset structure (codifying)
Once the residues that make up the sphere of the central residue had been determined, we prepared a machine learning file representing the individual spheres. At this point, we determined the classes that represent the aggregation problem and the descriptors or attributes that represent these classes. As mentioned before, each protein amino acid residue is considered a central residue with its descriptors and its peripheral residues (ranging from 1 to 50). This assessment occurs when the spheres are evaluated. The feature groups are labeled with the classes “0” = no aggregate and “1” = aggregate (see Figure 3). Each sphere is represented by the group of its residues together with their descriptors. All information is represented by classes (aggregate and non-aggregate).
Structure of training data set with 50 peripheral residues each one with its group of descriptors (residue code, distance, hydrophobicity, RSA, aggregation propensity, size, and charge).
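One way to picture this layout is as a fixed-length numeric row per sphere. The sketch below is an assumption about how such a row could be flattened, not the file format actually used by Magre-II: the descriptor order follows the figure caption, while the function name `encode_sphere`, the zero padding for spheres with fewer than 50 peripheral residues, and the numeric coding of the descriptors are hypothetical.

```python
MAX_PERIPHERAL = 50  # upper bound on peripheral residues stated in the text
DESCRIPTORS = (
    "residue_code", "distance", "hydrophobicity", "rsa",
    "aggregation_propensity", "size", "charge",
)


def encode_sphere(central_code, peripheral, label=None, pad_value=0.0):
    """Flatten one sphere into a fixed-length row.

    `peripheral` is a list of dicts keyed by DESCRIPTORS. Missing slots are
    padded (assumption), and extra residues beyond MAX_PERIPHERAL are dropped.
    """
    row = [float(central_code)]
    for i in range(MAX_PERIPHERAL):
        if i < len(peripheral):
            row.extend(float(peripheral[i][name]) for name in DESCRIPTORS)
        else:
            row.extend([pad_value] * len(DESCRIPTORS))
    if label is not None:
        row.append(float(label))  # class: 0 = no aggregate, 1 = aggregate
    return row


# Hypothetical sphere with two peripheral residues and made-up codes.
neighbors = [
    {"residue_code": 5, "distance": 3.8, "hydrophobicity": 1, "rsa": 0.25,
     "aggregation_propensity": 0.4, "size": 2, "charge": 0},
    {"residue_code": 12, "distance": 6.1, "hydrophobicity": 0, "rsa": 0.60,
     "aggregation_propensity": -0.2, "size": 1, "charge": -1},
]
row = encode_sphere(central_code=9, peripheral=neighbors, label=1)
```

A fixed row length keeps every sphere compatible with tabular learners, at the cost of padding for sparsely populated spheres.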
Descriptors refining
In this step, we investigated and selected the set of descriptors. Fifty attributes were used, representing the central and peripheral residues of the spheres analyzed. We then adopted the analysis technique known as Forward Feature Selection (FFS),50 which evaluates attributes sequentially. For performance evaluation, we used the area under the curve (AUC) metric for each test (Table S1).
Training
The training data set was composed of 9,846 instances representing the spheres of residues. It was used with the Gradient Boosting Regressor algorithm, which provides a prediction probability that is useful for evaluating prediction performance by checking the aggregation probability. The model obtained from this training is then used in validation tests (see Figure 4). We used the Scikit-learn implementation of the Gradient Boosting regression algorithm with default parameters.
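The training step can be sketched with Scikit-learn's `GradientBoostingRegressor`. The data below are synthetic stand-ins for the encoded spheres, and several choices are assumptions of this example rather than details given in the article: the fixed `random_state`, clipping the regression output to the interval from 0 to 1 so that it reads as a probability-like score, the 0.5 decision threshold, and the use of `pickle` to store the model.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 200 spheres, 21 attributes, 0/1 class labels.
rng = np.random.default_rng(0)
X = rng.random((200, 21))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# Default parameters, apart from a fixed seed for reproducibility.
model = GradientBoostingRegressor(random_state=0)
model.fit(X, y)

# Regression output interpreted as an aggregation score (assumption: clipped).
scores = np.clip(model.predict(X), 0.0, 1.0)
predicted_class = (scores >= 0.5).astype(int)  # threshold is an assumption

# Round trip through pickle, one common way to persist a fitted model.
restored = pickle.loads(pickle.dumps(model))
```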
The training file with 50 attributes or descriptors was analyzed using an attribute selection evaluation technique called Forward Feature Selection.50 This analysis aimed to identify the required number of attributes to be used for training. The metric used for this evaluation was AUC. By applying this technique, we reduced the number of attributes to 21, where the first represents the central residue and the others represent the characteristics of the peripheral residues. After choosing the training file with the classes and number of attributes, we moved on to the actual training. This stage used resources from the Scikit-Learn library.
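Forward feature selection scored by AUC can be sketched with Scikit-learn's `SequentialFeatureSelector`. This is an illustration under stated assumptions, not the procedure used to produce Table S1: the data are synthetic, a `GradientBoostingClassifier` is substituted for the regressor so that the built-in `roc_auc` scorer can use class probabilities, and the small sizes (6 candidate attributes, 2 selected, 3-fold cross-validation) are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data in which only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.random((150, 6))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

estimator = GradientBoostingClassifier(n_estimators=20, random_state=0)
selector = SequentialFeatureSelector(
    estimator,
    n_features_to_select=2,  # the article reports keeping 21 of 50 attributes
    direction="forward",     # add one attribute at a time
    scoring="roc_auc",       # AUC as the selection criterion
    cv=3,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())  # indices of kept attributes
X_reduced = selector.transform(X)
```

At each round the selector adds the attribute whose inclusion gives the best cross-validated AUC, which mirrors the sequential evaluation described in the text.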
Testing and validation
In this step, we carried out validation tests based on the confusion matrix. Each prediction was counted as a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN). Based on this information, the evaluation metrics were computed; the area under the curve (AUC) and the area under the precision-recall curve (AUPRC) were used to validate the class balancing test and the refinement of the descriptors.52 The concepts mentioned above are shown in equations 3, 4, 5, and 6.
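For reference, the usual confusion-matrix metrics can be computed as in the sketch below. Because equations 3 to 6 are not reproduced in this text, the standard definitions of precision, recall, accuracy, and F1-score are assumed here, and the counts in the example are hypothetical rather than taken from the results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for illustration.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
```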
Generate model
The prediction model was generated by considering all residue spheres of the extracted proteins except those reserved for the validation test.
Results and Discussion
Magre-II was trained with a database prepared considering a 3D approach. Three structures were selected from proteins that have been associated with important diseases. We first present the results of feature selection and prediction model training.
Data preprocessing and training
Training
We selected PDB entries for fibril proteins from the PDB database based on AmyPro and Amyloid information. We also considered 6 proteins that are not reported to aggregate in the literature. For each of them, residue spheres were calculated, yielding the following numbers of instances: 41,342 instances (class 0 = no aggregate) and 11,303 instances (class 1 = aggregate) (see Figure 5). To generate the model based on the AdaBoost Regressor algorithm, we considered this proportion.
Distribution of classes: 41,342 spheres classified as non-aggregating and 11,303 spheres classified as aggregating.
The generated training file contains the residue spheres of all the proteins that were selected for evaluation. In this situation, there may be a significant class imbalance, i.e., the aggregate and non-aggregate classes may not be properly represented. Looking at each protein in question, we found a higher incidence of residues that are not prone to aggregation. Since the training file contains all the spheres with no indication of the protein structure from which each was generated, it was important to evaluate a proportion of classes that generalizes these proportions. There are class balancing techniques that seek an optimal proportion. In our tests, an exaggerated reduction in the majority class could lead to distorted training and affect prediction, i.e., too many positive hits. The opposite can happen if the proportion between the classes is too high, which can lead to many errors. We then applied the following undersampling techniques: RandomUnderSampler,53 NearMiss,54 TomekLinks,55 EditedNearestNeighbours, and AllKNN.56
Balance data set
When analyzing the distribution of classes, a significant imbalance was observed, with 41,342 residue spheres belonging to class 0 (does not aggregate) and 11,303 residue spheres belonging to class 1 (aggregates). To assess whether this proportion needed to be corrected, and which proportion would be the most appropriate for training the model, some balancing techniques were applied. We adopted the original training file already adjusted with 50 attributes and applied undersampling balancing techniques. The intention here was to avoid the creation of artificial or repetitive instances, which could occur with oversampling techniques. We applied the RandomUnderSampler, NearMiss, TomekLinks, EditedNearestNeighbours, and AllKNN algorithms. In all tests, the cross-validation technique was applied to obtain the average AUC metric and we analyzed the results (Table S2). The AUPRC metric scores were: original data = 47.91%, RandomUnderSampler = 73.96%, NearMiss = 80.45%, AllKNN = 81.55%, TomekLinks = 50.55%, and EditedNearestNeighbours = 62.97%. Observing that the data adjusted with the AllKNN algorithm presented the best score, we decided not to make any reduction in the training file.
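The simplest of these techniques, random undersampling, can be illustrated with the Python standard library alone. The sketch below is a plain stand-in written for this example, not the library implementation used in the tests: the function name `random_undersample`, the `ratio` parameter (majority size relative to minority size after sampling), and the toy data are assumptions.

```python
import random
from collections import Counter


def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Keep every minority instance and a random subset of the majority class."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    n_minority = min(counts.values())
    n_keep = min(counts[majority], int(round(ratio * n_minority)))

    rng = random.Random(seed)
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority]
    kept = set(rng.sample(majority_idx, n_keep))

    # Preserve the original order of the retained instances.
    keep_idx = [i for i, lab in enumerate(labels) if lab != majority or i in kept]
    return [rows[i] for i in keep_idx], [labels[i] for i in keep_idx]


# Toy data: 80 non-aggregating spheres (0) and 20 aggregating spheres (1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 80 + [1] * 20
new_rows, new_labels = random_undersample(rows, labels, ratio=1.0)
```

Only existing majority instances are discarded, so no artificial instances are created, which is the property the text contrasts with oversampling.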
Testing
After training the model and storing it in .sav format, we started testing the model that had already been generated, which consisted of evaluating the functionality of the model by applying proteins that were not used in the training, but which had information on their propensity to aggregate. To this end, a program was developed in Magre-II, using the Python programming language on the Anaconda platform.57 This program was prepared to receive the protein structure (PDB) as input and build the residue spheres, following the same format used during training, but without the class defined. The spheres of the indicated structure are then submitted to the prediction model generated, as described above. This gives the probability of aggregation for each residue of the indicated protein. After submitting all the residues in the protein sequence to the prediction model, the program generates a file containing the individual probabilities and graphs representing the percentage probability of aggregation at the position of the residues in the protein sequence. To analyze the performance of the model, we used the F1-Score metric because, as the classes are unbalanced, the other metrics (accuracy and recall) do not contribute to the performance analysis.
In this step, we conducted a cross-validation test to evaluate and adjust the validation parameters and assess the model using the AUPRC criterion. This procedure is crucial as it allows for verifying the linearity of the tests and considering all the training information for the model. The classes defined for evaluation were as follows: 0 = non-aggregating and 1 = aggregating. We employed the original data as described in the previous section and trained the model using 10-fold cross-validation (k = 10). The results revealed an average area under the precision-recall curve (AUPRC) of 77.83% with a standard deviation of 0.0346. These values indicate linearity and satisfactory balance in the conducted tests, as the standard deviation shows minimal variation around the mean for k = 10. Based on these results, we decided to generate the prediction model using the aforementioned class ratio.
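A 10-fold cross-validation scored by the area under the precision-recall curve can be sketched as follows. The example runs on synthetic data and makes several assumptions that are not stated in the article: stratified, shuffled folds; Scikit-learn's `average_precision` scorer as the AUPRC estimate; and a `GradientBoostingClassifier` with 30 trees so that the scorer can use class probabilities and the example stays fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + 0.3 * rng.standard_normal(300) > 0.7).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # k = 10
clf = GradientBoostingClassifier(n_estimators=30, random_state=0)

# One average-precision value per fold.
fold_scores = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
mean_auprc = fold_scores.mean()
std_auprc = fold_scores.std()
```

Reporting the mean together with the standard deviation across folds, as done in the text, shows how stable the score is across different partitions of the data.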
Validation
For validation, 4 proteins associated with diseases were selected, each considered in its soluble and fibrillar forms (Table S3). These proteins were not included in the testing phase. For each protein, we took one soluble structure conformation and one fibril structure conformation. We compared the performance under 2 criteria: soluble protein vs. no expected aggregation, and fibril protein vs. the AmyPro aggregation annotation. An additional experimental test was done with the prion protein, investigating the aggregation propensity of in silico generated structures.58
Alpha-synuclein
Human alpha-synuclein is a 140 amino acid residue protein that regulates synaptic vesicle trafficking and subsequent neurotransmitter release.59 The region called the non-amyloid-β component (NAC) of Alzheimer’s disease is found in an amyloid-enriched fraction.59 The incorrect folding of the alpha-synuclein protein has been implicated in the molecular chain of events that leads to Parkinson’s disease. The graph (Figure 6) shows the ability of Magre-II to predict regions of aggregation. Out of the 61 residues expected to exhibit aggregation propensity, Magre-II accurately identified 40 residues as True Positives (TP). Additionally, Magre-II classified 66 residues as True Negatives (TN) out of the 79 expected non-aggregation residues. However, there were 13 False Positives (FP), where Magre-II incorrectly indicated a propensity for aggregation, and 21 False Negatives (FN), indicating instances where Magre-II failed to identify actual aggregation propensity.
Graph representing the prediction of regions prone to aggregation of alpha-synuclein protein (PDB ID: 1XQ8) by Magre-II.
Table 2 and Figure 7 present the comparative analysis of Magre-II, Aggrescan3d, Waltz, and Tango predictors for the alpha-synuclein structure. The F1-Score metric was employed to evaluate the performance, considering the balance between precision and recall. The obtained scores were as follows: Magre-II (78.2%), Aggrescan3d (76.7%), Waltz (67.4%), and Tango (73.1%). These results indicate that Magre-II exhibited the highest performance among the predictors, with an average score of 78.2% for the 1XQ8 protein structure. The superior performance of Magre-II in predicting aggregation propensity for the alpha-synuclein 1XQ8 structure highlights the effectiveness of this algorithm. The higher F1-Score achieved by Magre-II indicates a better balance between correctly identifying aggregation-prone regions (precision) and capturing the true aggregation instances (recall). The results suggest that Magre-II can be a valuable tool for predicting aggregation propensity and understanding the underlying mechanisms in the context of the alpha-synuclein protein.
Comparison of Magre-II with different predictors - 1XQ8 structure. Blue (as background): region from AmyPro, black: prediction from Magre-II, orange: prediction from Aggrescan3d, red: prediction from Waltz, gray: prediction from Tango.
Comparison between predictors: Magre-II, Aggrescan3d, Waltz, and Tango of the 1XQ8 structure
Microglobulin
Among toxins related to kidney function, beta2-microglobulin (β2-M) is certainly one of the most studied. Its serum level increases with the progression of chronic kidney disease, reaching very high concentrations in patients with mainly terminal kidney disease.60 This is the main protein component of dialysis-related amyloidosis, a dramatic complication that results from high extracellular concentration and post-translational modification of β2-M and several other promoters of amyloid fibril formation and deposition in osteoarticular tissues. The structure selected for this protein was the PDB ID 2D4F61 with 98 amino acid residues.
Figure 8 shows the ability of Magre-II to predict regions prone to microglobulin protein aggregation (PDB ID: 2D4F). Out of the 50 residues expected to exhibit aggregation propensity, Magre-II accurately identified 45 residues as True Positives (TP). Additionally, Magre-II classified 41 residues as True Negatives (TN) out of the 47 expected non-aggregation residues. However, there were 5 False Positives (FP), where Magre-II incorrectly indicated a propensity for aggregation, and 6 False Negatives (FN), indicating instances where Magre-II failed to identify actual aggregation propensity.
Graph representing the prediction of regions prone to aggregation of microglobulin protein (PDB ID: 2D4F) by Magre-II.
Table 3 and Figure 9 present the comparative analysis of Magre-II, Aggrescan3d, Waltz, and Tango predictors for the 2D4F structure. The F1-Score metric was employed to evaluate the performance, considering the balance between precision and recall. The obtained scores were as follows: Magre-II (82.54%), Aggrescan3d (71.05%), Waltz (70.20%), and Tango (74.00%). These results indicate that Magre-II exhibited the highest performance among the predictors, with an average score of 82.54% for the 2D4F protein structure. The superior performance of Magre-II in predicting aggregation propensity for the 2D4F structure highlights the effectiveness of this algorithm. The higher F1-Score achieved by Magre-II indicates a better balance between correctly identifying aggregation-prone regions (precision) and capturing the true aggregation instances (recall). The results suggest that Magre-II can be a valuable tool for predicting aggregation propensity and understanding the underlying mechanisms in the context of the beta2-microglobulin protein.
Comparison of Magre-II with different predictors - 2D4F structure. Blue (as background): region from AmyPro, black: prediction from Magre-II, orange: prediction from Aggrescan3d, red: prediction from Waltz, gray: prediction from Tango.
Comparison between predictors: Magre-II, Aggrescan3d, Waltz, and Tango of the 2D4F structure
Beta-amyloid
Amyloid-β (Aβ) is present in humans as a metabolic product of 39 to 42 amino acid residues of the amyloid precursor protein. Despite differing by only two residues, the two predominant forms, Aβ (1-40) and Aβ (1-42), display distinct biophysical, biological, and clinical characteristics. Aβ (1-42) is the most neurotoxic species; it aggregates much more quickly and dominates the senile plaques of patients with Alzheimer’s disease (AD). The structure selected was PDB code 2NAO, chain A, with 42 residues,62 indicated throughout the text as 2NAO/A.
Figure 10 displays the ability of Magre-II to predict regions of aggregation. Out of the 32 residues expected to exhibit aggregation propensity, Magre-II accurately identified 16 residues as True Positives (TP). Additionally, Magre-II classified 8 residues as True Negatives (TN) out of the 10 expected non-aggregation residues. However, there were 2 False Positives (FP), where Magre-II incorrectly indicated a propensity for aggregation, and 16 False Negatives (FN), indicating instances where Magre-II failed to identify actual aggregation propensity.
Graph representing the prediction of regions prone to aggregation of beta-amyloid protein (PDB ID: 2NAO) by Magre-II.
Figure 11 and Table 4 present the comparative analysis of Magre-II, Aggrescan3d, Waltz, and Tango predictors for the 2NAO structure. The F1-Score metric was employed to evaluate the performance, considering the balance between precision and recall. The obtained scores were as follows: Magre-II (78.05%), Aggrescan3d (73.68%), Waltz (73.04%), and Tango (68.04%), according to Table 4. These results indicate that Magre-II exhibited the highest performance among the predictors, with an average score of 78.05% for the 2NAO protein structure. The superior performance of Magre-II in predicting aggregation propensity for the 2NAO structure highlights the effectiveness of this algorithm. The higher F1-Score achieved by Magre-II indicates a better balance between correctly identifying aggregation-prone regions (precision) and capturing the true aggregation instances (recall).
Comparison of Magre-II with different predictors - 2NAO structure. Blue (as background): region from AmyPro, black: prediction from Magre-II, orange: prediction from Aggrescan3d, red: prediction from Waltz, gray: prediction from Tango.
Comparison between predictors: Magre-II, Aggrescan3d, Waltz, and Tango of the 2NAO structure
Prion
Prions have been studied because they represent a new class of infectious agents in which the PrPSc form appears to be the component of the infectious particle56 and can cause diseases such as transmissible encephalopathies, which affect humans and animals. In an additional test, we evaluated Magre-II using prion protein structures generated by in silico techniques in previous work by our group.56 The prion protein may be found in two different conformations: the natural cellular form (PrPC) and the scrapie form (PrPSc). The second conformer is an infectious form that tends to aggregate under specific conditions.56 The two forms differ widely in secondary and tertiary structure. That work studied the prion protein conformational change using hybrid simulation methods and found transition states with different free energy levels and with 4, 6, 8, 10, 12, 16, 17, and 20 residues in β-strand form. We selected 4 structures representing distinct numbers of residues in the β-strand (4, 12, 16, 20). The results presented in Figure 12 are in agreement with the literature.56 The major regions indicated by Magre-II as prone to aggregation correspond to the regions where the residues undergo the described changes in secondary structure.
Determination of prion residues for the structure with 4, 12, 16 and 20 residues in β-sheets (salmon).
Conclusions
Computational methods for predicting protein aggregation propensity have been extensively explored in the literature, aiming to understand the underlying causes of protein misfolding and its association with neurodegenerative diseases. These methods encompass various approaches, such as analyzing primary structure, secondary structure characteristics, and tertiary structure aspects, employing analytical techniques or machine learning algorithms. Key physicochemical properties, including hydrophobicity, charge, and residue size, are considered in these predictors.
This work developed a protein aggregation prediction methodology and tools based on machine learning techniques and three-dimensional protein features, incorporating relevant physicochemical characteristics in the protein’s conformation to its native structure. Various existing predictors and methodologies described in the literature were analyzed.
(i) Magre-II introduces an individual sphere analysis methodology considering the characteristics of neighboring residues.
(ii) Magre-II allows for the prediction of aggregation-prone regions with performance similar to or better than the leading predictors found in the literature, as demonstrated by the conducted tests.
(iii) Magre-II enables retraining with additional proteins identified as amyloid, which further enhances the prediction model. This flexibility to update the model with new data contributes to its continued robustness and relevance.
(iv) Magre-II exhibits low computational cost, making it efficient and practical for use in various computational environments.
(v) Magre-II has shown efficiency in comparison with other predictors such as Aggrescan3D, Waltz, and Tango.
Supplementary Information
Supplementary information (data on the analysis of descriptors, performance tests of balancing algorithms, and the selection of proteins chosen for validation, which are related to neurodegenerative diseases) is available free of charge at https://jbcs.sbq.org.br/ as a PDF file.
Author Contributions
ALS and CAM conceived this research, designed the experiments, and conducted them. All authors participated in the interpretation of the data from the analyses, wrote the article, contributed to its revisions, reviewed the manuscript, and approved the final manuscript.
Data Availability Statement
All data are available in the text.
Acknowledgments
The authors would like to thank FAPESP, CNPq, and CAPES for funding.
References
1 Chiti, F.; Dobson, C. M.; Annu. Rev. Biochem. 2006, 75, 333.
2 Soto, C.; Nat. Rev. Neurosci. 2003, 4, 49.
3 Knowles, T. P. J.; Vendruscolo, M.; Dobson, C. M.; Nat. Rev. Mol. Cell Biol. 2014, 15, 384.
4 Vitalis, A.; Pappu, R. V.; Annu. Rep. Comput. Chem. 2009, 5, 49.
5 Buchete, N. V.; Hummer, G.; Phys. Rev. E 2008, 77, 030902.
6 Tartaglia, G. G.; Vendruscolo, M.; Chem. Soc. Rev. 2008, 37, 1395.
7 Conchillo-Solé, O.; de Groot, N. S.; Avilés, F. X.; Vendrell, J.; Daura, X.; Ventura, S.; BMC Bioinf. 2007, 8, 65.
8 Lodish, H.; Berk, A.; Matsudaira, P.; Kaiser, C. A.; Krieger, M.; Scott, M. P.; Bretscher, A.; Pincus, D.; Amon, A.; Molecular Cell Biology, 5th ed.; W.H. Freeman and Co.: New York, 2005.
9 Verma, M.; Vats, A.; Taneja, V.; Ann. Indian Acad. Neurol. 2015, 18, 138.
10 Chiti, F.; Dobson, C. M.; Annu. Rev. Biochem. 2017, 86, 27.
11 Navarro, S.; Ventura, S.; Curr. Opin. Struct. Biol. 2022, 73, 102343.
12 Louros, N.; Konstantoulea, K.; De Vleeschouwer, M.; Ramakers, M.; Schymkowitz, J.; Rousseau, F.; Nucleic Acids Res. 2020, 48, D389.
13 Zibaee, S.; Makin, O. S.; Goedert, M.; Serpell, L. C.; Protein Sci. 2007, 16, 906.
14 Tartaglia, G. G.; Cavalli, A.; Pellarin, R.; Caflisch, A.; Protein Sci. 2005, 14, 2723.
15 Fernandez-Escamilla, A.-M.; Rousseau, F.; Schymkowitz, J.; Serrano, L.; Nat. Biotechnol. 2004, 22, 1302.
16 Garbuzynskiy, S. O.; Lobanov, M. Y.; Galzitskaya, O. V.; Bioinformatics 2010, 26, 326.
17 Trovato, A.; Seno, F.; Tosatto, S. C. E.; Protein Eng., Des. Sel. 2007, 20, 521.
18 Hamodrakas, S. J.; Liappa, C.; Iconomidou, V. A.; Int. J. Biol. Macromol. 2007, 41, 295.
19 Ahmed, A. B.; Znassi, N.; Château, M. T.; Kajava, A. V.; Alzheimer’s Dementia 2015, 11, 681.
20 Bryan, A. W.; Menke, M.; Cowen, L. J.; Lindquist, S. L.; Berger, B.; PLoS Comput. Biol. 2009, 5, e1000333.
21 O’Donnell, C. W.; Waldispühl, J.; Lis, M.; Halfmann, R.; Devadas, S.; Lindquist, S.; Berger, B.; Bioinformatics 2011, 27, 1753.
22 Bryan, A. W.; O’Donnell, C. W.; Menke, M.; Cowen, L. J.; Lindquist, S.; Berger, B.; Proteins: Struct., Funct., Bioinf. 2012, 80, 1389.
23 Thangakani, A. M.; Kumar, S.; Nagarajan, R.; Velmurugan, D.; Gromiha, M. M.; J. Theor. Biol. 2014, 358, 90.
24 Prabakaran, R.; Rawat, P.; Kumar, S.; Michael Gromiha, M.; J. Mol. Biol. 2021, 433, 167097.
25 Jahn, T. R.; Radford, S. E.; Arch. Biochem. Biophys. 2008, 469, 100.
26 Kim, C.; Choi, J.; Lee, S. J.; Welsh, W. J.; Yoon, S.; Nucleic Acids Res. 2009, 37, W469.
27 Gasior, P.; Kotulska, M.; BMC Bioinf. 2014, 15, 54.
28 Niu, M.; Li, Y.; Wang, C.; Han, K.; Int. J. Mol. Sci. 2018, 19, 2071.
29 Keresztes, L.; Szögi, E.; Varga, B.; Farkas, V.; Perczel, A.; Grolmusz, V.; Biomolecules 2021, 11, 500.
30 Louros, N.; Orlando, G.; De Vleeschouwer, M.; Rousseau, F.; Schymkowitz, J.; Nat. Commun. 2020, 11, 3314.
31 Orlando, G.; Silva, A.; Macedo-Ribeiro, S.; Raimondi, D.; Vranken, W.; Bioinformatics 2020, 36, 2142.
32 Li, Y.; Zhang, Z.; Teng, Z.; Liu, X.; Comput. Math. Methods Med. 2020, 8845133.
33 Liaw, C.; Tung, C. W.; Ho, S. Y.; PLoS One 2013, 8, e53235.
34 Tian, J.; Wu, N.; Guo, J.; Fan, Y.; BMC Bioinf. 2009, 10, S45.
35 Burdukiewicz, M.; Sobczyk, P.; Rödiger, S.; Duda-Madej, A.; Mackiewicz, P.; Kotulska, M.; Sci. Rep. 2017, 7, 12961.
36 Tsolis, A. C.; Papandreou, N. C.; Iconomidou, V. A.; Hamodrakas, S. J.; PLoS One 2013, 8, e54175.
37 Emily, M.; Talvas, A.; Delamarche, C.; PLoS One 2013, 8, e79722.
38 Shahbazi Dastjerdeh, M.; Shokrgozar, M. A.; Rahimi, H.; Golkar, M.; J. Biomol. Struct. Dyn. 2022, 40, 5384.
39 Sankar, K.; Krystek, S. R.; Carl, S. M.; Day, T.; Maier, J. K. X.; Proteins 2018, 86, 1147.
40 Chennamsetty, N.; Voynov, V.; Kayser, V.; Helk, B.; Trout, B. L.; Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 11937.
41 Kuriata, A.; Iglesias, V.; Pujols, J.; Kurcinski, M.; Kmiecik, S.; Ventura, S.; Nucleic Acids Res. 2019, 47, W300.
42 Sormanni, P.; Aprile, F. A.; Vendruscolo, M.; J. Mol. Biol. 2015, 427, 478.
43 Moreira, C. A.; Philot, E. A.; Lima, A. N.; Scott, A. L.; Appl. Math. Comput. 2019, 359, 502.
44 Varadi, M.; De Baets, G.; Vranken, W. F.; Tompa, P.; Pancsa, R.; Nucleic Acids Res. 2018, 46, D387.
45 Mitternacht, S.; F1000Research 2016, 5, 189.
46 Seko, A.; Togo, A.; Tanaka, I. In Nanoinformatics; Tanaka, I., ed.; Springer: Singapore, 2018, p. 3.
47 Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M.; Nucleic Acids Res. 2008, 36, D202.
48 Fang, Y.; Gao, S.; Tai, D.; Middaugh, C. R.; Fang, J.; BMC Bioinf. 2013, 14, 314.
49 Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, É.; J. Mach. Learn. Res. 2011, 12, 2825, accessed in September 2025.
50 Hand, D.; J. Mach. Learn. 2009, 77, 103.
51 Alencar, R.; https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets, accessed in September 2025.
52 Beckmann, M.; Ebecken, N. F. F.; Pires de Lima, B. S. L.; J. Intell. Learn. Syst. Appl. 2015, 7, 104.
53 Elhassan, A. T.; Aljourf, M.; Al-Mohanna, F.; Shoukri, M.; Global J. Technol. Optim. 2016, 7, S1.
54 He, H.; Ma, Y.; Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley, 2013.
55 Anaconda, version 2-2.4.0; Anaconda, Inc.: Austin, TX, USA, 2015.
56 Lima, A. N.; de Oliveira, R. J.; Braz, A. S. K.; de Souza Costa, M. G.; Perahia, D.; Scott, L. P. B.; Eur. Biophys. J. 2018, 47, 583.
57 Ulmer, T. S.; Bax, A.; Cole, N. B.; Nussbaum, R. L.; J. Biol. Chem. 2005, 280, 9595.
58 Vieira, W. P.; Gomes, K. W. P.; Frota, N. B.; Andrade, J. E. C. B.; Vieira, R. M. R. A.; Moura, F. E. A.; Vieira, F. J. F.; Rev. Bras. Reumatol. 2005, 45, 295.
59 Kihara, M.; Chatani, E.; Iwata, K.; Yamamoto, K.; Matsuura, T.; Nakagawa, A.; Naiki, H.; Goto, Y.; J. Biol. Chem. 2006, 281, 31061.
60 Wälti, M. A.; Ravotti, F.; Arai, H.; Glabe, C. G.; Wall, J. S.; Böckmann, A.; Güntert, P.; Meier, B. H.; Riek, R.; Proc. Natl. Acad. Sci. U. S. A. 2016, 113, E4976.
61 Lee, J.; Kim, S. Y.; Hwang, K. J.; Ju, Y. R.; Woo, H.-J.; Osong Public Health Res. Perspect. 2013, 4, 57.
Edited by
Editor handled this article: Paulo Augusto Netz (Associate)
Publication Dates
Publication in this collection: 17 Nov 2025
Date of issue: 2025
History
Received: 12 May 2025
Accepted: 03 Oct 2025