Comparison of machine learning techniques to predict the compressive strength of concrete and considerations on model generalization

Abstract: The compressive strength of concrete is an essential property to ensure the safety of a concrete structure. However, estimating this value is usually a laborious and uncertain process since the mix design is based on empirical methods and its confirmation in the laboratory demands time and resources. In this context, this work aims to evaluate Machine Learning (ML) models to predict the compressive strength of concrete from its constituents. For this purpose, a dataset from the literature was used as input to four ML models: Extreme Gradient Boosting (XGBoost), Support Vector Regression (SVR), Artificial Neural Networks (ANN) and Gaussian Process Regression (GPR). The accuracy of the models was evaluated through 10-fold cross-validation and quantified by the R², Mean Absolute Error (MAE), and Root-Mean-Square Error (RMSE) metrics. Subsequently, a new dataset was put together with mixtures from the literature and used to validate the previous models. In the model creation step, all algorithms obtained similar and positive results, with MAE between 1.96 and 2.26 MPa and R² varying from 0.79 to 0.83. However, in the validation step, the accuracy of the models dropped sharply, with MAE growing to 3.04-4.04 MPa and R² decreasing to 0.37-0.59. ANN and GPR showed the best results, while SVR had the worst predictions. This work showed that ML tools are promising techniques to predict the compressive strength of concrete. However, care must be taken with the input data to guarantee that models are not overfitted to a given region, set of materials, or type of concrete.


INTRODUCTION
The compressive strength of concrete is one of its most important properties. This feature directly impacts the structural design and is related to the cost, safety, and stability of a concrete structure. This strength is usually expressed in MPa and is traditionally obtained from the rupture of cylindrical or cubic specimens in a hydraulic press, a procedure standardized worldwide [1], [2]. Due to the evolution of cement hydration over time, engineers stipulate that this target strength is reached after 28 days of curing in most conventional projects.
As structural projects demand a given compressive strength, the engineers responsible for the construction sites need to establish an optimized proportion among the constituents of the concrete to guarantee the safety of the building. This is done using mix design methods, such as the ones developed by the American Concrete Institute (ACI), the Brazilian Association of Portland Cement (ABCP) and the Brazilian Technological Research Institute (IPT). These methods seek to achieve an average target value (above the minimum) so that the minimum value is met with a safety margin [3]. This average value is obtained statistically, as it is possible for a concrete specimen to obtain a lower strength than specified, given the heterogeneous nature of its components and mixing procedure [1]. Therefore, in practice, the economically viable target strength is defined as the value to be exceeded by a certain proportion of all results (usually 95% when a single test is considered, or 99% when an average of 3 or 4 tests is taken) [1].
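The relation between the specified strength and the average target value can be illustrated with a short calculation. The sketch below assumes that strength results follow a normal distribution, which is a simplifying assumption and not a statement of any particular mix design method; the function name and the example values (30 MPa specified strength, 4 MPa standard deviation) are hypothetical.

```python
from statistics import NormalDist


def target_mean_strength(specified_mpa, std_dev_mpa, exceed_fraction=0.95):
    """Mean strength needed so that `exceed_fraction` of the results exceed
    the specified value, assuming normally distributed strengths."""
    # One-sided quantile of the standard normal (about 1.645 for 95%).
    z = NormalDist().inv_cdf(exceed_fraction)
    return specified_mpa + z * std_dev_mpa


# Hypothetical example: 30 MPa specified, 4 MPa standard deviation.
mean_needed = target_mean_strength(30.0, 4.0)
print(round(mean_needed, 2))
```

Under this assumption, a larger standard deviation (poorer quality control) pushes the required average strength further above the specified minimum.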
These well-established methods are still performed through charts and empirical formulae [1], [4]. Additionally, they are only valid for conventional concrete. For other types of concrete, such as high-strength, self-compacting, lightweight, and recycled concretes, the scenario is even more uncertain, with scarce and divergent mix design techniques [5], [6].
Like strength evaluation, other concrete-related areas deal with empirical processes and time-consuming tests. To improve these processes, or at least reduce the need for experimental tests, several studies of Machine Learning (ML) techniques applied to civil engineering problems have been published in recent years. ML techniques consist of computational models capable of autonomously acquiring knowledge. These models make decisions and can predict new results based on patterns acquired from previous data. As examples, we can cite Yaseen et al. [7], who applied ML techniques to measure the shear strength of reinforced concrete beams and concluded that these algorithms can be useful tools for professionals. Pettres and de Lacerda [8] obtained positive results in the recognition of defect patterns in concrete with the use of Artificial Neural Networks (ANN). ML-based algorithms are also being successfully used in the field of Structural Health Monitoring, especially in applications involving damage detection in large-scale concrete structures, such as bridges, dams, and buildings [9], [10].
Some authors have also tried to predict the compressive strength of concrete using ML techniques. For example, Hoang et al. [11] applied Gaussian Process Regression (GPR) to predict concrete strength using a dataset of 246 mixtures, defined according to the Vietnamese standard. The authors achieved an R² (coefficient of determination) of 0.90, concluding that these models are a promising alternative to assist engineers on construction sites. In turn, Dao et al. [12] tested the accuracy of ANN and GPR on the dataset assembled by Yeh [13], currently one of the most used worldwide, using a Monte Carlo simulation. The dataset was simply split into 70% of the observations for training and 30% for testing. The authors obtained an R² of 0.89 with the GPR and indicated that these algorithms may contribute to the mix design process. Likewise, Mustapha and Mohamed [14], also using Yeh's [13] dataset without cross-validation, obtained an R² of 0.93 by applying Support Vector Regression (SVR). Finally, Cui et al. [15] used a decision tree model for this same purpose, obtained an R² above 0.80, and concluded that these models are suitable to assist in the mix design of concretes.
Thus, ML techniques are promising tools to predict the compressive strength of concrete. However, no article was found comparing the Extreme Gradient Boosting Decision Tree (XGBoost), GPR, SVR, and ANN for this purpose within the same dataset and boundary conditions. Furthermore, to the authors' best knowledge, no article validated the models trained from the traditional Yeh dataset [13] with a different dataset to test the generalization ability of the models.
In this sense, the present work compares the accuracy of these four ML techniques in predicting the compressive strength of conventional concrete specimens and evaluates the resulting models in terms of their generalizing capabilities to a different dataset. The authors seek, therefore, to find the most suitable technique to use in future predictions and to reflect on the limitations of applying these models to concretes in diverse contexts.
Figure 1 shows an overview of the present work. Initially, four supervised ML models were developed to relate the input features (concrete components and proportions) to the target variable (compressive strength). These models were built using a classic dataset available in the literature, gathered by Yeh [13]. We subsequently evaluated the quality of the prediction through cross-validation and three statistical metrics: the coefficient of determination (R²), the mean absolute error (MAE) and the root mean square error (RMSE). The significance of each input variable (concrete component) in the prediction of the final compressive strength was also investigated. In a second stage, for the validation of these models, the authors assembled a second dataset from 11 articles in the literature, with 22 new observations (in the Appendix). This dataset was then used as test values for the previously created models. The accuracy of this prediction was again assessed using the three metrics described above.

Machine Learning Techniques
As there is no single general model perfectly adaptable to all engineering problems, four supervised models were chosen to be applied to the present study: XGBoost, SVR, ANN and GPR. They were selected based on a preliminary literature analysis, in which we gathered the techniques that had different learning-based backgrounds. Among them, XGBoost, SVR, ANN and GPR were the ones with the most promising performance to deal with similar complex problems.
The authors opted to manually adjust the hyperparameters of the techniques without focusing on specific optimization methods for each one, so that there would be no distinction in the creation processes of these models. The experiments were carried out on a computer with an Intel Core i5-10210U processor and 8 GB of RAM. The algorithms were implemented in Python (version 3.8.6) using the Pandas library to analyze and manipulate datasets, and the scikit-learn, TensorFlow and XGBoost libraries to apply the ML models.
The following sections will provide a summarized description of these methods. For more detailed explanations, the reader may consult the references given at the end of each part.

Extreme Gradient Boosting (XGBoost)
XGBoost has been increasingly used in several research fields because it presents suitable predictions and a short execution time to solve classification and regression problems [16]. This algorithm is based on the classical decision tree technique.
The structure of a decision tree can be described as follows: the tree starts with a major node called "root" that splits into several other nodes. Each of these nodes carries a condition to separate the dataset into subsets that have similar characteristics [17]. Generally, using only one decision tree leads to poor predictions; therefore, ensemble techniques are usually adopted to improve the performance of these models [18]. Ensemble methods consist of combining several trees to achieve more reliable results.
An example of an ensemble method is the boosting technique, which uses "n" weak trees sequentially to create a more robust predictor at the end of training [19]. The focus of the boosting method is to reduce bias and variance with each new model created, based on the difficulties faced by the previous model [20]. XGBoost uses gradient boosting, an extension of the previous method, in which gradient descent is applied to improve the trees according to the error of the previous models.
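XGBoost can be briefly described as follows. For a given dataset $\mathcal{D} = \{(x_i, y_i)\}$ ($|\mathcal{D}| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), with $x_i$ and $y_i$ being the variables (inputs and outputs, respectively), $m$ features, and $n$ observations, the model uses $K$ additive functions to predict the outputs:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
with $\hat{y}_i$ being the model output and $\mathcal{F}$ the space of the regression trees, defined as:
$$\mathcal{F} = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \rightarrow T, \; w \in \mathbb{R}^T)$$
The structure of each tree is represented by $q$, while the number of leaves and their weights are represented by $T$ and $w$, respectively. Also, the term $f_k$ represents an independent tree structure $q$ with leaf weights $w$.
In the regression tree optimization process, the following objective function must be minimized:
$$\mathcal{L} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k)$$
Here, $l$ is a convex loss function that measures the difference between $\hat{y}_i$ and $y_i$, which are, respectively, the prediction given by the model and the real value. The term $\Omega$ penalizes the complexity of the regression trees and is given by:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
However, models that use gradient boosting are trained in an additive way. In these cases, the following objective function is minimized:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(\hat{y}_i^{(t-1)} + f_t(x_i), \, y_i\right) + \Omega(f_t)$$
where $f_t$ is the tree added to the objective function, with $t$ being the iteration number [16].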
Regarding the implementation of the algorithm, this model does not need many adjustments. Hence, the authors carried out some preliminary tests to define its optimal hyperparameters. For a more detailed explanation of this method, the authors recommend the references Chen & Guestrin [16] and Suen et al. [20].
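The additive training idea behind gradient boosting can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses one-split trees ("stumps"), a squared-error loss, a single input feature, and no complexity penalty, so it is not the XGBoost algorithm itself nor the configuration used in this work. The toy data, function names, and the values of `n_trees` and `learning_rate` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a weak learner) by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        w_left = sum(left) / len(left)      # leaf weight = mean residual
        w_right = sum(right) / len(right)
        sse = (sum((r - w_left) ** 2 for r in left)
               + sum((r - w_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, w_left, w_right)
    _, threshold, w_left, w_right = best
    return lambda x: w_left if x <= threshold else w_right


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)                # initial constant prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data: cement content (kg/m3) versus strength (MPa).
xs = [200.0, 250.0, 300.0, 350.0, 400.0, 450.0]
ys = [18.0, 22.0, 27.0, 33.0, 38.0, 44.0]

model = boost(xs, ys)
mse_base = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boost = mse(ys, [model(x) for x in xs])
print(mse_base, mse_boost)
```

Each new stump corrects part of the error left by the previous ones, which is the sequential behaviour described above; XGBoost additionally regularizes the trees through the complexity term.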

Support Vector Regression (SVR)
Support Vector Machine (SVM) is a supervised learning model that creates a hyperplane capable of separating data into distinct classes [21]. There are infinitely many hyperplanes able to perform this task. However, this algorithm seeks to find the one that yields the greatest distance between the classes. For this purpose, the SVM finds the points located on the margins (the support vectors) and maximizes the margin [22]. In other words, the algorithm initially defines a hyperplane that separates the data to later determine the points of each class that are closest to this separator. Finally, it seeks the hyperplane that leads to the greatest distance between the two classes, called the "optimum" hyperplane [23].
In addition to linear problems, these algorithms can be used to solve non-linear problems by using kernels. Applying the kernel to the model increases the number of dimensions of the input space, thus transforming the initially non-separable data into data that is separable by the algorithm [23].
Given that the prediction of concrete strength is a regression problem, the authors used the Support Vector Regression (SVR) variant in this work. It has the same principle as SVM but focuses on solving regression problems.
The SVR can be briefly described as follows. For a dataset $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ represents the space of the input variables, the purpose of the regression is to find a function $f(x)$ that deviates by at most $\varepsilon$ from the real values $y_i$. For the linear function:
$$f(x) = \langle w, x \rangle + b$$
the SVR transforms this problem into a constrained optimization problem:
$$\text{minimize} \quad \frac{1}{2} \lVert w \rVert^2$$
subject to the following restrictions:
$$y_i - \langle w, x_i \rangle - b \leq \varepsilon, \qquad \langle w, x_i \rangle + b - y_i \leq \varepsilon$$
The error of the model's predictions is dealt with within the constraints. The SVR model adopts an ε-insensitive loss function, which penalizes predictions that are farther than ε from the desired output [24].
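The ε-insensitive loss can be made concrete with a small sketch. The function names, weights, and numerical values below are hypothetical and serve only to show that deviations inside the ε tube cost nothing, while larger deviations are penalized linearly.

```python
def linear_predict(weights, bias, x):
    """Linear model f(x) = <w, x> + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical values: a 1 MPa tube around an observed strength of 30 MPa.
inside = epsilon_insensitive_loss(30.0, 30.5, epsilon=1.0)   # within the tube
outside = epsilon_insensitive_loss(30.0, 33.0, epsilon=1.0)  # 3 MPa away
prediction = linear_predict([0.5, -0.25], 2.0, [40.0, 20.0])
print(inside, outside, prediction)
```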
To perform the hyperparameter tuning for the SVR model, the authors varied the kernel coefficient (a.k.a. gamma) and the 'C' regularization parameter randomly from $10^{-2}$ to $10^{3}$. The best results were achieved with gamma and C as 0.6 and 33, respectively. For a more detailed explanation of this method, the authors recommend the references Smola and Schölkopf [24] and Noble [23].

Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN) were developed based on studies of the human brain [25]. These algorithms have been widely applied to solve problems in various fields around the world, due to their robustness in dealing with complex tasks [26], [27]. ANNs consist of several processing elements, called neurons, connected to each other. Figure 2 represents the single neuron model, also known as perceptron. The neuron receives the input values $x_i$; these entries are multiplied by the synaptic weights $w_i$. Each neuron also has a bias $b$. This bias has no input data associated with it, allowing the neuron to change the output independently of the input values. The neuron performs the weighted sum of the received signals. Finally, this sum passes through the activation function $\varphi$ to produce the output $y$:
$$y = \varphi\left(\sum_{i} w_i x_i + b\right)$$
One of the architectures most used by ANN models is the Multilayer Perceptron (MLP). In MLPs, neurons are divided into the input layer, hidden layers, and output layers, as shown in Figure 3 [28].

Figure 3. Representation of an MLP with a hidden layer
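For an MLP like the one depicted in Figure 3, the mechanism of a single neuron is applied to each of the layers:
$$y_j^{(l)} = \varphi\left(\sum_{i} w_{ji}^{(l)} \, y_i^{(l-1)} + b_j^{(l)}\right)$$
where $\varphi$ is the activation function and the outputs of layer $l-1$ act as the inputs of layer $l$. Therefore, $y_j^{(l)}$ provides the output of each neuron $j$ in its respective layer $l$ [22]. For a more detailed explanation of this method, the authors recommend the references Garcia [25] and Barreto [28].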
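The layer-by-layer mechanism can be sketched as a forward pass in plain Python. The weights, biases, and the choice of ReLU for the hidden layer are illustrative assumptions; the source does not state which activation function was used, and the actual models were built with TensorFlow.

```python
def relu(z):
    """Rectified linear unit, assumed here as the hidden activation."""
    return max(0.0, z)


def identity(z):
    """Linear output, a common choice for regression."""
    return z


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer(inputs, weight_rows, biases, activation):
    """Apply the single-neuron mechanism to every neuron of a layer."""
    return [neuron(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
x = [1.0, 2.0]
hidden = layer(x, [[0.5, 0.25], [-1.0, 0.25]], [0.1, 0.0], relu)
output = layer(hidden, [[2.0, 3.0]], [0.5], identity)
print(hidden, output)
```

In this toy network the second hidden neuron receives a negative weighted sum, so ReLU sets its output to zero and only the first hidden neuron contributes to the prediction.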
To define the number of hidden layers and the number of neurons per ANN layer for the present work, the authors conducted a sensitivity analysis. The model was trained several times with Yeh's dataset [13], varying the number of layers from 1 to 7, and the number of neurons from 4 to 512, per layer. From the analysis of the evaluation metrics (section 2.3.5), the final model with 5 hidden layers and 256 neurons was implemented.

Gaussian Process Regression (GPR)
Gaussian Process Regression (GPR) is a non-parametric regression technique that uses a probability distribution to predict the outcome. Through the provided training data, this technique uses Bayes' rule to update the probabilities of each function representing the model [29]. The main advantage of the GPR is that it provides an approximation of the uncertainty of each forecast [29].
The GPR can be defined as follows:
$$f(x) \sim \mathcal{GP}\left(m(x), \, k(x, x')\right)$$
where $m(x)$ is a mean function and $k(x, x')$ a covariance (or kernel) function of the Gaussian distribution [30] for samples $x$ and $x'$. Choosing the kernel function is one of the most important steps in implementing this model. As in the SVR models, these functions are responsible for smoothing the function being modelled, which will affect the quality of the prediction [30].
This work adopted the Radial Basis Function kernel (RBF). RBF is a stationary kernel function that uses the squared Euclidean distance between two vectors, as follows [31]:
$$k(x, x') = \exp\left(-\frac{d(x, x')^2}{2 l^2}\right)$$
with $d(x, x')$ being the Euclidean distance and $l$ the kernel function length scale [30]. Based on previous validations, considering several different simulations, this function proved to be the most suitable for the present study.
As in the previous models, a hyperparameter optimization for the GPR was also performed. For this purpose, the parameter "alpha" of the model was randomly varied from $10^{-3}$ to $10^{2}$. This hyperparameter is the value added to the diagonal of the kernel matrix during the process. The value 0.2 was adopted. For a more detailed explanation of this method, the authors recommend the references Rasmussen [29] and Williams and Rasmussen [30].
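The role of the RBF kernel and of the "alpha" value can be sketched by building a small kernel matrix by hand. The alpha of 0.2 comes from the text; the length scale of 1.0 and the sample points are assumptions made only for this example, and the helper names are hypothetical.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel based on the squared Euclidean distance between a and b."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def kernel_matrix(points, length_scale=1.0, alpha=0.2):
    """Kernel matrix of the training points with alpha added to the diagonal."""
    n = len(points)
    return [[rbf_kernel(points[i], points[j], length_scale)
             + (alpha if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


# Hypothetical (already rescaled) samples with two features each.
points = [[0.0, 0.0], [0.5, 0.0], [3.0, 3.0]]
K = kernel_matrix(points)
for row in K:
    print([round(v, 3) for v in row])
```

Nearby samples get a covariance close to 1 and distant samples a covariance close to 0, while the diagonal term acts as a noise allowance on the training observations.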

Data analysis
Choosing the right technique, as well as defining a proper dataset, are essential steps in the framework of machine learning. For instance, using a tool that performs well on several problems, but training it with unrepresentative data, will result in poor predictions [32], [33].

Training dataset
In the present work, the dataset of concrete compositions was gathered from data available in the literature. For the first part of the construction of the models, the authors used the "Concrete Compressive Strength Data Set" from the studies carried out by Yeh [13]. This dataset has eight input features: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and Age. The set also has the output feature Compressive Strength of Concrete, ranging from 2 to 82 MPa. The complete dataset has 1030 distinct observations (entries). The dataset comprises mixtures from 17 different sources, most of them originating from research carried out between 1987 and 1997 in Taiwan. These mixtures comprised specimens of different shapes and sizes. Thus, the original author performed a standardization, through correlation indices from the literature, so that all the compressive strength results corresponded to 15-cm cylindrical specimens. In addition, the author specified that the coarse aggregate of all the mixtures had dimensions below 20 mm and that the superplasticizers came from several manufacturers [13].
Pre-processing steps include data preparation prior to making predictions. In general, this part consists of solving scaling problems and handling outliers and missing values, which directly impact the performance of the models [32]. In the present work, the feature responsible for informing the concrete curing time (age) was not used. Due to the design convention of 28 days to achieve the target concrete strength for conventional purposes, only instances that had this age were used. All data referring to other curing times were removed from the set to avoid biases. This step reduced the number of observations from 1030 to 419.
A second adjustment included filtering the values of the output variable. This work aims to evaluate normal-strength concretes, whose values vary between 15 and 50 MPa [34]. As the outliers are largely responsible for hindering the modelling of the phenomenon, the authors decided to remove all data above 50 and below 15 MPa, seeking to obtain a more robust model for the defined strength range. Thus, in total, 329 observations (32% of the initial dataset) were adopted for the creation of the models. Table 1 shows the characteristics of the final dataset used.
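The two filtering steps can be sketched as follows. The rows and column names are made up for illustration and are not taken from Yeh's dataset; the original work used the Pandas library, whereas this sketch uses plain Python to stay self-contained. Keeping the bounds of 15 and 50 MPa inclusive is an assumption.

```python
# Hypothetical rows; the keys are illustrative column names.
records = [
    {"cement": 540.0, "water": 162.0, "age_days": 28, "strength_mpa": 79.99},
    {"cement": 332.5, "water": 228.0, "age_days": 270, "strength_mpa": 40.27},
    {"cement": 266.0, "water": 228.0, "age_days": 28, "strength_mpa": 45.85},
    {"cement": 190.0, "water": 200.0, "age_days": 28, "strength_mpa": 12.50},
    {"cement": 300.0, "water": 185.0, "age_days": 28, "strength_mpa": 30.00},
]


def select_normal_strength(rows, age=28, low=15.0, high=50.0):
    """Keep only rows at the given age and within the strength range,
    then drop the age column, which is no longer informative."""
    kept = [r for r in rows
            if r["age_days"] == age and low <= r["strength_mpa"] <= high]
    return [{k: v for k, v in r.items() if k != "age_days"} for r in kept]


selected = select_normal_strength(records)
print(len(selected), "of", len(records), "rows kept")
```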

Validation dataset
To test the generalization ability of the implemented models, the authors assembled a new dataset with concrete mixtures available in the literature. Mixtures were taken from 11 articles (listed in the Appendix), which originated from 8 different countries, including Brazil. For the validation set to be compatible with the model, the compressive strength was standardized to correspond to 150 × 300 mm cylindrical specimens (the same ones used in Yeh's dataset), using the correlations from Yi et al. [35]. In addition, the same data filtering was performed to consider only strengths between 15 and 50 MPa. Thus, the final validation set had 22 observations, described in Table 2. The complete dataset can be provided on request to the corresponding author.

Data Rescaling
When working with ML, another important factor is the scale of the data. Some models do not perform well with inputs that have different scales, which can lead the model to prioritize a given input simply because it has a bigger scale [36]. Regarding the present work, Table 1 shows that the data referring to the superplasticizer range from 0 to 22 kg/m³, while the coarse aggregate values range from 801 to 1145 kg/m³, evidencing that the data from different features are not of the same order of magnitude. Thus, the authors rescaled the input data, as follows:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the original input value, $\mu$ is the average, $\sigma$ is the standard deviation, and $z$ the modified input value. After the rescaling step, all values are centered at zero with a standard deviation equal to 1.
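The rescaling step can be checked with a short sketch. The sample values are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical superplasticizer contents in kg/m3.
superplasticizer = [0.0, 5.5, 11.0, 22.0]
z = standardize(superplasticizer)
print([round(v, 3) for v in z])
print(round(mean(z), 6), round(pstdev(z), 6))
```

After the transformation, every feature is expressed in units of its own standard deviation, so a feature with large raw values no longer dominates one with small raw values.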

Cross-validation (k-fold)
Cross-validation is a technique widely used to assist in the evaluation of ML models [37]. It consists of randomly dividing the data into "k" sets, each of which is used to validate the model once [38], [39]. This strategy provides a less biased assessment compared to common techniques such as just splitting data once into training and testing.
This study adopted k=10, which is widely used in the literature for similar problems [5], [11], [40]. Initially, the complete dataset is randomly divided into 10 subsets or folds. In the first iteration, the first subset is used to test the model, after all the others have been used to train it. In the next iteration, the algorithm uses the second subset to test the model after it has used all the remaining subsets for training. This procedure is repeated until all 10 sets have been used to test the model, as illustrated in Figure 4. The results shown in this work correspond to the mean of the 10 iterations.
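The splitting procedure can be sketched with a small generator. The function name and the fixed random seed are hypothetical; the 329 samples and k=10 follow the text. Model training and scoring are omitted, so only the index bookkeeping is shown.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random division into folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(329, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

In a full pipeline, a model would be fitted on `train` and scored on `test` inside the loop, and the reported metric would be the mean over the 10 iterations.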

Assessment Metrics
Three quantitative metrics were used to assess the performance of each model, aiming, together, to provide a global analysis of its accuracy. They are the coefficient of determination (R²), the mean absolute error (MAE) and the root mean square error (RMSE). They are widely used to assess regression models for this type of problem [5], [41], [42].
The R² is calculated using (14) [43], where $\hat{y}_i$ is the value predicted by the model, $y_i$ is the observed value, $\bar{y}$ is the mean of the observed values, and $n$ is the number of observations:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{14}$$
R² results in a number between minus infinity and 1. When the analyzed model fits perfectly to the data, the R² will assume the value 1, indicating that the predictors are able to explain all the variability of the data [44]. As the R² compares the performance of the tested model with a flat line (a baseline model in which all predictions will be the mean value of the outputs), if the assessed model presents a worse fit than the line that represents the mean value, the R² will be negative.
The MAE measures the average magnitude of the errors (the difference between observed and predicted values), regardless of their direction. It can be determined using (15) [45]:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{15}$$
In the MAE, large errors caused by outliers weigh less heavily, because this metric is absolute and not quadratic [44].
Finally, the RMSE ((16) [45]) is a widely used metric when the researcher wants to measure the average magnitude of the errors [44]:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{16}$$
Unlike the MAE, the RMSE increases considerably as the error of each prediction increases, because the errors are squared before being averaged.
Both MAE and RMSE range from zero to positive infinity. The lower these metrics, the better the model.
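The three metrics can be computed directly from their standard definitions, as in the sketch below. The observed and predicted strengths are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical observed and predicted strengths in MPa.
observed = [20.0, 30.0, 40.0]
predicted = [22.0, 29.0, 37.0]

r2_value = r2_score(observed, predicted)
mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
r2_baseline = r2_score(observed, [30.0, 30.0, 30.0])  # always predict the mean
print(r2_value, mae_value, rmse_value, r2_baseline)
```

Predicting the mean for every sample yields an R² of zero, which is the baseline against which the models are compared; the RMSE is never smaller than the MAE, since squaring gives more weight to the largest error.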

Significance of the input features
Finally, the authors sought to understand the impact that each feature had on the predictions. For this evaluation, the decision tree technique (XGBoost) was used. In these models, each node has a condition to split the values so that similar instances end up in the same set. The condition is based on the Gini impurity for classification problems and on the variance for regression problems [46]. Thus, when a decision tree-based model is trained, it intrinsically calculates how much each variable contributes to reducing the variance and, consequently, it can estimate how useful each variable is to the construction of the model. For a dataset with $C$ classes, (17) calculates the Gini impurity, with $p_i$ being the probability of class $i$ [47]:
$$G = 1 - \sum_{i=1}^{C} p_i^2 \tag{17}$$
The Gini impurity ranges from 0 to 1, with 0 corresponding to a pure node. The smaller the Gini, the more important that variable is for the tree.

Creation and evaluation of the models
Table 3 summarizes the evaluation metrics (R², MAE and RMSE) of the four models created to predict the compressive strength of conventional concrete specimens. In this initial stage, the models were trained and cross-validated with the Yeh [13] dataset. The XGBoost achieved the best correlation between predicted and observed values, reaching an R² of 0.83. On the other hand, SVR had the worst performance (R² = 0.79), although it was very close to the other models (R² = 0.82).
As part of the assessment of the best models to develop future studies, the authors have also recorded the time required to process each algorithm. Due to the small amount of data available for training, the running time ranged from 0.15 (SVR) to 69.73 seconds (ANN). Despite both being relatively short periods, the processing time for the ANN model was approximately 465 times that of the SVR, 50 times that of the XGBoost and 19 times that of the GPR. This result means that the application of ANN to larger datasets may be impractical depending on the situation.
The best model in this article obtained a lower R² than that of other authors who used the same dataset put together by Yeh [13]. For example, Dao et al. [12] used GPR and ANN to obtain the compressive strength of concrete and reached an R² of 0.89 (against our 0.82 shown in Table 3). However, as opposed to the current work, these authors used the curing time as one of the features and evaluated all strength ranges. This means that they had access to a bigger dataset and their metrics were boosted by "easier" predictions (since the variability of the concrete strength at 3 and 7 days is usually much lower than that at 28 days). For comparison purposes, applying the complete dataset to our models would result in R² ranging from 0.87 to 0.93.
Similarly, Mustapha and Mohamed [14] applied SVR to the Yeh [13] dataset, obtaining R² up to 0.93 (versus 0.79 in this work). However, Mustapha and Mohamed [14] not only used the complete dataset (all ages and strengths) but also did not perform cross-validation to remove possible bias when splitting the data for training and testing.
It is also possible to compare the accuracy of our models with works in which the authors produced their own concrete specimens. For example, Lam et al. [48] produced 75 specimens to obtain the data used in their algorithms. They built an ANN-based model that obtained R² = 0.92 (versus R² = 0.82 in the current work). However, this type of approach can limit the generalization ability of the model, as the algorithms learned from only one homogeneous source of concrete.
Regarding the other metrics, the XGBoost and ANN models obtained very similar RMSE and MAE results, around 3.40 MPa and 2.24 MPa, respectively. GPR obtained a lower MAE, 1.96 MPa, and a slightly higher RMSE, 3.43 MPa. As with the R² results, the SVR presented the worst results, with an MAE of 2.26 MPa and an RMSE of 3.73 MPa. It is noteworthy that the models proposed in the current work resulted in relatively close MAE and RMSE values. At first glance, these results indicate a good performance of the models.
Comparatively, Dao et al. [12], mentioned above, obtained an RMSE of 5.46 MPa and an MAE of 3.86 MPa while using the complete dataset, including compressive strengths higher than 50 MPa. Under the same conditions, Mustapha and Mohamed [14] reached an MAE of 5.89 MPa. We can also mention Hoang et al. [11], who achieved an RMSE of 4.04 MPa, even though they created their own dataset of 246 specimens (ranging from 13.5 to 85.2 MPa).
It is important to remember that the RMSE is influenced by the square of the individual errors [44]. Thus, large errors are weighted more heavily than small ones. Therefore, this metric is recommended to evaluate models when large errors are particularly undesirable (such as in the prediction of concrete strength). However, Willmott and Matsuura [45] argue that the RMSE should not be used to compare two or more models, as this value varies according to the scale of the errors. The authors claim that the MAE is a metric that represents the magnitude of the error more naturally and, therefore, comparisons between different models should be based on the MAE.
Given the heterogeneous nature of cement-based composites and the infrastructure of construction sites, the calculation of the target mean strength of concrete is usually influenced by the quality control of its preparation. In Brazil, these parameters are set by NBR 12655 [49]. The smallest standard deviation value for the calculation of this strength, considering the best preparation conditions, normal-strength concrete, and no prior experiments, is 4.0 MPa [49]. Thus, both the RMSE and MAE values for all models were below the standard deviation indicated by NBR 12655. Important note: this comparison is not a measure of the safety of this mix design methodology, but it shows that the weighted average of errors obtained through the ML algorithms is smaller than the typical variability considered among specimens at a construction site.
Regarding individual errors, Figure 5 shows the frequency distribution of absolute errors (the difference between predicted and observed values) for all the mixtures in the dataset, regardless of direction. For all the models, at least 84% of the errors (275 instances, for SVR) fell below 5 MPa, reaching 91% (300 instances) for ANN. Conversely, for any algorithm, less than 3% of the predictions (10 instances) deviated more than 10 MPa from the real values. However, the maximum absolute error reached between 18.78 and 21.12 MPa, which is a significant value.
Seeking to understand the factors that led to these high singular errors, the authors assembled the 10 concrete mixtures that led the models to the biggest deviations, shown in Table 4. This table reveals that 3 observations are repeated in all models (being the top 3 errors of the XGBoost, ANN, and GPR), and another 3 are repeated in 3 models.
When analyzing the observations that presented the highest errors, one notices that they refer to concretes with unconventional proportions of materials. For example, mixture #1 of XGBoost (which was also #1 in SVR, ANN, and GPR) has only 200 kg/m³ of Portland cement (and another 200 kg/m³ of blast furnace slag), an unusual w/c ratio of 0.95, and still reached 49.25 MPa (versus an average of 27.7 MPa predicted by the algorithms). Conversely, mixture #2 in XGBoost (which was #3 in GPR and ANN, and #7 in SVR) has a cement consumption of 436 kg/m³, a w/c ratio of 0.5, and only reached 23.85 MPa (while the algorithms predicted approximately 38.9 MPa). The other observations that were repeated in the top 10 errors also showed mix proportions that are not commonly found in conventional concretes (e.g., over 30% of mineral admixtures in relation to cement mass). Assuming that these results are not due to typing mistakes or experimental issues, they indicate:
• the relevance of the input data for the construction of models with good quality predictions, bearing in mind that the data must be representative of the problem studied;
• the necessity to collect multiple observations of concrete mixes of all types if one wants to create mix design tools that are as generalizable as possible.
Figure 6 shows the importance of each feature to the construction of the boosted decision trees within the model, obtained with the XGBoost technique. The more a feature is used to make key decisions during the construction of the model, the higher its relative significance will be. As expected, the cement has the greatest relative impact among the input features, while the aggregates had the smallest. Following the cement, we observed a significant influence of supplementary cementitious materials (mineral admixtures, such as blast furnace slag and fly ash). This is explained by the fact that all these binders have characteristics that significantly increase the strength of concrete [1], [50], [51].
The feature ranking obtained with XGBoost indicates that the model interpreted the data well.
Table 5 presents the performance of the models trained with Yeh's dataset [13] when validated with the new dataset of 22 instances assembled by the authors. No model produced good predictions, with R² falling from 0.79-0.83 (Table 3) to 0.37-0.59 and MAE growing from 1.96-2.26 MPa to 3.04-4.04 MPa.
The characteristics of Yeh's dataset [13] may explain this scenario. First, this dataset was built from relatively old studies (between 1987 and 1997), which is probably a major source of inaccuracies given the technological advancements in construction materials, especially Portland cement and chemical admixtures. Additionally, most of these works were carried out in Taiwan, using relatively homogeneous local materials and coarse aggregates with a maximum size of 20 mm. Thus, the dataset cannot represent the variability of concretes on a global scale. This low generalization ability is even more worrisome because a significant portion of the articles on the application of artificial intelligence to concrete mix design use this dataset.
The regional peculiarities of concrete components are well known to professionals in this field. For example, even within Brazil, cements and concretes from the South region tend to adopt pozzolanic admixtures, while those from the Southeast region commonly incorporate blast furnace slag [52]. However, although this heterogeneity is known empirically, studies measuring its impact on algorithms for concrete mix design are still lacking.
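The three metrics used to compare training and validation performance can be computed on any external dataset as sketched below. This is a minimal illustration that assumes R² is the coefficient of determination (one minus the ratio of residual to total sum of squares); the source does not give the formula it used. The function name and the toy strength values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return MAE, RMSE (both in the units of the data) and R2."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "MAE": sum(abs(r) for r in residuals) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical strengths in MPa; NOT values from the validation dataset.
observed = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 28.0, 43.0, 47.0]
metrics = regression_metrics(observed, predicted)
```

With this definition, R² depends on the spread of the observed strengths: for the same absolute errors, a validation set with a narrow strength range yields a lower R² than one with a wide range, which is worth keeping in mind when comparing datasets of different composition.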

Considerations on model generalization
To allow the development of safe, efficient, and economical mix design tools, the authors see two possibilities: 1) each country or region works with its own dataset to generate models adapted to the local reality; or 2) databases are created with more input features, such as country of origin, maximum aggregate size, and type of cement, thus allowing fewer mix design tools that are nonetheless highly adaptable to different types of concrete.
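As an illustration of the second possibility, the sketch below shows one way a record with additional context features could be represented, and how such a database would allow a model to be tested on a region it never saw during training. Everything here is an assumption made for illustration: the field names, the `Mix` class, the `leave_one_region_out` helper, and all values are hypothetical and do not come from any dataset cited in this work.

```python
from dataclasses import dataclass


@dataclass
class Mix:
    """One concrete mixture with hypothetical context features."""
    country: str
    max_aggregate_mm: float
    cement_type: str
    cement_kg_m3: float
    water_kg_m3: float
    strength_mpa: float


def leave_one_region_out(mixes):
    """Yield (held_out_country, train, test) so that each country is tested unseen."""
    for held_out in sorted({m.country for m in mixes}):
        train = [m for m in mixes if m.country != held_out]
        test = [m for m in mixes if m.country == held_out]
        yield held_out, train, test


# Hypothetical records; all values are illustrative only.
mixes = [
    Mix("Taiwan", 20.0, "type_a", 350.0, 180.0, 35.0),
    Mix("Taiwan", 20.0, "type_a", 300.0, 190.0, 28.0),
    Mix("Brazil", 25.0, "type_b", 380.0, 185.0, 33.0),
    Mix("Brazil", 19.0, "type_c", 320.0, 175.0, 30.0),
]
splits = list(leave_one_region_out(mixes))
```

Each split trains on every country except one and evaluates on the held-out country, which gives a direct, if simplified, measure of how well a model transfers between regions.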

CONCLUSION
This article compared four machine learning techniques to predict the compressive strength of conventional concrete specimens from their components. A well-known database, compiled by Yeh [13], was used to train four models: Gaussian Process Regression (GPR), Extreme Gradient Boosting (XGBoost), Artificial Neural Networks (ANN), and Support Vector Regression (SVR). After evaluating these models, the authors put together a new database to validate them. This test sought to analyze how well the models generalize to new concrete mixes.
In the first stage, the GPR, XGBoost, and ANN models obtained R² > 0.82, while SVR had the worst performance, with R² = 0.79. For all algorithms, the MAE was below 2.26 MPa and the RMSE below 3.73 MPa, which the authors consider relatively positive results compared to the minimum standard deviation prescribed in real mix design procedures. Although better correlations have been reported in the literature, our work adopted a more conservative approach, considering only the strength at 28 days.
To identify the causes of the inaccuracies in the proposed models, we ranked the top 10 mix proportions with the greatest deviations between the predicted and observed results. Most of them appeared in at least 3 algorithms, indicating that the issue was probably related to these particular mix proportions rather than to the proposed models. Indeed, the authors observed that these entries had unconventional percentages of admixtures compared with what is normally observed in conventional concretes. This result highlights the importance of the input data for the development of high-quality prediction models.
Because of the relatively small number of observations, running time was not a significant criterion for selecting the best algorithm. A future study should therefore analyze how the dataset size affects the processing time of the models.
In the validation step, the quality of the models dropped sharply, with the best R² being only 0.59 (for the GPR model). The main contribution to this result was probably the difference between the characteristics of the dataset used for validation and the one used for model training. The models were created from Yeh's classic dataset [13], which can be considered relatively homogeneous in terms of the origin of the concrete observations and the aggregate sizes.
This result shows that the regionalization and homogeneity of some datasets can lead to false-positive results in the search for universal concrete mix design strategies. In a future study, the authors intend to quantitatively assess the generalization ability of the models. Furthermore, joint initiatives are needed to build a more comprehensive and varied database of concrete properties. Until that happens, the authors recommend that ML models for concrete mix design be limited to predicting the strength of specimens from the same laboratories whose data were used to train them.
In summary, this article showed that ML techniques are potentially viable for predicting the compressive strength of concrete. For now, more studies on the creation and validation of bigger and more varied databases are needed. In the near future, however, this approach may reduce the time and resources currently spent on mix design processes.