Prediction of restrained shrinkage crack width of slag mortar composites using data mining techniques

The purpose of this study is to develop data mining models to predict restrained shrinkage crack widths of slag mortar cementitious composites. A database published by BILIR et al. [1] was used to develop these models. The R environment was used as the modelling tool to apply the data mining (DM) techniques. Several algorithms were tested and analyzed using all the combinations of the input parameters. It was concluded that, with one or three input parameters, the artificial neural network (ANN) models have the best performance. Nevertheless, the best forecasting capacity was obtained with the support vector machine (SVM) model using only two input parameters. Furthermore, this model has a better predictive capacity than the adaptive-network-based fuzzy inference system (ANFIS) model developed by BILIR et al. [1], which uses three input parameters.


INTRODUCTION
This study aims to estimate the drying shrinkage crack widths of mortars containing Granulated Blast Furnace Slag (GBFS) as fine aggregate. Mortars are cementitious materials widely used in the construction industry. GBFS is commonly used as a fine aggregate substitute in mortar in order to reduce the environmental problems related to aggregate mining and waste disposal. One of the most frequent problems in mortars is cracking, which can be caused by several factors. One of these factors is drying shrinkage, which causes tensile stresses in the mortar when its free shortening is restrained. Drying shrinkage occurs after the mortar has set and originates from the evaporation of the free capillary water inside the mortar that was not consumed in the hydration reactions of the cement.
Here, the estimation of drying shrinkage is based on data mining techniques. These techniques include decision trees, artificial neural networks (ANN), support vector machines (SVM) and k-nearest neighbours (k-NN). The scientific literature contains many applications of intelligent tools to mortars, most of them based on artificial neural networks. Thus, artificial neural networks were used to predict the compressive strength of mortar: for different cement grades (ESKANDARI et al. [2]); for mixtures containing different cement strength classes (ESKANDARI-NADDAF and KAZEMI [3]); containing metakaolin (SARIDEMIR [4]); using different saw waste for sand replacement (MAHZUZ et al. [5]); for different scoria percentages instead of sand (RAZAVI et al. [6]). Artificial neural networks were also used to predict rubberized mortar properties (TOPÇU and SARIDEMIR [7]), to model the influence of salt on the desorption isotherm and hygral state of cement mortar (KONIORCZYK and WOJCIECHOWSKI [8]), to evaluate the sand/cement ratio of mortar using ultrasonic transmission inspection (MOLERO et al. [9]) and to establish a relationship between microstructural characteristics and compressive strength of cement mortar (ONAL and OZTURK [10]). Fuzzy logic was used to predict rubberized mortar properties [7] and the compressive strength of mortars containing metakaolin [4]. An adaptive neuro-fuzzy methodology was used by İNAN et al. [11] to predict the sulfate expansion of PC mortar. Optimized support vector machines were used by NAZARI and SANJAYAN [12] to model the compressive strength of geopolymer paste, mortar and concrete, and a genetic algorithm-artificial neural network model to predict the compressive strength of cement mortar was created by AKKURT et al. [13].
The previous studies do not include the application of intelligent tools either to drying shrinkage or to cracking of mortars. Therefore, there is a gap in this field, as the scientific literature has focused only on the application of intelligent tools to concrete shrinkage [14,15,16,17,18] or cracking [19,20,21,22,23]. No mention was found in the literature of the prediction of restrained shrinkage crack widths of mortars or concretes by data mining techniques.
The main purpose of the present study is to construct predictive models for the reliable estimation of drying shrinkage crack widths of mortars containing GBFS as fine aggregate using several data mining techniques. For this purpose, a database published by BILIR et al. [1] was used to construct several prediction models. A total of 358 records were used, each including the replacement ratio of GBFS as fine aggregate in the mortar (RR), the drying time of the ring specimens (DT), the free shrinkage length change of the GBFS mortar (FS) and the crack width of the GBFS mortar (CW) measured in the ring test. This paper addresses two main challenges. The first is to verify whether at least one of the DM techniques gives better results than the ANFIS used by BILIR et al. [1]. The second is to check whether good results can be obtained with a reduced number of input parameters.

Definitions
The exponential development of computational tools in recent decades has allowed the storage of large amounts of information. The need to extract useful knowledge from this information led to the emergence of so-called Knowledge Discovery in Databases (KDD). KDD is an iterative process consisting of several steps, such as data selection, preprocessing of target data, data transformation, data mining and interpretation [24]. Data mining aims to extract useful patterns from data sets for decision-making purposes by applying automatic learning methods. DM can be applied to classification and regression tasks. The regression task consists of mapping several input variables to a numeric output. Usually in the DM process the dataset is split into two subsets, called the training set and the testing set. The former is used in the learning process of the algorithms, whereas the latter is used to test them. During the learning process the parameters of the algorithms are adjusted to optimize the results. The accuracy of the algorithms is assessed through metrics based on errors and on the correlation coefficient. The validated algorithms are then used as models to predict the value of the output variables.

DM algorithms
There are several DM algorithms, such as Regression Trees (RT), Multiple Regressions (MR), Artificial Neural Networks (ANN), Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN). A brief description of each is provided next.
Decision trees [25] have an inverted tree structure composed of nodes and descendant branches. The result of a test performed at each node indicates the branch along which the process continues. This process is repeated until a final decision can be made and a class is attributed to the record. Regression trees are a particular case of decision trees in which classes are replaced by numeric values (Figure 1). Multiple regressions are similar to simple regressions but use several independent variables instead of one.
ANN is a technique that seeks to mimic the way the human brain works. An artificial neural network has an input layer, one or more hidden layers and an output layer, each consisting of several neurons. The neurons in two adjacent layers are linked by connections with associated weights $w_{i,j}$ (where $i$ and $j$ are neurons or nodes) that determine the importance of each input. Each neuron has an activation function that introduces a non-linear component [26,27]. This study used the logistic activation function $f(x) = 1/(1+e^{-x})$ and the following general equation:
$$\hat{y} = w_{o,0} + \sum_{j=I+1}^{o-1} f\left(\sum_{i=1}^{I} x_i\, w_{j,i} + w_{j,0}\right) w_{o,j} \qquad (1)$$
where $x_i$ are the input parameters or nodes, $I$ is the number of input parameters and $o$ is the output parameter.
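The forward computation of such a network can be illustrated with a minimal Python sketch. It assumes a single hidden layer with logistic activations, a linear output node and one bias term per node; the function names and all weight values are hypothetical and chosen only for illustration, not taken from the fitted models of this study.

```python
import math


def logistic(x):
    """Logistic activation f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer network with a linear output node.

    x              : list of I input values
    hidden_weights : one list of I weights per hidden node
    hidden_biases  : one bias per hidden node
    output_weights : one weight per hidden node (hidden -> output)
    output_bias    : bias of the output node
    """
    # Each hidden node applies the logistic function to its weighted input sum.
    hidden = [
        logistic(sum(xi * wi for xi, wi in zip(x, w_row)) + b)
        for w_row, b in zip(hidden_weights, hidden_biases)
    ]
    # The output node is a weighted sum of the hidden activations plus a bias.
    return output_bias + sum(h * w for h, w in zip(hidden, output_weights))


# Hypothetical weights for a network with 2 inputs and 2 hidden nodes.
W_HIDDEN = [[0.5, -0.3], [0.8, 0.1]]
B_HIDDEN = [0.0, -0.2]
W_OUT = [1.2, -0.7]
B_OUT = 0.1

prediction = mlp_predict([0.25, 0.6], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
print(round(prediction, 4))
```

In practice the weights are not set by hand but adjusted during training, and the number of hidden nodes is a tuning parameter.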
The multilayer perceptron (feed-forward network) is the most popular neural network architecture [27] (Fig. 2). Thus, this type of neural network was adopted in this study, with one hidden layer containing HN (hidden nodes) processing units. The ANN performance depends on the number of hidden nodes.
The SVM technique was initially proposed for classification problems by Vladimir Vapnik and his coworkers [28]. It uses a nonlinear mapping to transform the inputs into a multidimensional feature space (Fig. 3). After this transformation, SVM finds the best linear separating hyperplane, related to a set of support vector points, within the feature space. The nonlinear mapping depends on a kernel function $K(x,x')$. This work uses the Gaussian kernel function, which is the most popular one:
$$K(x,x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right), \quad \gamma > 0 \qquad (2)$$
The application of SVM to regression problems became possible after the introduction of the ε-insensitive loss function [29]. SVM performance is affected by the meta-parameters C and ε and by the kernel parameter γ. Parameter C determines the trade-off between the model complexity (flatness) and the degree to which deviations larger than ε are tolerated in the optimization formulation [30]. Parameter ε controls the width of the ε-insensitive zone used to fit the training data, and its value can affect the number of support vectors used to construct the regression function [30]. In order to limit the search space, C and ε were set using the heuristics proposed in [30].
The k-Nearest Neighbor algorithm [31] is a simple supervised learning algorithm that can be used in classification and regression problems. In classification problems a query instance is classified according to the classes of its neighbors (Figure 4): the dominant class among the nearest neighbors is attributed to the query instance. In regression problems the property value of the query instance is obtained as the weighted average of the values of the k nearest neighbors. This requires calculating the distance between the target and its neighbors in the multidimensional space. Generally, weights are attributed according to distance, so that the closest neighbors are given more weight than more distant ones.
In this work, the whole dataset was divided in exactly the same way as by BILIR et al. [1]. Thus, the training set consisted of approximately two-thirds of the total dataset (238 records) and the testing set was built with the remaining data (120 records). The performance of the models was evaluated through a 10-fold cross-validation process, by dividing the training dataset into 10 subsets of approximately equal size [32]. A single subset is retained as the validation data for testing the model, and the 9 remaining subsets are used as training data. The cross-validation process is repeated 10 times, with each of the 10 subsets used once as the validation data. Under this scheme every record of the training dataset is used for both fitting and validation. By averaging the values obtained in each of the folds, a single value is obtained for each of the considered performance measures. The model is then retrained on the entire training dataset, whereas the testing dataset, composed of one third of the total data, is used only to validate the final model.
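The 10-fold cross-validation scheme and the distance-weighted k-NN regression described above can be sketched together in Python. This is only an illustrative sketch: the inverse-distance weighting, the value k = 3, the random seeds and the synthetic data are assumptions made for the example, and the data are not the BILIR et al. [1] database. All function names are hypothetical.

```python
import math
import random


def knn_predict(train_x, train_y, query, k=3):
    """Distance-weighted k-NN regression (assumed inverse-distance weights)."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # An exact match would give an infinite weight, so return its value directly.
    for d, y in nearest:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def k_fold_indices(n, folds=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into `folds` near-equal subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::folds] for f in range(folds)]


def cross_validated_rmse(xs, ys, folds=10, k=3):
    """Mean RMSE over the folds; each fold is held out once for validation."""
    scores = []
    for held_out in k_fold_indices(len(xs), folds):
        held = set(held_out)
        fit_x = [x for i, x in enumerate(xs) if i not in held]
        fit_y = [y for i, y in enumerate(ys) if i not in held]
        preds = [knn_predict(fit_x, fit_y, xs[i], k) for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return sum(scores) / len(scores)


# Synthetic stand-in data (NOT the study database): 238 records, 2 inputs.
rng = random.Random(42)
X = [(rng.random(), rng.random()) for _ in range(238)]
Y = [2.0 * a + b for a, b in X]

cv_rmse = cross_validated_rmse(X, Y)
print(f"10-fold CV RMSE: {cv_rmse:.4f}")
```

With 238 records the folds cannot all have the same size, so the sketch deals the shuffled indices into subsets whose sizes differ by at most one record.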
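The performance of regression models can be evaluated using different metrics. This study uses the root mean squared error (RMSE) and the coefficient of correlation (R), defined as:
$$RMSE = \sqrt{\frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}}{N}} \qquad (3)$$
$$R = \frac{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^{2}\,\sum_{i=1}^{N}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}} \qquad (4)$$
where $N$ denotes the number of examples, $y_i$ the real value, $\hat{y}_i$ the value estimated by the model, $\bar{y}$ the mean of the real values and $\bar{\hat{y}}$ the mean of the estimated values.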
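As an illustration, the two metrics can be computed with the short Python sketch below. The function names are hypothetical, and the measured and estimated values are invented for the example; they are not results from this study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum((y_i - yhat_i)^2) / N)."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def correlation(actual, predicted):
    """Coefficient of correlation R between real and estimated values."""
    n = len(actual)
    mean_y = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(actual, predicted))
    ss_y = sum((y - mean_y) ** 2 for y in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(ss_y * ss_p)


# Hypothetical crack widths (illustrative values only).
measured = [0.10, 0.20, 0.30, 0.40]
estimated = [0.12, 0.18, 0.33, 0.37]

print(f"RMSE = {rmse(measured, estimated):.4f}")
print(f"R    = {correlation(measured, estimated):.4f}")
```

A lower RMSE and an R closer to 1 both indicate a better agreement between estimated and real values; the two metrics are complementary because R is insensitive to a constant offset or scaling of the estimates, whereas RMSE is not.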

DATA USED IN DATA MINING
The database used in this work was presented by BILIR et al. [1] and is composed of 358 records collected from free shrinkage tests, which determine the length changes, and from ring tests, which determine the restrained shrinkage cracks. These authors developed a model to predict the crack width (CW) using this database and an adaptive-network-based fuzzy inference system (ANFIS).
Tables 1 and 2 present some statistical data of the parameters used in the analyses. Comparing both tables, it is possible to observe that there are no significant differences between the coefficients of variation of the training and testing data. The coefficients of variation of RR and CW are very similar. The same happens between the coefficients of variation of DT and FSH. This means that the RR-CW and DT-FSH pairs present similar variability.

RESULTS AND DISCUSSION
With the databases built, the predictive models were trained to forecast the drying shrinkage crack widths obtained from the ring test (CW).
The data mining models were tested using a single input variable and combinations of two or three variables (RR, DT and FSH). The mean values of the root mean squared error (RMSE) and the coefficient of correlation (R) obtained during the training process are presented in Tables 3 and 4. From Tables 3 and 4 it can be seen that the best result using only one input parameter was obtained with the ANN and the RR input. The best model with two input variables is the SVM with the RR and DT inputs. ANN is the best model when all the input parameters are used.
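The input combinations tested for each algorithm (three single inputs, three pairs and the full set of three) can be enumerated with a short Python sketch. The helper name `input_subsets` is hypothetical, and the sketch only lists the combinations; it does not reproduce the model fitting.

```python
from itertools import combinations

INPUTS = ("RR", "DT", "FSH")


def input_subsets(names):
    """Return every non-empty subset of the input names, smallest first."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


for subset in input_subsets(INPUTS):
    print(len(subset), subset)
```

Each of the seven subsets would then be passed to every algorithm and scored with the cross-validated RMSE and R.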
Among all the combinations and models, the best performance was obtained with the SVM model using RR and DT as input variables. This model was fitted with the whole training set and the result is presented graphically in Fig. 5, which also shows the values obtained with the testing set. Figures 6 and 7 present the results for the best models with three and one input parameters, respectively. It must be highlighted that the testing data were not used in generating the models. Figures 5 and 6 confirm the good predictive capacity of the best models that use two or three input parameters. Figure 7 shows a set of scattered points above the 45-degree line and some sets of points arranged in steps. This confirms the poor predictive capacity of the models with only one input variable.
The model developed by BILIR et al. [1] presented a good performance. The values of R and RMSE extracted from Figures 5 to 7 and those obtained with the BILIR et al. [1] model are summarized in Table 5. The values presented in this table confirm that the SVM model with the RR and DT input parameters has the best predictive capacity. Furthermore, the best models developed in this study with two or three input parameters perform better than the model developed by BILIR et al. [1].

CONCLUSIONS
Data mining techniques have the capacity to learn from examples. In this study several data mining techniques were used to predict drying shrinkage crack widths of mortars containing GBFS as fine aggregate. The cross-validation scheme indicates that, with one or three input parameters, the ANN models provide the best prediction results. Among all the developed models, the SVM using only two input parameters (RR and DT) leads to the best results. Furthermore, this model and the ANN with three input variables perform better than the ANFIS model developed by BILIR et al. [1].

ACKNOWLEDGMENTS
This work was partly financed by FEDER funds through the Competitivity Factors Operational Programme - COMPETE and by national funds through FCT - Foundation for Science and Technology within the scope of the project POCI-01-0145-FEDER-007633.