Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study

ABSTRACT CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. CONCLUSION: Most of the predictive models produced similar results, and demonstrated the feasibility of identifying the individuals with the highest probability of having undiagnosed diabetes through easily obtained clinical data.


INTRODUCTION
Type 2 diabetes is a chronic disease characterized by the body's inability to efficiently metabolize glucose, which increases blood glucose levels and leads to hyperglycemia. This condition is associated with a wide range of serious health complications affecting the renal, neurological, cardiac and vascular systems, and it has a major impact on overall health and healthcare costs. 1 Recent studies have estimated that around 415 million people have diabetes, and that the number of cases may increase to 642 million by 2040. In addition, approximately half of these individuals are not aware of their condition, which may further intensify the negative consequences of the disease. 4 It is believed that diabetes, like other noncommunicable chronic diseases, is mainly caused by behavioral factors such as poor diet and physical inactivity. Early interventions aimed towards creating lifestyle changes, with or without associated pharmacological therapies, have been proven effective in delaying or preventing type 2 diabetes and its complications. This has led many countries to invest in national programs to prevent this disease. To reduce costs and amplify the results, population-level interventions need to be combined with interventions directed towards individuals who are at high risk of developing or already having diabetes, 5 so as to focus interventions, at the individual patient level, on those for whom they are most appropriate.
Several predictive rules have been published for identifying individuals with undiagnosed diabetes or at high risk of developing it. 6-9 However, few of these rules have been drawn up using the most recently developed machine-learning techniques, which potentially have the ability to produce algorithms of greater predictive ability than those developed through the technique most commonly used to date, i.e. multiple logistic regression.

OBJECTIVE
This paper presents the development and comparison of predictive models created from different machine-learning techniques with the aim of detecting undiagnosed type 2 diabetes, using baseline data from the Longitudinal Study of Adult Health (ELSA-Brasil).

METHODS
These analyses were performed on data from the baseline survey (2008-2010) of ELSA-Brasil, a multicenter cohort study that had the main aim of investigating multiple factors relating to adult health conditions, including diabetes and cardiovascular diseases. The study enrolled 15,105 public servants aged between 35 and 74, at six public higher-education and research institutions in different regions of Brazil between 2008 and 2010, as has been previously reported in greater detail. 10,11 The institutional review boards of the six institutions at which the study was conducted gave their approval, and written informed consent was obtained from all participants.
All analyses were performed using R version 3.2.3. The source code used in the analysis is freely available.

Dataset and preliminary variable selection
Data from the ELSA study baseline were used to create the predictive models. At this baseline, a broad set of data was collected for each participant, as described previously. 10,11 A total of 1,473 participants were excluded from the present analyses because they had self-reported diabetes. Another three participants were excluded because some information required for characterizing undiagnosed diabetes was missing. An additional 1,182 participants (8.7%) were excluded from the analyses because data relating to other variables were missing. Among the remaining 12,447 participants, 1,359 (11.0%) had undiagnosed diabetes.
Undiagnosed diabetes was considered present when, in the absence of a self-report of diabetes or use of anti-diabetic medication, participants had fasting glucose levels ≥ 126 mg/dl, glucose levels ≥ 200 mg/dl two hours after a standard 75 g glucose load or had glycated hemoglobin (HbA1c) ≥ 6.5%.
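In code, these diagnostic criteria amount to a simple predicate. The sketch below is a Python illustration only (the study's analyses were done in R), and the parameter names are hypothetical:

```python
def undiagnosed_diabetes(fasting_glucose_mgdl, glucose_2h_mgdl, hba1c_pct,
                         self_reported_diabetes, uses_antidiabetic_medication):
    """Flag undiagnosed diabetes per the criteria above: no self-report
    and no anti-diabetic medication, but at least one laboratory value
    at or above its diagnostic threshold."""
    if self_reported_diabetes or uses_antidiabetic_medication:
        return False  # diabetes is already known or treated
    return (fasting_glucose_mgdl >= 126      # fasting glucose
            or glucose_2h_mgdl >= 200        # 2 h after 75 g glucose load
            or hba1c_pct >= 6.5)             # glycated hemoglobin
```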
Through procuring variables in the ELSA dataset that were similar to those investigated in previously published predictive models for detecting diabetes or in situations of high risk of developing diabetes, we selected 27 diabetes risk factors for analysis. Any variables that implied additional costs beyond those of filling out a questionnaire and performing basic anthropometry, such as clinical or laboratory tests, were excluded so that the model obtained could be applied using a straightforward survey and simple anthropometric measurements. The final variable subset was validated by experts, and this resulted in the subset of 27 candidate variables described in Table 1 and Table 2.

Machine-learning algorithms
The machine-learning algorithms are briefly described below: Logistic regression 12 is a well-established classification technique that is widely used in epidemiological studies. It is generally used as a reference, in comparison with other techniques for analyzing medical data.
Multilayer perceptron/backpropagation 13 is the principal artificial neural network algorithm. When there is no hidden layer on the network, this algorithm is equivalent to logistic regression, but it can solve more difficult problems with more complex network architectures. The price of using complex architectures is that it produces models that are more difficult to interpret. Additionally, it can be computationally expensive.
Naïve Bayes classifier 14 is a type of Bayesian network that, despite its enormous simplicity, is able to create models with high predictive power. The algorithm works well with heterogeneous data types and also with missing values, because each predictor variable is treated independently during model construction.

Random forest 16 is a machine-learning algorithm from the "ensemble" family of algorithms, 17 which creates multiple models (called weak learners) and combines them to make a decision, in order to increase prediction accuracy. The main idea of this technique is to build a "forest" of random decision "trees" and use them to classify a new case. Each tree is generated using a random subset of the candidate predictor variables and a random subset of the data, generated by means of bootstrap sampling. This algorithm can also be used to estimate variable relevance.

Standardization of numerical variables
Transformation between different data types (categorical or numerical) was performed by means of binarization or discretization, when necessary. In binarization, a categorical variable with n levels is transformed into n - 1 dummy variables, each of which takes the value "1" when the case belongs to the level represented by that dummy variable, or "0" otherwise.

In discretization, a numerical variable is transformed into a categorical variable by defining a set of cutoff points for that variable, such that the ranges of values between the cutoff points correspond to the levels of the categorical variable. The Ameva algorithm 18 was used to find the best cutoff points for each numerical variable.
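As an illustration of these two conversions, the following Python sketch shows dummy coding and cutoff-based discretization. The helper names and example values are hypothetical; the study itself used the Ameva algorithm to choose the cutoff points, which is not reproduced here:

```python
from bisect import bisect_right

def binarize(value, levels):
    """One categorical variable with n levels -> n - 1 dummy variables;
    the first level is the reference and is encoded as all zeros."""
    return [1 if value == lvl else 0 for lvl in levels[1:]]

def discretize(value, cutoffs):
    """Map a numerical value to the index of the interval defined by
    a sorted list of cutoff points (however those cutoffs were chosen)."""
    return bisect_right(cutoffs, value)

# e.g. a smoking variable with levels never/former/current -> two dummies
assert binarize("former", ["never", "former", "current"]) == [1, 0]
# e.g. BMI with cutoffs at 25 and 30 -> three ordinal levels 0, 1, 2
assert discretize(27.5, [25, 30]) == 1
```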

General process of model construction and evaluation
The models were built, evaluated and compared in four sequential steps: parameter tuning, automatic variable selection, error estimation and generalization testing (Figure 1). The third step ("Error Estimation" box in Figure 1) used cross-validation to obtain more reliable estimates of the performance of the different learning schemes, using the best settings and variable subsets obtained in the preceding steps.
Finally, the last step ("Generalization Testing" box in Figure 1) evaluated models using only the learning scheme that obtained the best performance for each algorithm in the test dataset that had not previously been used.
The following sections describe each step in more detail.

Parameter tuning
This first step in model building evaluated each algorithm with different parameter configurations, to find out which configuration produced the best results for each algorithm and data type conversion used. The parameters tested for each algorithm are listed in Table 3. Moreover, a set of different classification cutoffs (predefined by the analyst) was evaluated, to find out which produced the best classification on average, across the 30 models, in terms of balanced accuracy.

After that, the results were analyzed and, when necessary, new parameter values and/or cutoff points were selected for new tests. In such cases, the new values were chosen around the values that had produced the best results up to that point.
The idea was to start by testing a sparse range of values and then progressively refine the search around promising values, in order to avoid testing values that were very likely to produce poor results. This search stopped when there was no further increase in the predictive power of the models created using the specific machine-learning algorithm and data type conversion being evaluated.

Automatic variable selection
The automatic variable selection step had the aim of finding subsets from the 27 candidate variables that could increase the performance of the predictive models, compared with other models created using different sets of candidate variables.
These subsets of variables were generated using the wrapper strategy. 19 In this strategy, models are created and evaluated by means of a machine-learning algorithm and a validation method, such as cross-validation, using different subsets of variables. The subset from which the best performance is achieved, in terms of a criterion such as AUC, is chosen as the best subset.
Because of the large number of possible subsets, a heuristic search was used to generate the variable subset candidates that were more likely to create better models, thereby optimizing the process.
The main advantage of this method compared with other strategies is that it evaluates multiple variables together. The drawback is that, because it depends on a machine-learning algorithm to create and evaluate models, the subset of variables that produces the best results with one algorithm may produce poor results with another algorithm, or with another parameter setting for the same algorithm.
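The forward-selection wrapper described above can be sketched as a greedy loop. This is a Python illustration with hypothetical names, not the study's R code; `evaluate` stands in for fitting a model on a candidate subset and returning its cross-validated mean AUC:

```python
def forward_selection(variables, evaluate):
    """Greedy wrapper search: at each round, add the candidate variable
    whose inclusion gives the best evaluation score (e.g. mean
    cross-validated AUC); stop when no addition improves the score."""
    selected, best_score = [], float("-inf")
    remaining = list(variables)
    while remaining:
        scored = [(evaluate(selected + [v]), v) for v in remaining]
        score, best_var = max(scored)
        if score <= best_score:
            break  # no remaining candidate improves the current subset
        best_score = score
        selected.append(best_var)
        remaining.remove(best_var)
    return selected, best_score
```

Because the search is greedy, it explores only a small fraction of the 2^27 possible subsets, which is what makes the wrapper strategy tractable.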

Error estimation
The error estimation step evaluated each machine-learning algorithm using the parameters obtained in the first step and the subsets generated in the second step, in addition to the original variable subset containing all the candidate variables.
This step also served to evaluate the use of discretization. The evaluation was done through tenfold cross-validation, which was repeated ten times to get more reliable prediction performance estimates.

Generalization testing
Finally, one model was generated from the training/validation dataset for each algorithm, using the best results from the preceding step. These best models were then evaluated (hold-out evaluation) on the test set, since this generalization testing has the aim of evaluating model behavior when faced with data that were not used in its creation. The results from this evaluation serve as a quality measurement for these models.

Development of an equation for application of the results
The model with the best results from generalization testing was used to create a web tool for applying the questionnaire in practice.
The prediction from the logistic regression model for any given individual is calculated by multiplying that individual's value for each variable in the model by the coefficient derived from the model for that variable, summing the results and transforming this sum into a probability of undiagnosed diabetes using the logistic function. If this probability is above the predetermined cutoff (here, 11%), the individual is classified as positive (at high risk of undiagnosed diabetes); otherwise, as negative.
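This prediction rule can be written out directly. The minimal Python sketch below is illustrative only: the actual coefficients and intercept are those reported in Table 8, and are not reproduced here:

```python
import math

def predict_logistic(coefficients, intercept, values, cutoff=0.11):
    """Linear predictor = intercept + sum of coefficient * value;
    the logistic function maps it to a probability, which is then
    compared with the classification cutoff (11% in the final model)."""
    linear = intercept + sum(c * v for c, v in zip(coefficients, values))
    probability = 1.0 / (1.0 + math.exp(-linear))
    return probability, probability > cutoff
```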

Study sample
Among the 12,447 ELSA participants included in this study, 1,359 (11.0%) had undiagnosed diabetes; further characteristics of the sample are described in Table 1 and Table 2.

Results from automatic variable selection
The first column of Table 5 shows the identifier name of each variable subset, the second column presents the AUC achieved by the subset chosen for each algorithm, the third shows the number of variables in each subset and the fourth presents the variable names.
The dataset partitions used for this step were the same as those used in the parameter tuning step. Thus, it is possible to gain an insight into the performance improvement, in terms of AUC, from using a variable subset instead of all the variables in the original dataset. Furthermore, the mere fact that a smaller subset was used to create the models is already an advantage, because this makes the model and its application much simpler.
Because of the nature of the wrapper strategy, it can be expected that each machine-learning algorithm will present better results when using the variable subset created by the algorithm itself.However, in the next step all the subsets were tested with all the algorithms.

Results from error estimation
The aim of this step was to obtain more reliable error estimates regarding algorithm performance.For this reason, 10 repetitions were used instead of 3, for the repeated tenfold cross-validation, thus generating 100 models instead of 30 for each test.
The machine-learning algorithms were tested using the best parameters found in the first step (depicted in Table 4), with the variable subsets generated in the second step (described in Table 5), as well as with the original set of variables.Performance was tested with and without discretization.
Table 6 describes the best results obtained for each machine-learning algorithm, variable subset and data conversion used.
Respectively, the columns represent the name of machine-learning algorithm used; data type conversion; variable subset; AUC mean, standard deviation (SD) and first and third quartiles achieved in cross-validation; and mean and standard deviation of the balanced accuracy (BA).
As in the results from the parameter tuning step, the artificial neural network algorithm and logistic regression achieved the best results. Without data conversion, these algorithms produced similar models, with AUCs of 75.45% (row 1) and 75.44% (row 4), respectively, each using the variable subset generated with its own algorithm, as expected. K-nearest neighbor and naïve Bayes also reached good results, with AUCs close to 75%. The best results with the naïve Bayes classifier were obtained using a subset of variables other than nb-fs. This was possible because the variable subset search with this algorithm used discretized data, following the best results from parameter tuning, while the best result in the current phase was obtained without variable transformation.
Finally, as in the parameter tuning step, random forest produced the worst results. Independent of the subset of variables, this algorithm performed worse in terms of mean AUC.
Table 6 also shows the impact of using a specific variable subset, compared with the best results obtained from the models generated using the original variable set. This difference is very small: around 0.25% better using the variable subset instead of all the original variables, for the artificial neural network models.
The results obtained with a subset of variables were slightly better (around 0.5%) than the original with logistic regression and K-nearest neighbor models.The best naïve Bayes classifier model result from using a variable subset was more than 1% better than the best result from using all the variables.Finally, random forest models produced the best results using all of the available variables.

Results from generalization testing
In generalization testing, the best learning scheme (which includes data type conversion used, parameter setting, classification cutoff and variable subset) found for each algorithm in the preceding step was evaluated in the test dataset, which had been separated at the beginning of the process and had not been used until this step.
Table 7 shows the best results obtained in the error estimation phase together with the results obtained in generalization testing.
All the algorithms maintained good performance in generalization testing. The biggest loss of performance in relation to the error estimation step, as assessed from changes in the AUC, was 1.64%, for the K-nearest neighbor algorithm. The artificial neural network, logistic regression and naïve Bayes had performance losses of 1.30%, 1.03% and 0.80%, respectively. The smallest loss in generalization testing was 0.458%, achieved by the random forest algorithm, which nonetheless produced the worst performance in terms of AUC of all the algorithms. Even so, the worst result was an AUC of 72.35%.
Since the best result from this step in terms of AUC (74.41%) was obtained using logistic regression, and given the easy interpretation and applicability of this model, logistic regression was chosen to be used to create the diabetes risk assessment tool.

Web tool proposed for detecting undiagnosed diabetes
Finally, the model generated using the logistic regression algorithm in the generalization test was selected to build a web tool for detecting undiagnosed diabetes. This model produced sensitivity of 68.0% and specificity of 67.2%. The prototype interface of the tool is shown in Figure 2. Since the model was constructed, and probably would be used, in Brazil, the tool was created in Portuguese.
The final coefficients of the equation generated are described in Table 8.
New cases can be classified using this model as follows: 1. Standardize the value of the only numerical variable (a_rcq) by subtracting the training mean (0.8889311) from the observed value.

The characteristics that vary across studies include the definition of the target variable, model objectives and candidate variables, among others. These models are generally constructed using conventional statistical techniques such as logistic regression and Cox regression. Systematic reviews 5,16,24-26 present several such studies: some, like ours, have focused on predicting undiagnosed diabetes, while others have focused on individuals at high risk of developing incident diabetes.
Use of machine-learning techniques is still new in this field. 27-29 The main studies have compared the results obtained through using a specific technique with the results obtained through logistic regression. One report 30 described creation of pre-diabetes risk models using an artificial neural network and support-vector machines that were applied to data from 4,685 participants in the Korean National Health and Nutrition Examination Survey (KNHANES), collected between 2010 and 2011. In comparison with results 31 from logistic regression on the same dataset, the models created using support-vector machines and an artificial neural network produced slightly better results. Two other reports 32,33 also compared artificial neural networks with logistic regression for creating predictive diabetes models. In the first, models created using artificial neural networks on data from 8,640 rural Chinese adults (760 of them with diabetes) produced better results (AUC = 89.1% ± 1.5%) than models created using logistic regression (AUC = 74.4% ± 2.1%). In the second, a radial basis function artificial neural network that was applied to data from 200 people (100 cases with diabetes and 100 with pre-diabetes) at 17 rural healthcare centers in the municipality of Kermanshah, Iran, showed better results than logistic regression and discriminant analysis, for identifying those with diabetes. Another study 34 comparing diabetes models created using data from 2,955 women and 2,915 men in the Korean Health and Genome Epidemiology Study (KHGES) showed similar results from logistic regression and naïve Bayes, although naïve Bayes showed better results with unbalanced datasets. Finally, another study 35 used data from 6,647 participants (with 729 positive cases) in the Tehran Lipid and Glucose Study (TLGS) and created models with decision trees reaching 31.1% sensitivity and 97.9% specificity (balanced accuracy was around 64.5%), 36 for detecting increased blood glucose levels.
In summary, use of machine-learning techniques may prove to be a viable alternative for building predictive diabetes models, often with good results, but rarely with notably superior results, compared with the conventional statistical technique of logistic regression.

CONCLUSION
Comparison between the different techniques showed that all of them produced quite similar results from the same dataset, thus demonstrating the feasibility of detecting undiagnosed diabetes through easily obtained clinical data. The predictive algorithm chosen for practical use was a logistic regression equation based only on self-reported information and simple anthropometry.

DISCUSSION
We created predictive models for detecting undiagnosed diabetes using data from the ELSA study with different machine-learning algorithms.The best results were achieved through both an artificial neural network and logistic regression, with no relevant difference between them.
Generally, most of the algorithms used achieved mean AUCs greater than 70%. The best algorithm (logistic regression) produced an AUC of 74.4%. Since these test dataset values are superior to the AUCs of several other scores that were previously validated in other populations, 20 this score shows potential for use in practice.
The generalization testing showed the results that can be expected from asking a population similar to that of ELSA some simple questions. Out of the 403 individuals in the testing dataset who had diabetes and did not know about their condition, 274 were identified as positive cases (68.0% sensitivity) using the model generated through the logistic regression algorithm. The web tool prototype for detecting undiagnosed diabetes could be refined for use in Brazil.
Many groups have published predictive models for detecting undiagnosed diabetes. 19-23 Although several groups have reported AUCs above 0.80, these values generally fall to below 0.70 when tested on independent samples. 20 Differences in predictive power across studies can be ascribed to different characteristics of the datasets used, and to the different techniques and methods used for building and evaluating the models. In the present study, the predictive algorithm that achieved the highest AUC (0.74) when tested on a part of the sample not used for its derivation, based only on self-reported information from participants in ELSA-Brasil, was a logistic regression equation. However, the machine-learning techniques of artificial neural network, naïve Bayes, K-nearest neighbor and random forest all produced AUCs that were similar or slightly smaller.

K-nearest neighbor (instance-based learning) 15 is a classical instance-based learning algorithm in which a new case is classified according to the known classes of its nearest neighbors, by means of a majority vote. This type of algorithm is also called lazy learning because there is no model-building step: the entire computation (i.e. the search for the nearest neighbors) is performed directly during prediction. All the cases (the training/validation dataset) need to be available during prediction.
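A minimal sketch of the idea follows, in Python for illustration. The value of k and the Euclidean distance metric here are assumptions for the example; the parameters actually tuned in the study are those listed in Table 4:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs. Classify `query`
    by a majority vote among the k training cases closest in Euclidean
    distance; no model is built beforehand (lazy learning)."""
    neighbors = sorted(train,
                       key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```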

The first step ("Parameter Tuning" box in Figure 1) evaluated each machine-learning algorithm with different sets of configurable parameters by means of tenfold cross-validation, repeated three times. In tenfold cross-validation, the dataset (training/validation) is divided into ten parts, of which nine are used for training (selecting) a model and the tenth for validating it. This process is repeated, varying the part of the dataset used for validation each time, to calculate validation measurements such as AUC. Finally, the mean of the validation measurements across repeats is calculated. The results from this step ("Best Parameters" item in Figure 1), containing the best parameters and classification cutoffs for each algorithm, were used in the next steps. The second step ("Automatic Variable Selection" box in Figure 1) generated four different variable subsets using different algorithms and cross-validation (with only the best settings found in the preceding step), applying the wrapper strategy with a forward selection search for automatic variable selection. The best variable subsets found in this step ("Best Variable Subsets" item in Figure 1) were used in the next steps.
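The repeated tenfold cross-validation procedure can be sketched as follows, in Python for illustration; the generic `evaluate` callback stands in for training a model on the training folds and measuring it (e.g. its AUC) on the validation fold:

```python
import random

def repeated_kfold_cv(data, evaluate, k=10, repeats=3, seed=0):
    """Repeated k-fold cross-validation: in each repeat, shuffle the
    data, split it into k folds, train on k - 1 folds and validate on
    the held-out fold; return the mean validation measurement across
    all k * repeats models (30 models for k=10, repeats=3)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        cases = data[:]
        rng.shuffle(cases)
        folds = [cases[i::k] for i in range(k)]
        for i in range(k):
            validation = folds[i]
            training = [c for j, fold in enumerate(folds) if j != i
                        for c in fold]
            scores.append(evaluate(training, validation))
    return sum(scores) / len(scores)
```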

Figure 1. General process of model construction and evaluation.

Four machine-learning algorithms were used in this step: logistic regression, artificial neural network, K-nearest neighbor and naïve Bayes classifier. The random forest algorithm was not included because it already performs embedded variable selection. The forward selection search strategy was used because it tends to generate smaller subsets. The same validation technique (tenfold cross-validation repeated three times), decision criterion (mean AUC) and dataset partition that had been used in the parameter tuning step were used again in this step, and the best parameter settings obtained in the parameter tuning step were used to configure the algorithms. Each machine-learning technique generated a distinct subset of variables, and the subsets thus generated were used in the next step.

Figure 2. Prototype of a web interface for the risk equation.

Table 2. Categorical variables, including the target variable "a_dm".

A separate partition of the data (the test dataset) was used for generalization testing. The models created were evaluated in terms of area under the receiver operating characteristic curve (AUC), sensitivity, specificity and balanced accuracy (the arithmetic mean of sensitivity and specificity). The machine-learning algorithm families used were logistic regression, artificial neural network (multilayer perceptron/backpropagation), Bayesian network (naïve Bayes classifier), instance-based learning (K-nearest neighbor) and ensemble (random forest).
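The balanced accuracy used as an evaluation measure can be computed directly from the confusion-matrix counts. A small Python sketch, for illustration:

```python
def balanced_accuracy(labels, predictions):
    """Balanced accuracy = (sensitivity + specificity) / 2, which is
    more informative than plain accuracy under the class imbalance
    seen here (roughly 11% positive cases)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return (sensitivity + specificity) / 2
```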

Table 3. Parameters analyzed in parameter tuning.

Table 5. Variable subsets generated in automatic variable selection.

Among the 12,447 ELSA participants included in this study, 2,609 (20.96%) were obese. Using the World Health Organization definition (fasting glucose ≥ 110 mg/dl and/or 2-hour post-load glucose ≥ 140 mg/dl), 5,539 (44.5%) presented intermediate hyperglycemia. Other details about the study sample can be found in Table 1 and Table 2.

Table 8. Coefficients from the logistic regression model.