SciELO - Scientific Electronic Library Online



Sao Paulo Medical Journal

Print version ISSN 1516-3180
On-line version ISSN 1806-9460


OLIVERA, André Rodrigues et al. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study. Sao Paulo Med. J. [online]. 2017, vol.135, n.3, pp.234-246. ISSN 1516-3180.


Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task.


Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.


After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.
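The resampling and wrapper steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `repeated_kfold`, `forward_selection` and `toy_score` are hypothetical, and the stopping rule (stop when no candidate improves the score) is an assumption, since the abstract does not state one. In a real run, the scoring function would fit one of the listed algorithms on each training fold and return the mean cross-validated performance for the given subset of variables.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train, test) index lists for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def forward_selection(candidates, score_subset):
    """Greedy wrapper: repeatedly add the variable that most improves the score.

    `score_subset` is any callable mapping a list of variable names to a
    number (higher is better). Assumed stopping rule: stop at the first
    round in which no remaining candidate improves the best score so far.
    """
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(score_subset(selected + [v]), v) for v in remaining]
        top_score, top_var = max(scored, key=lambda t: t[0])
        if top_score <= best:
            break
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best


def toy_score(subset):
    """Stand-in scorer (hypothetical): rewards 'a' and 'b', penalizes the rest."""
    useful = {"a", "b"}
    hits = len(useful & set(subset))
    noise = len(set(subset) - useful)
    return hits - 0.1 * noise


if __name__ == "__main__":
    # Tenfold cross-validation repeated three times gives 30 train/test splits.
    print(sum(1 for _ in repeated_kfold(100, k=10, repeats=3)))
    print(forward_selection(["a", "b", "c"], toy_score))
```

Changing `repeats` from 3 to 10 gives the resampling scheme described for the error estimation step; the generalization test would then score the final model once on data that took no part in any of these splits.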


The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of 75.24% and 74.98%, respectively, in the error estimation step and 74.17% and 74.41% in the generalization testing step.
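For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the example data are illustrative assumptions, and the abstract does not say which implementation the authors used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` holds 0/1 outcomes and `scores` the predicted probabilities.
    This is quadratic in the number of cases, so it suits small examples only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up data: three of the four positive/negative pairs are ordered correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A mean AUC such as those reported above would be the average of this quantity over the held-out folds of the repeated cross-validation.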


Most of the predictive models produced similar results and demonstrated the feasibility of identifying the individuals with the highest probability of having undiagnosed diabetes using easily obtained clinical data.

Keywords: Supervised machine learning; Decision support techniques; Data mining; Models, statistical; Diabetes mellitus, type 2.
