Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study

Olivera, André Rodrigues; Roesler, Valter; Iochpe, Cirano; Schmidt, Maria Inês; Vigo, Álvaro; Barreto, Sandhi Maria; Duncan, Bruce Bartholow

doi:10.1590/1516-3180.2016.0309010217

Sao Paulo Medical Journal

ORIGINAL ARTICLE • Sao Paulo Med. J. 135 (03) • May-Jun 2017 • https://doi.org/10.1590/1516-3180.2016.0309010217

Comparação de algoritmos de aprendizagem de máquina para construir um modelo preditivo para detecção de diabetes não diagnosticada - ELSA-Brasil: estudo de acurácia

ABSTRACT

CONTEXT AND OBJECTIVE:

Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task.

DESIGN AND SETTING:

Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.

METHODS:

After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.
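The four steps above can be sketched with scikit-learn. This is only an illustrative outline under stated assumptions, not the authors' implementation: the data are synthetic (generated with `make_classification`), only logistic regression is shown, the folds are stratified, and the tuning grid, the number of variables to keep, and the 70/30 split are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for the study data (assumption): 27 candidate
# variables and an imbalanced binary outcome.
X, y = make_classification(
    n_samples=600, n_features=27, n_informative=6, weights=[0.85], random_state=0
)

# Set aside an independent dataset for step (iv); the 30% size is an assumption.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Tenfold cross-validation repeated 3 times and 10 times, respectively.
cv_3x10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
cv_10x10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

# (i) Parameter tuning; the grid over C is a hypothetical choice.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=cv_3x10,
)
search.fit(X_dev, y_dev)
model = search.best_estimator_

# (ii) Forward selection as a wrapper: each candidate subset is scored by
# cross-validated AUC. Keeping 5 variables is an arbitrary stopping rule.
selector = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction="forward",
    scoring="roc_auc",
    cv=cv_3x10,
)
selector.fit(X_dev, y_dev)
selected = selector.get_support(indices=True)

# (iii) Error estimation on the selected variables.
cv_auc = cross_val_score(
    model, X_dev[:, selected], y_dev, scoring="roc_auc", cv=cv_10x10
)

# (iv) Generalization test on the held-out data.
model.fit(X_dev[:, selected], y_dev)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test[:, selected])[:, 1])

print(f"selected columns: {selected.tolist()}")
print(f"cross-validated AUC: {cv_auc.mean():.3f} (sd {cv_auc.std():.3f})")
print(f"held-out AUC: {test_auc:.3f}")
```

Comparing algorithms would amount to repeating the same steps with a different estimator in place of `LogisticRegression`, keeping the cross-validation objects fixed so that every model sees the same folds.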

RESULTS:

The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the receiver operating characteristic curve (AUC) of 75.24% and 74.98%, respectively, in the error estimation step, and 74.17% and 74.41%, respectively, in the generalization testing step.
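To make the metric concrete: the area under the ROC curve, the usual reading of "area under the curve" for a classifier, equals the proportion of (case, non-case) pairs in which the case receives the higher predicted score, with ties counted as half. The sketch below computes it that way; the function name `roc_auc` and the toy labels and scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 cases and 3 non-cases give 9 pairs,
# of which 8 are ordered correctly.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

On this reading, an AUC near 75% means that a randomly chosen individual with undiagnosed diabetes receives a higher predicted risk than a randomly chosen individual without it about three times out of four.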

CONCLUSION:

Most of the predictive models produced similar results and demonstrated the feasibility of identifying the individuals with the highest probability of having undiagnosed diabetes using easily obtained clinical data.

KEY WORDS:
Supervised machine learning; Decision support techniques; Data mining; Models, statistical; Diabetes mellitus, type 2

Associação Paulista de Medicina - APM / Publicações Científicas, Av. Brigadeiro Luís Antonio, 278 - 7º and., 01318-901 São Paulo SP - Brazil, Tel.: +55 11 3188-4310 / 3188-4311, Fax: +55 11 3188-4255
E-mail: revistas@apm.org.br

[1] Address for correspondence: Bruce Bartholow Duncan, Programa de Pós-Graduação em Epidemiologia e Hospital de Clínicas, Universidade Federal do Rio Grande do Sul (UFRGS), Rua Ramiro Barcelos, 2.600/414, Porto Alegre (RS), Brasil, CEP 90035-003. E-mail: bbduncan@ufrgs.br