1 INTRODUCTION
Cardiovascular diseases, including cerebrovascular and ischemic heart diseases, are major causes of death worldwide and the main cause of death in Brazil. In particular, Acute Coronary Syndromes (ACSs) are prominent in maintaining a high mortality rate despite recent therapeutic advances. These syndromes are characterized by total or partial occlusion of the coronary artery. This leads to ischemia and/or necrosis of the myocardial area irrigated by the coronary artery, following the rupture of an unstable coronary plaque. ACSs include acute myocardial infarction (with and without ST-segment elevation) and unstable angina.
ACSs may result from the interaction of environmental, clinical, genetic, and socio-cultural factors. To obtain a reliable and effective clinical prognosis of patients with an ACS it is thus vital to identify the most important variables. This is also a critical step in developing medical decision-supporting tools associated with clinical and laboratory procedures in order to reduce the mortality rate and financial costs.
In this multi-factorial causal context, non-linear modelling methods have the required flexibility to construct classifiers with good predictive performance. Artificial Neural Network (ANN) (^{Bishop, 1995}) and Support Vector Machine (SVM) (^{Boser et al., 1992}; ^{Cortes & Vapnik, 1995}) are well-established examples of these types of models. They have been used in several studies for diagnosis and prognosis of coronary heart diseases; see e.g. (^{Uğuz, 2012}); (^{Khemphila & Boonjing, 2011}); (^{Sengur, 2012}); (^{Çomak & Arslan, 2012}); (^{Kohli & Verma, 2011}). Comparative analyses between the predictive power of both models in this domain have also been published and are briefly reviewed below since they are the focus of this work.
(^{Berikol et al., 2016}) tested the accuracy of four different classifiers (SVM, ANN, Naive Bayes, logistic regression) for ACS diagnosis using a data set with 228 patients (99 with ACS and 129 without ACS) and 8 variables. The SVM presented higher accuracy (99%) than ANN (90%), Naive Bayes (89%) and logistic regression (91%). (^{Kumari & Godara, 2011}) compared four classification techniques (RIPPER, Decision Tree, ANN, and SVM) in terms of their capability to predict the diagnosis of cardiovascular disease in general. They used a data set with 296 patients and 14 input variables. No preprocessing for input variable selection was performed in this study, and the results indicated that the SVM model was superior. (^{Xing et al., 2007}) assessed the performance of Decision Tree, ANN and SVM to predict the 6-month survival of patients with coronary heart disease using a data set with 1,000 individuals. The results regarding the accuracy of the classifiers employed were very similar: 92.1% for SVM, 91.0% for ANN, and 89.6% for the Decision Tree.
(^{Hannan et al., 2010}) concluded that ANN and SVM perform worse than do medical decisions regarding the prescription of heart disease medication. In (^{Gudadhe et al., 2010}), three-layer ANNs, which were trained using the back-propagation algorithm, outperformed SVM in diagnosing heart diseases. (^{Çomak et al., 2007}) developed an SVM model to classify the Doppler signals of aortic and mitral valves as either normal or abnormal. The input signals of 215 individuals were preprocessed using wavelet decomposition and short-time Fourier transformation techniques. The performance of the SVM algorithm was compared to that of a previous study using an ANN (^{Turkoglu et al., 2002}). The results indicated the superiority of the ANN in terms of sensitivity and specificity. However the authors recommended using the SVM model because of the shorter training times and greater stability in converging to the solution.
Despite it being well-known that the main drivers underlying the cardiovascular diseases can substantially vary from country to country there is a lack of similar comparative studies focused particularly on the Brazilian population. To the best of our knowledge, this is not restricted to the public health and medical settings but is also pervasive over the whole spectrum of Brazilian challenges. This happens regardless of the existence of an extensive literature in Soft Operational Research that highlights the cross fertilisation benefits between different methodologies. For a work that explores the use of multi-methodologies for understanding a real-world process of a Brazilian hospital, see (^{Pessôa et al., 2015}).
We have also noted that many works have recently explored the synergistic links between Operational Research (OR) and Artificial Intelligence (AI) (^{Holsapple et al., 1994}; ^{Brown & White, 2012}; ^{Gomes, 2001}). In particular, the interplay between OR and AI with regard to decision support systems and optimization are discussed in (^{Wojewnik & Kuszewski, 2010}) and (^{Bennett, 2006}), respectively. For an interesting study that compares the capability of ANN, SVM and genetic algorithm to predict the Brazilian Power Quality, see (^{Góes et al., 2015}).
In this study we aim to reduce this gap in applied health studies and to explore the links between OR and AI on behalf of the Brazilian population. Our objective is to compare the predictive power of the ANN and SVM models in terms of classifying the risk of death (high or low) in Brazilian patients admitted with ACSs. This also differs from previous studies, whose aim is often the diagnosis of cardiovascular diseases instead of the intra-hospital prognosis. In this sense our work comes closer to the survival study of (^{Xing et al., 2007}). However, those authors were interested in the post-hospital prognosis since they defined survival as a patient being alive 6 months after a positive diagnosis of coronary heart disease.
Here the data set has clinical, genetic and socio-environmental variables. However, the use of variables that are not relevant for predicting patient outcomes can disrupt the training and compromise the generalization power of the model. Furthermore, a model with all variables forces us to discard a large number of individuals because of missing data. The number of possible variable sets to be examined grows exponentially with the number of variables. So, a large number of variables - as is the case here - implies a great computational cost as regards time and memory.
In practice, one common way to circumvent this issue is to adopt a heuristic variable selection method that allows us to quickly identify a few potentially promising variable sets. For this purpose, we first order the input variables using a filter. Next, each classification algorithm is used independently to select the most relevant set of input variables.
We have observed that the studies above excluded individuals with missing information instead of accommodating them in their approaches. Maximising the use of the data available is very important because collecting data from patients is often a very expensive and time-consuming process, which requires a considerable amount of human, material and financial resources (^{Kononenko, 2001}). Also note that our emphasis on variable selection is another point that contrasts with the reviewed literature whose works do not often aim at identifying the most critical variables for the diagnosis of heart disease.
The comparative study between the ANN and SVM algorithms is based on the well-established Mutual Information Feature Selector under Uniform information distribution (MIFS-U) criterion (^{Kwak & Choi, 2002}; ^{Gonçalves & Macrini, 2011}). To verify the robustness of the results with regard to this mutual information filter, we retrain the classification algorithm with better performance using the orders of variables provided by two new filters based on Euclidean distance. The development of these Euclidean filters is our main methodological contribution.
We performed logistic regression analyses, both with and without first-order multiplicative interaction of the input variables selected in the preprocessing filter step. The sensitivity results for all tested variable sets were under 10%, as already expected because of the very small ratio of death events per variable (^{Concato et al., 1995}; ^{Peduzzi et al., 1995}, ^{1996}). For brevity, we excluded experiments using logistic regression from the scope of this study.
This article is organized as follows. In the next section, we review the variable selection methods and the SVM and ANN algorithms and introduce our Euclidean filters. In Section 3, the data set used in the experiments is described. In Section 4, we discuss the results of the computational experiments, which include the comparative experiments and the corresponding robustness analysis. In the Conclusion, final remarks are presented and future work is outlined.
2 DATA MINING TECHNIQUES
The variable selection methods and the ANN and SVM models employed in this study are briefly discussed in this section.
2.1 Variable Selection Method
To construct efficient classifiers, variable selection is an important step for the following reasons (^{Salappa et al., 2007}; ^{Guyon & Elisseeff, 2003}; ^{Saeys et al., 2007}):
to avoid overfitting, to reduce noise, and (through this process) to improve the predictive power of the classifier;
to obtain models with reduced computational cost, both in terms of the processing time and the memory requirements; and
to directly elucidate the underlying process responsible for generating the data.
Our data set has 28 input variables (Appendix A) collected from 411 individuals, of whom only 37 died. However, not all variables were collected for all individuals. Requiring information for all variables reduced the size of the training data set to 264 individuals, of whom only 17 died. Here the a priori determination of which variables are relevant to the death prognosis is critical to reduce the dimension of the input space and thereby to increase the size of the training data set.
The variable selection methods can be grouped into three broad classes: a filter, a wrapper, and an embedded method (^{Guyon & Elisseeff, 2003}; ^{Saeys et al., 2007}; ^{Blum & Langley, 1997}). Filters correspond to a preprocessing technique that selects input variables before training the classification algorithm. The advantages of filters are the ease and speed of implementation, whilst their main disadvantage is that they ignore the interaction with the classifier.
Wrappers use the classification algorithms as black boxes to assess the predictive power of subsets of input variables. These subsets are normally built either randomly or through a heuristic procedure. Finally, embedded methods incorporate the selection method into the classifier training process. The main advantage of these two latter methods is the fact that they interact with the classification algorithm. However this interaction also constitutes the source of their drawbacks, namely greater computational cost (i.e., time and memory) and dependency on the classification algorithm itself.
Our approach combines a wrapper method and a filter. This enables us to take advantage of the benefits offered by these two methods and at the same time to minimize their deficiencies. First, to compare the ANN and the SVM algorithm we use the MIFS-U filter. Via a greedy strategy this filter provides us with an order of input variables based on the degree of mutual information between input variables and the response variable. The density distributions of the variables are approximated by their histograms, and it is assumed that the information contained in these variables is uniformly distributed. In the second experiment, to explore the robustness of the results we use two new filters based on Euclidean distance. These filters are discussed in Section 2.2.
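The greedy ordering described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the variables have already been discretized into histogram bins, and it uses the selection score commonly attributed to MIFS-U, namely I(C; f) minus beta times the sum over already selected variables s of (I(C; s) / H(s)) I(f; s). The function names, the parameter `beta` and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a discrete sample, from its histogram."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from joint counts."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mifs_u_order(features: Dict[str, Sequence[int]],
                 target: Sequence[int],
                 beta: float = 1.0) -> List[str]:
    """Greedily rank variables: high relevance to the target, low redundancy."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected: List[str] = []
    remaining = list(features)
    while remaining:
        def score(name: str) -> float:
            penalty = 0.0
            for s in selected:
                h_s = entropy(features[s])
                if h_s > 0.0:  # a constant variable carries no information
                    penalty += (relevance[s] / h_s) * mutual_information(
                        features[name], features[s])
            return relevance[name] - beta * penalty

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: "b" duplicates "a", "c" is weakly informative.
toy_features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_target = [0, 0, 0, 1, 1, 1, 1, 1]
print(mifs_u_order(toy_features, toy_target))
```

In the toy data the redundant copy "b" is pushed behind the weaker but non-redundant variable "c", which is the behaviour the redundancy penalty is meant to produce.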
The order of variables is used as an input for the wrapper approach. We adopt the sequential forward-selection strategy, where the classification algorithm (ANN or SVM) evaluates each nested subset of variables until the classification error begins to increase. In other words, the classification algorithm evaluates the subset with the k+1 first variables if and only if the subset with the first k variables yields a classification error below what was obtained from the subset with the k-1 first variables. However, the variable selection is not interrupted until the minimum value of k equal to six is reached.
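The stopping rule above can be sketched as a short routine. It is only an illustration under stated assumptions: `error_of` is a hypothetical callback that trains a classifier (ANN or SVM) on the given variables and returns its classification error, and the routine returns the evaluated subset with the lowest error.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(ordered_vars: Sequence[str],
                   error_of: Callable[[List[str]], float],
                   min_k: int = 6) -> Tuple[List[str], float]:
    """Evaluate nested subsets of a filter-ranked list of variables."""
    errors: List[float] = []
    for k in range(1, len(ordered_vars) + 1):
        errors.append(error_of(list(ordered_vars[:k])))
        # Never stop before min_k subsets have been evaluated; after that,
        # stop as soon as the k-th subset fails to improve on the (k-1)-th.
        if k >= max(min_k, 2) and errors[-1] >= errors[-2]:
            break
    best_k = min(range(len(errors)), key=errors.__getitem__) + 1
    return list(ordered_vars[:best_k]), errors[best_k - 1]


# Hypothetical error curve indexed by subset size (not results from the study).
toy_errors = [0.5, 0.3, 0.2, 0.25, 0.4, 0.45, 0.5, 0.1]
evaluated_sizes: List[int] = []


def toy_error(subset: List[str]) -> float:
    evaluated_sizes.append(len(subset))
    return toy_errors[len(subset) - 1]


names = [f"v{i}" for i in range(1, 9)]
best_subset, best_error = forward_select(names, toy_error)
print(best_subset, best_error, evaluated_sizes)
```

Because the search is greedy, the low error at eight variables in the toy curve is never reached: the procedure stops after the sixth subset, which is the price paid for avoiding an exhaustive search.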
Classification errors are assessed using the following concepts:
Accuracy (a) This is the probability of correctly predicting the outcome.
Sensitivity (x) This is the probability of correctly predicting the high death risk.
Specificity (y) This is the probability of correctly predicting the low death risk.
The Pearson correlation coefficient (PCC) is used to assess the performance of the classifiers. This coefficient provides a balance between the concepts of sensitivity and specificity. The value of PCC ranges from -1 (total disagreement) to 1 (total agreement), and a zero value represents totally random predictions (^{Baldi et al., 2000}).
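For a binary outcome, the Pearson correlation between predicted and observed classes is commonly written in terms of confusion-matrix counts (a form also known as the Matthews correlation coefficient). The expression below is a sketch under that assumption; TP, TN, FP and FN are symbols introduced here for the numbers of true positives, true negatives, false positives and false negatives, with "positive" standing for high death risk.

```latex
\mathrm{PCC} =
\frac{TP \cdot TN - FP \cdot FN}
     {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

A classifier that labels every patient as low risk has TP = FP = 0, so the numerator vanishes even though its accuracy may be high on a sample with few deaths, which is why this coefficient is more informative than accuracy alone here.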
The leave-one-out cross validation technique is used to estimate the quantities described above. For a data set with L individuals, this corresponds to performing L trainings, where each training set contains L-1 individuals and the test sample consists of the excluded individual, which is different in each training. In the end, the probabilities are estimated by
$$a = \frac{T}{L}, \qquad x = \frac{T_{P}}{L_{P}}, \qquad y = \frac{T_{N}}{L_{N}}, \tag{1}$$
where T is the total number of correct predictions obtained by the classifier for a given data set; T_{P} and L_{P} are, respectively, the number of positive data points that are correctly classified and the total number of positive data points; and T_{N} and L_{N} are, respectively, the number of negative data points that are correctly classified and the total number of negative data points.
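The leave-one-out procedure and the counts T, T_{P}, T_{N}, L_{P} and L_{N} can be illustrated with a small sketch. The classifier used here (nearest class mean on a single variable) is a hypothetical stand-in for the ANN and SVM models, and the data are invented; labels follow the convention +1 for high risk and -1 for low risk.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (input vector, label in {-1, +1})
Predictor = Callable[[Sequence[float]], int]


def loo_estimates(data: List[Sample],
                  fit: Callable[[List[Sample]], Predictor]) -> Dict[str, float]:
    """Leave-one-out estimates of accuracy, sensitivity and specificity.

    Assumes both classes are present in the data and in every training fold.
    """
    l_pos = sum(1 for _, label in data if label == 1)
    l_neg = len(data) - l_pos
    t_pos = t_neg = 0
    for i, (x, label) in enumerate(data):
        predict = fit(data[:i] + data[i + 1:])  # train on the other L-1 points
        if predict(x) == label:
            if label == 1:
                t_pos += 1
            else:
                t_neg += 1
    return {
        "accuracy": (t_pos + t_neg) / len(data),
        "sensitivity": t_pos / l_pos,
        "specificity": t_neg / l_neg,
    }


def nearest_mean(train: List[Sample]) -> Predictor:
    """Toy classifier: assign the class whose mean (first variable) is closer."""
    means = {}
    for label in (-1, 1):
        values = [x[0] for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda c: abs(x[0] - means[c]))


toy_data: List[Sample] = [
    ([5.0], 1), ([6.0], 1), ([1.5], 1),
    ([1.0], -1), ([2.0], -1), ([0.5], -1), ([1.2], -1),
]
estimates = loo_estimates(toy_data, nearest_mean)
print(estimates)
```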
2.2 New Euclidean Filters
Here two ordering criteria for the input variables are developed using Euclidean distance. The only source of information for these criteria is the data themselves. This is because even among medical specialists there is no established consensus regarding the absolute and/or relative importance of each variable to some given ACS prognosis.
Initially, we briefly analyze the ordering criterion developed in a previous study by (^{Chen et al., 2009}). To minimize text and avoid repeatedly referencing this article, we refer to this particular criterion as CZCL (the initials of its authors' names). Also let T = {(x_{1}, y_{1}),...,(x_{l}, y_{l})} be a sample with l individuals, where x_{i} = (x_{i1},...,x_{in}), x_{ij} is the value of variable j for individual i, and y_{i} ∈ {-1, 1} is the value of the response variable for individual i.
2.2.1 CZCL Criterion
The CZCL criterion is based on two hypotheses:
If small changes of an input variable correspond to large variations in the response variable, the input variable is relevant.
If small changes of an input variable correspond to small variations in a response variable, then the input variable is unimportant.
This criterion orders the variables according to the following score:
Where VA_{k} is a score assigned to the k^{th} variable by domain experts and S_{k} is an objective score for the k^{th} variable. The positive parameters g _{1} and g _{2} set the balance between domain knowledge and data. The score S_{k} is formally defined by
where:
Note that S_{k} corresponds to a normalization of T_{k} in order to use the same scale for the terms T_{k} and VA_{k} . Because our goal is to use only the data, we can drop the term VA_{k} and the parameters g _{1} and g _{2} from Equation 2. Thus, there is no need to compute the value of S_{k} and we can order the variables using the score given by:
The hypotheses above then imply that a relevant variable has a low score T^{CZCL} .
According to (^{Chen et al., 2009}), the CZCL criterion has the following advantages:
It takes into account the relations between input and output variables when selecting which variables are more relevant.
It uses the primitive input variables instead of some transformation thereof, as is done in principal component analysis.
It does not require a large number of data points.
It does not require the data to conform to any statistical distribution.
It can capture non-linear relations between input and output variables.
It is simple to implement and has low computational cost.
2.2.2 Disagreement Criterion
Since in this study the response variable has only two categories, we only need to calculate T_{k} (Equation 4) for pairs of individuals who have distinct outcomes. We then have that
Now note that there is no loss of information if we re-express T_{k} by:
Because the term d ^{2} (x_{r} , x_{s} ) is present in the computation of all variables, its omission does not affect the relative order of the variables whilst also achieving a score that is more sensitive to the main term d_{k} ^{2}(x_{r} , x_{s} ). Implementing this change, we can then order variables using a disagreement score given by
where
Keeping valid both hypotheses assumed for the CZCL criterion, the relevance of the k^{th} variable increases as the value of T_{k} ^{D} decreases.
The disagreement criterion has two additional advantages compared to the CZCL criterion:
2.2.3 Inverse Criteria
The scores T_{k} ^{CZCL} and T_{k} ^{D} are based on the Euclidean distance between the values of input variable k in two different individuals. Because of the sensitivity hypotheses, their corresponding criteria described above select those variables for which the summed distances are small. However, it is also reasonable to assume the converse of this condition: input variables corresponding to large distances are more able to distinguish between two possible outcomes (^{Dash & Liu, 1997}). Based on this new hypothesis, those variables with large scores T_{k} ^{CZCL} or T_{k} ^{D} are more relevant. This assumption yields what we call the inverse criteria. Therefore, the inverse CZCL criterion and the inverse disagreement criterion select input variables that have, respectively, the highest scores T_{k} ^{CZCL} and T_{k} ^{D} .
2.3 Artificial Neural Network (ANN) Model
The specific type of ANN (^{Bishop, 1995}; ^{Haykin, 1999}; ^{Teixeira Júnior et al., 2015}) used in this study is a three-layer feed-forward ANN. The input layer has k input neurons and one bias neuron, where k corresponds to the number of input variables included in the model. The hidden layer is initially composed of 10 hidden neurons and one bias neuron. Finally, the output layer has one output neuron. The hyperbolic tangent is adopted as an activation function for the hidden layer, whereas a linear function is assigned to the output layer. This structure is justified by its simplicity. Moreover, if it is assumed there are non-identical data in distinct categories, three-layer neural classifiers are universal classifiers (^{Young & Downs, 1998}).
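The forward computation of such a network (hyperbolic tangent in the hidden layer, linear output) can be sketched in a few lines. The weights below are arbitrary illustrative values, not trained parameters, and the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def forward(x: Sequence[float],
            w_hidden: Sequence[Sequence[float]],
            b_hidden: Sequence[float],
            w_out: Sequence[float],
            b_out: float) -> float:
    """Output of a three-layer feed-forward network for one input vector."""
    # Hidden layer: weighted sum of the inputs plus bias, then tanh.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Output layer: linear combination of the hidden activations plus bias.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Two inputs and two hidden neurons, with made-up weights.
output = forward(
    x=[1.0, -1.0],
    w_hidden=[[0.5, 0.5], [1.0, 0.0]],
    b_hidden=[0.0, 0.0],
    w_out=[2.0, 1.0],
    b_out=0.5,
)
print(output)
```

With a linear output neuron, a class label is obtained by thresholding the output, for example at zero when the two classes are coded as -1 and +1.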
Supervised training of the ANN is performed using the error back-propagation algorithm associated with the gradient descent method. In this case, the error associated with each input-output neuron pair is computed and back-propagated, and the synaptic weights are adjusted to reduce the total error. This procedure is repeated until the algorithm converges.
To avoid over-fitting we adopt the Bayesian regularization algorithm (^{MacKay, 1992}). In this framework, the ANN weights and biases are assumed to be random variables. The regularization parameters are the unknown variances associated with these distributions and can be estimated through adequate statistical techniques. The result is the minimization of a function that is a linear combination of the quadratic errors and the weights of the hidden and output layers. Bayesian regularization requires the Hessian matrix to be computed, which implies using the Levenberg-Marquardt algorithm (^{Nocedal & Wright, 2006}). Ultimately, this structure enables us to select the smallest set of neurons in the hidden layer that provides the best optimization of the ANN.
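The regularized objective mentioned above is usually written in the following form. This is a sketch of the standard formulation rather than a transcription from the study: the symbols alpha, beta, t_i (target), o_i (network output) and w_j (weights and biases) are introduced here for illustration, and the weight term is assumed to be the sum of squared weights.

```latex
F(\mathbf{w}) = \beta\, E_{D} + \alpha\, E_{W},
\qquad
E_{D} = \sum_{i=1}^{L} \left( t_{i} - o_{i} \right)^{2},
\qquad
E_{W} = \sum_{j} w_{j}^{2}
```

In the Bayesian reading, alpha and beta play the role of inverse variances of the weight prior and of the noise, respectively, so a larger ratio alpha/beta favours smaller weights and a smoother decision function.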
2.4 Support Vector Machine (SVM)
Solving a convex quadratic problem, the SVM model (^{Boser et al., 1992}; ^{Cortes & Vapnik, 1995}; ^{Suykens et al., 2010}; ^{Scholkopf & Smola, 2001}) selects a hyperplane that minimizes structural risk. The minimization of structural risk (^{Vapnik, 2006}) establishes a compromise between the complexity of the decision function space and the ability to fit the model to the training data set (empirical risk). This process guarantees a good generalization power for the trained classifier, i.e., a strong propensity to correctly predict the outcome of an individual out of the training sample.
When associated with a kernel function, the SVM model allows non-linear classifiers to be built by implicitly mapping the initial data into a space of higher dimension than the original one. In this case, the linear classifier obtained in a higher-dimension space corresponds to a non-linear classifier in the original space.
Here we use the ν-SVM (^{Scholkopf & Smola, 2001}; ^{Chen et al., 2005}; ^{Scholkopf et al., 2000}). This classifier was initially conceived to recognize two types of patterns and was subsequently extended to multi-class and regression problems. In ν-SVM training, it is necessary to adjust the parameter ν, which represents an upper limit for the fraction of training errors and a lower limit for the fraction of support vectors. These interpretations of the ν parameter simplify its calibration. The kernel function adopted in this study is the hyperbolic tangent since the goal is to compare the SVM and ANN classifiers (^{Karatzoglou et al., 2006}).
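A hyperbolic tangent kernel has the form k(x, z) = tanh(scale * <x, z> + offset). The sketch below evaluates it and builds the corresponding kernel (Gram) matrix for a few points; the parameter names `scale` and `offset` and their values are illustrative assumptions, since the settings used in the experiments are not stated here.

```python
import math
from typing import List, Sequence


def tanh_kernel(x: Sequence[float], z: Sequence[float],
                scale: float = 1.0, offset: float = 0.0) -> float:
    """Hyperbolic tangent kernel: tanh(scale * <x, z> + offset)."""
    dot = sum(a * b for a, b in zip(x, z))
    return math.tanh(scale * dot + offset)


def gram_matrix(points: Sequence[Sequence[float]],
                scale: float = 1.0, offset: float = 0.0) -> List[List[float]]:
    """Kernel values for every pair of points, as used by an SVM solver."""
    return [[tanh_kernel(p, q, scale, offset) for q in points] for p in points]


toy_points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = gram_matrix(toy_points)
for row in K:
    print(row)
```

The kernel mirrors the activation function of the hidden layer in the ANN, which is what makes the two classifiers comparable on similar terms.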
To train the classifier, we use the Sequential Minimal Optimization (SMO) algorithm (^{Platt et al., 1999}). This algorithm analytically determines the global solution, optimizing at each iteration only two Lagrangian multipliers from the convex quadratic problem corresponding to the model's mathematical formulation. The SMO algorithm requires minimal computational memory resources and is extremely fast because it performs only a limited number of very simple operations.
3 EXPERIMENTAL DESIGN
Data were collected from a prospective cohort study of patients of both genders who were admitted with ACS to five hospitals (three public and two private) in the municipality of Niterói, Rio de Janeiro, Brazil, between July 2004 and June 2005 (^{dos Reis et al., 2007}). Only patients who were over 20 years of age and did not display any signs of terminal cancer, multiple trauma, or dementia were considered.
The data set contains 28 explanatory variables, which are classified into five categories: social and anthropometric variables; variables related to previous cardiovascular history; clinical and laboratory variables concerning hospital admission; diagnosis variables; and genetic variables. The response variable is the occurrence of in-hospital death. These variables are described in Appendix A, which includes information regarding the measurement scales and lists the abbreviation used for each variable.
4 RESULTS AND DISCUSSION
The MIFS-U filter and the ANN models are implemented using the MATLAB software, version 7.0. The Euclidean filter and ν-SVM models are run in the R software, version 2.7.0, using the kernlab package. To avoid scale issues, the data are normalized.
4.1 Comparison between ANN and SVM using the MIFS-U filter
The MIFS-U criterion is used two times consecutively. Initially the filter is applied to a sample with 264 individuals (17 deaths and 247 survivals), for whom data regarding the response variable and all 28 input variables are available. The order obtained for the 28 input variables with respect to the response variable death is shown in Table 1.
Position | Variable | Position | Variable |
---|---|---|---|
1 | Age | 15 | TT genotype |
2 | APR | 16 | Gender |
3 | Creatinine | 17 | HDL cholesterol |
4 | DD genotype | 18 | MT genotype |
5 | E2E2 genotype | 19 | II genotype |
6 | E4E4 genotype | 20 | Smoking |
7 | BMI | 21 | Total cholesterol |
8 | E2E3 genotype | 22 | ACS |
9 | SAH | 23 | DI genotype |
10 | E3E4 genotype | 24 | Killip |
11 | E2E4 genotype | 25 | Triglyceride |
12 | Diabetes mellitus | 26 | PMI |
13 | MM genotype | 27 | Education level |
14 | E3E3 genotype | 28 | Heart rate |
Next the input space is reduced to 16 variables by disregarding the 12 last-ranked variables. The MIFS-U filter is then applied to this new subset of input variables, corresponding to a sample of 351 individuals (23 deaths and 328 survivals). The result is presented in Table 2. The sample size increases because requiring fewer input variables means that fewer individuals have to be discarded for missing data.
Position | Variable | Change | Position | Variable | Change |
---|---|---|---|---|---|
1 | Age | 0 | 9 | E3E4 genotype | -1 |
2 | APR | 0 | 10 | Diabetes mellitus | -2 |
3 | Creatinine | 0 | 11 | E2E4 genotype | 0 |
4 | BMI | -3 | 12 | E2E3 genotype | +4 |
5 | DD genotype | +1 | 13 | MM genotype | 0 |
6 | E4E4 genotype | 0 | 14 | SAH | +5 |
7 | E2E2 genotype | +2 | 15 | TT genotype | 0 |
8 | Gender | -8 | 16 | E3E3 genotype | +2 |
The results show that the three top-ranking variables remain unchanged in both orders of variables. Note that Age, Any Previous Revascularization (APR), Creatinine, Body Mass Index (BMI), and the DD, E2E2 and E4E4 genotypes are the seven variables holding the most combined mutual information regarding the outcome of interest. Also observe that seven variables (Age; Any Previous Revascularization; Creatinine; and the E2E4, E4E4, MM, and TT genotypes) do not change their ranks between the two filter passes, and two variables (the DD and E3E4 genotypes) change by only one position. The variables E2E2 genotype, E3E3 genotype and Diabetes Mellitus shift by only two positions, whilst Body Mass Index shifts by three positions. In contrast, the variables with the greatest changes are the E2E3 genotype (four positions), Systemic Arterial Hypertension (SAH) (five positions), and Gender (eight positions). The results can therefore be considered stable since the greatest changes in the ordering only appear from the eighth position onwards.
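The rank shifts discussed above can be recomputed directly from the two orderings. The sketch below uses the first 16 variables of Table 1 and the ordering of Table 2, and defines the change as the new position minus the old one, which matches the sign convention of the Change column; the function name is hypothetical.

```python
from typing import Dict, List

first_pass: List[str] = [
    "Age", "APR", "Creatinine", "DD", "E2E2", "E4E4", "BMI", "E2E3",
    "SAH", "E3E4", "E2E4", "Diabetes mellitus", "MM", "E3E3", "TT", "Gender",
]
second_pass: List[str] = [
    "Age", "APR", "Creatinine", "BMI", "DD", "E4E4", "E2E2", "Gender",
    "E3E4", "Diabetes mellitus", "E2E4", "E2E3", "MM", "SAH", "TT", "E3E3",
]


def rank_changes(before: List[str], after: List[str]) -> Dict[str, int]:
    """New position minus old position for every variable (1-based ranks)."""
    old_position = {name: i for i, name in enumerate(before, start=1)}
    return {name: i - old_position[name]
            for i, name in enumerate(after, start=1)}


changes = rank_changes(first_pass, second_pass)
unchanged = {name for name, delta in changes.items() if delta == 0}
print(changes)
print(sorted(unchanged))
```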
Table 3 summarizes the results for the ANN and SVM models trained using the sample with 351 individuals for all sets of variables. Classifiers built with the three top-ranking variables according to the MIFS-U filter (Age, Any Previous Revascularization, and Creatinine) yielded the best results in both models. Thus, the best classifiers are obtained from the information contained in an integer variable (Age), a categorical variable (Any Previous Revascularization), and a continuous variable (Creatinine).
The fact that genetic and diagnostic variables do not contribute to the construction of the optimal classifiers stands out. In the case of the ANN classifier, an increased number of input variables tends to decrease sensitivity and to increase specificity. Therefore, the choice of the first three variables represents a compromise between these two concepts, as determined using the Pearson correlation coefficient. Note that the SVM model with those three variables has superior predictive power compared to every ANN model trained.
4.2 Robustness Analysis using SVM and Euclidean Filters
To verify to what extent the filter biased the wrapper variable selection, we revisit the data set, applying each Euclidean filter a single time. For brevity, we focus on the SVM model since it clearly outperformed the ANN model previously.
To assess the importance of genetic variables for the prognosis of ACS, here we adopt their parametrisation in terms of alleles instead of genotypes. For example, take the Apolipoprotein E gene polymorphism. In the first experiment, we considered six input variables corresponding to its six genotypes XY, where X, Y ∈ {E2, E3, E4} are the possible alleles. Now we have only three binary variables E2, E3, E4 corresponding to the three alleles associated with this polymorphism. Observe that this new definition does not cause any loss of information. We also include three additional variables: time elapsed before first medical attention, family history of coronary arterial disease and physical activity. The first two variables were excluded from the first experiment because they are not directly associated with each sampled individual. The last variable was initially omitted because we assumed that the variables Body Mass Index and physical activity capture similar information. Appendix B describes these three variables and the re-parametrised genetic variables.
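The claim that the allele coding loses no information can be checked by enumeration: each of the six genotypes maps to a distinct triple of binary indicators. The sketch below assumes an indicator equals 1 when the allele is present in the genotype; the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, Tuple

ALLELES = ("E2", "E3", "E4")


def allele_indicators(genotype: Tuple[str, str]) -> Dict[str, int]:
    """Binary presence indicator for each allele of an unordered genotype."""
    return {allele: int(allele in genotype) for allele in ALLELES}


# The six unordered genotypes: E2E2, E2E3, E2E4, E3E3, E3E4, E4E4.
genotypes = list(combinations_with_replacement(ALLELES, 2))
codes = {g: tuple(allele_indicators(g).values()) for g in genotypes}

for genotype, code in codes.items():
    print("".join(genotype), code)
```

Homozygous genotypes produce a single 1 and heterozygous genotypes produce two, so the three indicators identify the genotype uniquely while halving the number of input variables for this polymorphism.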
The four criteria (CZCL criterion, inverse CZCL criterion, disagreement criterion and inverse disagreement criterion) are applied to a complete sample with 226 individuals, of whom 16 had fatal outcomes. Tables 4 and 5 show the orders of variables given, respectively, by T_{k} ^{CZCL} and T_{k} ^{D} . As discussed in Section 2.2, the score T_{k} ^{D} provides a more well-defined classification of the input variables than the score T_{k} ^{CZCL} : the distance between the first and last variables using T_{k} ^{D} is 0.56 whilst using T_{k} ^{CZCL} is only 0.02.
Position | Variable | T_{K} ^{CZCL} | Position | Variable | T_{K} ^{CZCL} |
---|---|---|---|---|---|
1 (26) | APR | 0.98404 | 14 (13) | FHD | 0.99172 |
2 (25) | E3 allele | 0.98532 | 15 (12) | Triglyceride | 0.99182 |
3 (24) | TFM | 0.98588 | 16 (11) | Physical Activity | 0.99253 |
4 (23) | PMI | 0.98786 | 17 (10) | ACS | 0.99314 |
5 (22) | E2 allele | 0.98795 | 18 (9) | Gender | 0.99383 |
6 (21) | SAH | 0.98847 | 19 (8) | Diabetes mellitus | 0.99402 |
7 (20) | E4 allele | 0.98925 | 20 (7) | Education Level | 0.99435 |
8 (19) | I allele | 0.98981 | 21 (6) | Killip | 0.99442 |
9 (18) | T allele | 0.99012 | 22 (5) | Smoking | 0.99517 |
10 (17) | BMI | 0.99018 | 23 (4) | HDL Cholesterol | 0.99532 |
11 (16) | D allele | 0.99054 | 24 (3) | Age | 0.99862 |
12 (15) | M allele | 0.99091 | 25 (2) | Heart Rate | 0.99873 |
13 (14) | Total Cholesterol | 0.99163 | 26 (1) | Creatinine | 1.00000 |
Position | Variable | T_{K} ^{D} | Position | Variable | T_{K} ^{D} |
---|---|---|---|---|---|
1 (26) | APR | 0.4374 | 14 (13) | Triglyceride | 0.6192 |
2 (25) | SAH | 0.4815 | 15 (12) | Education Level | 0.6278 |
3 (24) | TFM | 0.5155 | 16 (11) | ACS | 0.6295 |
4 (23) | PMI | 0.5274 | 17 (10) | Diabetes mellitus | 0.6404 |
5 (22) | I allele | 0.5287 | 18 (9) | Gender | 0.6540 |
6 (21) | Physical Activity | 0.5510 | 19 (8) | Smoking | 0.6772 |
7 (20) | FHD | 0.5730 | 20 (7) | HDL cholesterol | 0.6809 |
8 (19) | BMI | 0.5747 | 21 (6) | D allele | 0.6920 |
9 (18) | T allele | 0.5776 | 22 (5) | E3 allele | 0.7618 |
10 (17) | M allele | 0.5777 | 23 (4) | Age | 0.7726 |
11 (16) | E2 allele | 0.6017 | 24 (3) | Killip | 0.8371 |
12 (15) | Total Cholesterol | 0.6083 | 25 (2) | Heart Rate | 0.9247 |
13 (14) | E4 allele | 0.6152 | 26 (1) | Creatinine | 1.0000 |
For each variable set we train an SVM model using the maximal subset of individuals with no missing information with respect to the variables in that set. This means that two different sets of variables may be trained with two different samples. All SVM models with two variables presented poor performance, and so they are excluded from the discussion for the sake of conciseness. Tables 6 through 9 summarize the results.
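Selecting the maximal complete-case sample for a given variable set amounts to a simple filter over the records. The sketch below uses invented records and hypothetical variable names, with `None` marking a missing value, to show how the usable sample shrinks as more variables are required.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[float]]


def complete_cases(records: List[Record],
                   variables: Sequence[str]) -> List[Record]:
    """Keep the individuals with no missing value among the given variables."""
    return [r for r in records
            if all(r.get(v) is not None for v in variables)]


# Invented records for illustration only (not data from the study).
toy_records: List[Record] = [
    {"age": 61, "creatinine": 1.1, "heart_rate": 80},
    {"age": 72, "creatinine": None, "heart_rate": 95},
    {"age": 55, "creatinine": 0.9, "heart_rate": None},
    {"age": 68, "creatinine": 1.4, "heart_rate": 88},
]

sizes = [
    len(complete_cases(toy_records, ["age"])),
    len(complete_cases(toy_records, ["age", "creatinine"])),
    len(complete_cases(toy_records, ["age", "creatinine", "heart_rate"])),
]
print(sizes)
```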
The disagreement and inverse disagreement criteria select variables that allow us to construct classifiers with better performance than those obtained using the CZCL and inverse CZCL criteria. This suggests that the score T_{k} ^{D} identifies more efficiently the relevant information in the whole set of input variables.
The classifiers that are constructed with variables selected using the disagreement and inverse disagreement criteria also yield similar results. This indicates that the underlying hypotheses of both criteria may be valid. Therefore, it is worthwhile to explore whether there is a subset of the variables selected by both criteria that provides us with a better classifier (^{Guyon & Elisseeff, 2003}).
Looking at the best classifiers (classifiers 5 and 15) we then select seven variables: (1) any previous revascularization, (2) systemic arterial hypertension, (3) time elapsed before first medical attention, (4) creatinine, (5) heart rate, (6) Killip classification, and (7) age. To balance the proportion between variables identified from each criterion in this set, we have excluded the allele E3 (classifier 15). The first three variables are selected using the disagreement criterion (classifier 5), and the other four variables are selected using the inverse disagreement criterion (classifier 15).
Note that allele E3 is the last variable included in classifier 15. It provides a 9% increase in sensitivity for classifier 15 with respect to classifier 14, although the specificity and accuracy are reduced by approximately 1% and 0.2%, respectively. On the other hand, this variable excludes one dead individual from the sample used to train classifier 14 because data regarding allele E3 is not available for that particular individual. Given the small number of dead patients, it can be hypothesised that the improvement obtained with the inclusion of allele E3 is not directly attributable to this variable but results from the exclusion of this dead individual. So, we can assume that the most relevant variables for classifier 15 are indeed the first four variables: creatinine, heart rate, Killip and age.
Next, we train SVM models using several subsets of those seven variables. The results are summarized in Table 10. All subsets include the following three variables: any previous revascularization (first to be selected by the disagreement criterion), creatinine (first to be selected by the inverse disagreement criterion) and age. The decision to include age in every subset of variables is justified by two main reasons. First, since age is a variable collected for all individuals, its inclusion in a classifier does not exclude any individual from the sample. Second, this variable is less error-prone since the level of socioeconomic development in large urban areas prevents the great majority of people from being uncertain about their ages.
Also observe that including age in classifier 14 enables us to improve the performance considerably with respect to classifier 13: a 35.3% increase in accuracy, a 38.3% increase in specificity and a 3.3% increase in sensitivity. In contrast, the performance of classifier 13 is much worse than that of classifier 9: a 29.7% decrease in accuracy, a 32.3% decrease in specificity and a 2.1% decrease in sensitivity. This finding suggests that the discrimination power of the variable age when used in conjunction with creatinine and heart rate is greater than that of the variable Killip employed with the same two variables.
The best classifiers obtained from the combination of the disagreement and inverse disagreement criteria are classifiers 17 and 18. One might argue that systemic arterial hypertension is not relevant since the performances of classifiers 17 and 18 are similar. To test this assumption, we evaluate a three-variable classifier homologous to the one obtained using the variables creatinine, age and any previous revascularization.
In this case, we replace the variable any previous revascularization in classifier 17 with the variable systemic arterial hypertension. In contrast to the variables creatinine and age, which are selected using the inverse disagreement criterion and are non-categorical, the variables any previous revascularization and systemic arterial hypertension are selected using the disagreement criterion and are two-class variables. The performance of the SVM model trained with the variables creatinine, systemic arterial hypertension and age is:
= 98.3%, ŷ = 86%, â = 97.4%, PCC = 57.5%.
This result suggests that the variable systemic arterial hypertension is relevant and brings the same kind of information that the variable any previous revascularization does in the presence of the variables creatinine and age. Therefore, both variables (APR and SAH) can be used (although not in the same classifier) to predict the risk of death.
5 CONCLUSION
In this study, we combined the wrapper and filter approaches to select input variables using an incomplete sample. This allowed us to maximize the use of information without resorting to methods for estimating missing data. In the first experiment, we used the order of variables given by the MIFS-U filter to compare the capability of ν-SVM and feed-forward ANN models to predict the risk of death (as high or low) in patients admitted with ACS. In line with previous studies (^{Berikol et al., 2016}; ^{Kumari & Godara, 2011}; ^{Xing et al., 2007}), the results indicated that the ν-SVM model is superior. However, the classifier biases did not diverge in terms of variable selection since both classifiers identified the same optimal subset of input variables: Age, Any Previous Revascularization, and Creatinine.
In the second experiment, we assessed the impact that the MIFS-U filter could have on the variable selection and, therefore, on the performance of the models. For this purpose, we developed two new criteria for variable ordering (the disagreement criterion and the inverse disagreement criterion) based on Euclidean distance. These criteria have very low computational cost and are able to capture non-linear relations between input and response variables. Their combined use enabled us to construct classifiers with good performance both in terms of sensitivity and specificity.
Moreover, our Euclidean filters not only recovered the same optimal set of three variables chosen by the MIFS-U filter but also highlighted another set of three equally important variables: creatinine, age and systemic arterial hypertension. So, a possible further advance will be to propose a framework to integrate the classifiers constructed using these two variable groups. For example, this development can enable us to classify the death risk of patients hospitalized with acute coronary syndrome into three classes: high risk, for which both classifiers indicate high risk; moderate risk, for which the classifiers diverge (i.e., one indicates low risk and the other high risk); and low risk, for which both classifiers indicate low risk.
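The three-class rule proposed above can be written down in a few lines. The sketch below is a hypothetical illustration of that rule only: the function name `combined_risk` and the string labels `"high"` and `"low"` are assumptions, and the two inputs stand for the outputs of the two three-variable classifiers.

```python
def combined_risk(pred_a: str, pred_b: str) -> str:
    """Combine two binary risk predictions ("high" or "low") into three classes.

    pred_a and pred_b are assumed to come from the two classifiers built on
    the two variable groups; agreement keeps the label, divergence is moderate.
    """
    valid = {"high", "low"}
    if pred_a not in valid or pred_b not in valid:
        raise ValueError("predictions must be 'high' or 'low'")
    if pred_a == pred_b:
        return pred_a  # both classifiers agree: high or low risk
    return "moderate"  # the classifiers diverge


print(combined_risk("high", "high"))
print(combined_risk("high", "low"))
print(combined_risk("low", "low"))
```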
The objectives of this study were to identify the relevant variables for the ACS prognosis and to compare the prediction capabilities of ANN and SVM models. The accuracy of our best SVM classifiers was similar to that found for ACS diagnosis in (^{Berikol et al., 2016}). Nevertheless, if the goal were to assess the performance of a specific classifier in itself, the results should be interpreted with care (^{Chatfield, 1995}). Because of the reduced number of individuals in the sample, we used the same data set for variable selection, training and validation. Although the leave-one-out technique allowed us to circumvent this issue, we should recognise that the results tend to be positively biased.
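To make the leave-one-out procedure concrete, the following minimal sketch estimates sensitivity, specificity and accuracy for a fixed set of input variables. It is not the study's implementation: a simple nearest-centroid rule stands in for the SVM, the names `fit_nearest_centroid` and `leave_one_out` are hypothetical, the data are toy values, and label 1 is assumed to denote death.

```python
from typing import Callable, Dict, List

Vector = List[float]
Predictor = Callable[[Vector], int]


def fit_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Stand-in classifier (not an SVM): assign the class of the closest centroid."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x: Vector) -> int:
        # Squared Euclidean distance to each class centroid.
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return predict


def leave_one_out(X: List[Vector], y: List[int],
                  fit: Callable[[List[Vector], List[int]], Predictor]) -> Dict[str, float]:
    """Hold out each individual once, train on the rest, and pool the predictions."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        predict = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = predict(X[i])
        if y[i] == 1:
            tp += pred == 1
            fn += pred != 1
        else:
            tn += pred == 0
            fp += pred != 0
    # Assumes both classes are present, so neither denominator is zero.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(X),
    }


# Toy, well-separated data; values are illustrative only.
X_toy = [[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]]
y_toy = [0, 0, 0, 1, 1, 1]
scores = leave_one_out(X_toy, y_toy, fit_nearest_centroid)
print(scores)
```

Note that the loop only re-trains the classifier on a variable set fixed in advance. If that set was chosen using the same individuals, the held-out individual has already influenced the selection, which is one way the positive bias mentioned above can arise.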
Another possible research stream is to explore causal and explanatory analyses using graphical models such as Bayesian Networks (^{Pearl, 2009}; ^{Schenekenberg et al., 2011}) and Chain Event Graphs (^{Smith & Anderson, 2008}; ^{Collazo & Smith, 2015}). Finally, a future study could examine how different hidden-layer configurations of the ANN affect the results.
ETHICAL STANDARDS
The study protocol conforms to the ethical guidelines of the 1975 Declaration of Helsinki. In Brazil, the Research Ethics Committee of the Faculty of Medicine (Fluminense Federal University) and the National Research Ethics Committee approved it. All patients involved in this research signed a consent form. The authors declare that they have no conflict of interest.