Death risk and the importance of clinical features in elderly people with COVID-19 using the Random Forest Algorithm

Objectives: to train a Random Forest (RF) classifier to estimate the risk of death in elderly people (over 60 years old) diagnosed with COVID-19 in Pernambuco. A property of this classifier, called feature importance, was used to identify the attributes (main risk factors) related to the outcome (cure or death) through information gain. Methods: data from confirmed cases of COVID-19 were obtained between February 13 and June 19, 2020, in Pernambuco, Brazil. K-fold cross-validation (K=10) was used to assess RF performance and the importance of clinical features. Results: the RF algorithm correctly classified 78.33% of the elderly people, with an AUC of 0.839. Advanced age was the factor representing the highest risk of death. The main comorbidity and symptom were cardiovascular disease and oxygen saturation ≤ 95%, respectively. Conclusion: this study applied the RF classifier to predict the risk of death and identified the main clinical features related to this outcome in elderly people with COVID-19 in the state of Pernambuco.


Introduction
From the beginning of the COVID-19 (coronavirus disease 2019) pandemic until September 27, 2020, Brazil, the largest country in South America and the fifth largest in the world, already ranked second in the world in number of deaths from the disease. By mid-October, at least 4,717,991 Brazilians had developed the infection and, of these, 141,406 had died. 1 The lethality rate in several states in the Brazilian North/Northeast was much higher than the national average, especially in Pernambuco. 1 Faced with this epidemiological scenario, one of the challenges, besides the vaccine, is the need to guide public health policies for surveillance and control of the disease. Through the identification of the main risk factors, for example, it is possible to provide early monitoring of the most vulnerable groups, reducing the chance of evolution to unfavorable clinical outcomes.
Data extracted from patients with COVID-19 are a valuable source of information about both the pathophysiology of the disease and the risk factors associated with death. These data have been widely studied, and it is currently agreed that advanced age and the presence of comorbidities are associated with increased morbidity and mortality. 2 The abundant availability of these data allows the construction of Machine Learning (ML) algorithms, a branch of Artificial Intelligence, with which it is possible to identify more susceptible people based on individual features. In the methods called Classification, the algorithm learns during a process called training by receiving a set of inputs (clinical characteristics) along with the outputs (outcome). Once trained, the algorithm is able to predict an output from inputs not seen during training.
Several machine learning (ML) algorithms are widely used in building predictive models of disease. Random Forest (RF), in particular, has shown higher accuracy when compared to other algorithms. 3 It has the ability to list which attributes contribute to the decision making and is often used as a feature selection technique. Feature selection is considered an essential step in data analysis, as it can reduce the complexity/dimensionality of the problem. 4 An optimized data set leads to a more accurate model and also improves its interpretability. 5 This is especially important in the development of algorithms for clinical screening, as their computational cost should be as low as possible and healthcare professionals are interested in the pathophysiological mechanisms underlying the ML model.

Basic Concepts
This section presents machine learning (ML) concepts that are essential for understanding the work.

Classifier
Given a set of instances, each consisting of attribute values together with an associated class, a learning (or inducing) algorithm generates as output a classifier (also called a hypothesis) that can label a new instance whose class is unknown. Formally, an instance is a pair $\{x_i, f(x_i)\}$, where $x_i$ is the input (set of attributes) and $f(x_i)$ is the output (class or label). Let $X = \{\{x_1, f(x_1)\}, \{x_2, f(x_2)\}, \ldots, \{x_n, f(x_n)\}\}$ be a set of $n$ examples; the task of the learning algorithm is to induce a function $h(\cdot)$ that approximates the function $f(\cdot)$. In this sense, $h(\cdot)$ is called a hypothesis about the objective function $f(\cdot)$, that is, $h(x_i) \approx f(x_i)$.

Decision Trees
Decision trees are constructed and represented using two elements: nodes and the branches connecting the nodes. To make a decision, the flow starts at the root node and navigates through the branches until it reaches a leaf node. Each internal node in the tree denotes a test of an attribute, and the branches denote the possible values that attribute can take. During the tree formation process, also known as training or learning, the homogeneity of the classes is considered at each node split. Basically, the algorithm evaluates the information gain of the attributes for separating the samples in the training set. 6 The Gini impurity (GI) is an index for evaluating how well an attribute separates samples with the same label; that is, the aim is for each node to be composed of samples of a homogeneous class. The GI is defined in Equation 2.1, where $p_1(m), \ldots, p_c(m)$ are the proportions of samples of classes $1, \ldots, c$ at node $m$. The index is evaluated for all randomly selected predictors when building the tree, and the predictor yielding the highest degree of homogeneity among the samples is chosen. If node $m$ is pure (homogeneous), then the proportion $p_i(m)$ of one class $i$ at node $m$ equals 1 and consequently the index equals 0. The attribute for the split is chosen according to the impurity decrease shown in Equation 2.2, where $P_{esq}$ and $P_{dir}$ are the proportions of the samples of node $m$ that go to the left and right child nodes, respectively.
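Equations 2.1 and 2.2 are not reproduced in this text. Assuming they take the standard form of the Gini impurity and of the impurity decrease for a binary split (an assumption, since the original equations are not available here, and the child-node symbols $m_{esq}$ and $m_{dir}$ are introduced only for illustration), they can be sketched as:

```latex
% Equation 2.1 (assumed standard form): Gini impurity of node m with c classes
GI(m) = 1 - \sum_{i=1}^{c} p_i(m)^2

% Equation 2.2 (assumed standard form): impurity decrease when node m is split
% into a left child m_{esq} and a right child m_{dir}
\Delta GI(m) = GI(m) - P_{esq}\, GI(m_{esq}) - P_{dir}\, GI(m_{dir})

% Worked example: a two-class node with equal proportions, p_1(m) = p_2(m) = 0.5
GI(m) = 1 - (0.5^2 + 0.5^2) = 0.5
```

Under this form, a pure node has one proportion equal to 1 and all others equal to 0, so the index is 0, in agreement with the description above.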

Random Forest Algorithm
Let $H = \{h_1, h_2, h_3\}$ be a set, or ensemble, of three classifiers. An instance $x_i$ will be labeled by each classifier in $H$. If the three classifiers make distinct errors, then when $h_1(x_i)$ is wrong, it is possible that $h_2(x_i)$ and $h_3(x_i)$ are correct, so that combining the hypotheses by voting can correctly classify $x_i$. The random forest algorithm, or RF, 7 is based on the ensemble strategy. It provides diversity through random resampling of the data: when building each $h_i \in H$ for a given training set Л, a random subset of the data is generated from Л. In this way, the algorithm generates several decision trees, each trained on a different random sample. A major advantage of RF is that it is easy to measure the relative importance of each attribute for prediction. The algorithm implemented in Sklearn, 8 for example, provides a tool for this, which measures the importance of a feature by analyzing how much the tree nodes that use that attribute reduce the overall impurity of the forest. It calculates this value automatically for each feature after training and normalizes the results so that the importances sum to 1. The higher the value, the more important the attribute is. The importance of an attribute is calculated as the total (normalized) reduction of the criterion brought about by this attribute. It is also known as the Gini importance. 8
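As a minimal sketch of how these importances can be read from a trained forest, the example below fits Sklearn's `RandomForestClassifier` on synthetic data and prints the `feature_importances_` attribute. The data, the attribute names, and the hyperparameters (`n_estimators=100`, `random_state=0`) are illustrative assumptions and are not the ones used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; NOT the Pernambuco dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"attr_{i}" for i in range(X.shape[1])]  # hypothetical names

# Hyperparameters are illustrative assumptions, not the study's settings.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Gini importance: normalized total impurity reduction attributed to each
# attribute, averaged over the trees of the forest.
importances = forest.feature_importances_

# List the attributes from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: -pair[1])
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Because the values are normalized, they are non-negative and sum to 1, so each one can be read as the share of the total impurity reduction attributed to that attribute.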

K-fold Cross-Validation
K-fold cross-validation is a sampling method used to evaluate machine learning (ML) algorithms. 9,10 It consists of randomly dividing the set X into K mutually exclusive folds of equal size. The examples in K-1 folds are then used to train the model, and the induced hypothesis is tested on the remaining fold. This process is repeated K times, so that each fold is used exactly once as a test set, as shown in Figure 1, which used K=10.
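The partitioning can be illustrated with Sklearn's `KFold`. This is a sketch on a hypothetical set of 50 examples; the shuffling and the random seed are assumptions, since the text only states K=10.

```python
import numpy as np
from sklearn.model_selection import KFold

n_examples = 50  # hypothetical number of examples
X = np.arange(n_examples).reshape(-1, 1)

# K=10 as in the text; shuffle and random_state are illustrative assumptions.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

test_counts = np.zeros(n_examples, dtype=int)  # times each example is tested
fold_sizes = []
for train_idx, test_idx in kfold.split(X):
    # K-1 folds train the model; the remaining fold tests it.
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))

print("fold sizes:", fold_sizes)
print("every example tested exactly once:", bool((test_counts == 1).all()))
```

Each example appears in exactly one test fold and in the training data of the other K-1 iterations, which is what makes the folds mutually exclusive.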

Performance Metrics
The error rate of a classifier h is denoted by err(h) and is obtained from Equation 2.3. This measure compares the class assigned by the classifier to each example with its true class: if the two classes are equal, the example is correctly classified; otherwise it counts as an error. The accuracy, or hit rate, is denoted by c and corresponds to the complement of the error rate, as in Equation 2.4.
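Equations 2.3 and 2.4 are not reproduced in this text. Assuming the usual definitions (the error rate is the fraction of examples whose assigned class differs from the true class, and the accuracy is one minus the error rate), the two measures can be sketched as follows. The function names and the labels are hypothetical.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def accuracy(y_true, y_pred):
    """Complement of the error rate."""
    return 1.0 - error_rate(y_true, y_pred)

# Hypothetical labels: 1 = death, 0 = cure.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

err = error_rate(y_true, y_pred)  # 2 of the 8 examples are misclassified
acc = accuracy(y_true, y_pred)
print(f"err(h) = {err:.2f}, accuracy = {acc:.2f}")
```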
The error and hit rates can be obtained through a confusion matrix, which corresponds to a matrix whose dimension is the number of classes existing in X. In a confusion matrix referring to a set of examples with two classes, usually called positive and negative, we have: true positives (TP) which correspond to the example that is positive and was classi-

Methods
We identified 11,375 elderly patients who met the eligibility criterion (age over 60 years) and gathered them into a single database. These cases were notified in the period from February 13 to June 19, 2020, in the state of Pernambuco, Brazil. The data analyzed came from the Secretariat of Planning and Management of Pernambuco (SEPLAG-PE) and were downloaded on June 20 from www.dados.seplag.pe.gov.br. All the elderly people who were in home isolation or hospitalized were excluded, since their outcome had not yet been concluded by the end of the period considered. A total of 7486 elderly people remained, of whom 4356 (58.19%) recovered and 3130 (41.81%) died.
The attributes considered were: sex (male, female), age, and clinical features, namely cough, dyspnea, fever, oxygen saturation ≤ 95%, and the presence of cardiovascular disease, chronic respiratory disease, chronic renal disease, diabetes, neurological disease, neoplasms, alcoholism, and smoking. The aim was to build an RF based on these attributes and to identify which are the most important in predicting death in elderly patients with COVID-19 in Pernambuco. The work was implemented in the Python language, 11 using the RF algorithm available in the Sklearn module, according to the documentation available at: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Cross-validation with K=10 was employed to calculate the performance and the importance of the attributes. The methodology flow chart, illustrated in Figure 1, shows how the metrics presented in the results were calculated.
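The workflow described above can be sketched as follows. This is not the authors' code: it uses synthetic data with 14 attributes as a stand-in for the real database, and the hyperparameters, the shuffling of the folds, the 0.5 decision threshold, and the averaging of importances across folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in (hypothetical) for the 14 attributes and binary outcome.
X, y = make_classification(
    n_samples=600, n_features=14, n_informative=5, random_state=0
)

# K=10 as stated in the text; shuffling and the seed are assumptions.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

y_prob = np.zeros(len(y))  # out-of-fold probability of the positive class
fold_importances = []
for train_idx, test_idx in folds.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    y_prob[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    fold_importances.append(model.feature_importances_)

# Assumed decision threshold of 0.5 on the predicted probability.
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

sensitivity = tp / (tp + fn)                 # true positive rate
accuracy = (tp + tn) / (tp + tn + fp + fn)   # hit rate
auc = roc_auc_score(y, y_prob)               # area under the ROC curve
mean_importance = np.mean(fold_importances, axis=0)

print(f"sensitivity={sensitivity:.3f} accuracy={accuracy:.3f} AUC={auc:.3f}")
print("mean importance per attribute:", np.round(mean_importance, 3))
```

Pooling the out-of-fold predictions yields a single confusion matrix over all examples; computing the metrics per fold and averaging them is an equally common alternative, and the text does not state which was used.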

Results
The mean age (± standard deviation) was 72.94 ± 9.55 years, with a median of 71.0 years. The mean age of the patients who recovered and of those who died was 70.95 ± 9.06 and 75.70 ± 9.52 years, respectively. Female patients accounted for 3821 (51.04%) and male patients for 3665 (48.96%) of the sample. The overall case fatality rate was 41.81%. The lethality rate by age group was 29.49% for 60-69 years, 45.89% for 70-79 years, and 57.65% for over 80 years. In regard to the symptoms presented by the overall group, 4860 (64.92%) had cough, 4403 (58.82%) fever, 3773 (50.40%) dyspnea, and 2614 (34.92%) peripheral O2 saturation ≤ 95%. However, in the group of patients who died, the most frequent clinical manifestation was dyspnea, 2244 (71.69%). In relation to comorbidities, the most frequent in the entire sample were Cardiovascular Diseases, 1298 (17.34%), and Diabetes Mellitus, 1081 [...] Table 1.
The RF classifier correctly predicted the outcome of 78.33% of the patients in the database. To measure the performance of the classification, a confusion matrix was created and some metrics were adopted, as shown in Table 2. In predicting death, the RF algorithm showed a sensitivity of 0.784 and an accuracy of 0.783, with an area under the ROC curve (AUC) of 0.839. Furthermore, the importance of the attributes showed that age (0.302), the presence of cardiovascular disease (0.252), and oxygen saturation less than or equal to 95% (0.212) are the three most important features for predicting death in elderly patients with COVID-19, as shown in Figure 2.

Discussion
Age was the most important attribute related to death, with an importance of 0.302. While the overall lethality rate in Pernambuco at the end of the first three months of the pandemic was 8.25%, 12 the lethality rate for elderly patients in the same period was 41.81%. This value was much higher than the rates found in the literature, which ranged from 5.6% to 28.6%. 13,14 The analysis of lethality by age group also showed higher rates than those reported in Italy, where fatal cases increased mainly after 70 years of age: 12.5% in the 70-79 years range, 19.7% in the 80-89 years range, and 22.7% after 90 years. 15 It is worth noting that the high lethality rates found in Pernambuco reflect a period when testing was not widely available.
Several articles also show that the presence of comorbidities is a risk factor for adverse clinical outcomes such as death, 16-21 with cardiovascular disease consistently being one of the most prevalent comorbidities in the samples analyzed. In this study, the RF algorithm showed that cardiovascular disease was the second most important feature for predicting death in elderly people with COVID-19, with a value of 0.252. Although COVID-19 is best known for causing damage to the respiratory system, it is also known that it can compromise or worsen cardiovascular parameters. Furthermore, a retrospective study showed that 33% of deaths from COVID-19 were attributed to cardiorespiratory failure and 7% to isolated heart failure. 22
The third variable highlighted for death prediction, with an importance value of 0.212, was peripheral oxygen saturation ≤ 95%, in agreement with the current literature. 23 The Ministry of Health even considers the diagnosis of Severe Acute Respiratory Syndrome (SARS) for every individual, of any age, with influenza syndrome and presenting signs of hypoxemia, such as O2 saturation ≤ 95% in room air. 24 Furthermore, studies emphasize that early recognition of hypoxia and administration of

The importance of the attributes: analyzing how many nodes of the trees, which use a given attribute, reduce the overall impurity of the forest.