Classification and regression trees for predicting the risk of a negative test result for tuberculosis infection in Brazilian healthcare workers: a cross-sectional study

REV BRAS EPIDEMIOL 2021; 24: E210035 ABSTRACT: Objectives: Healthcare workers (HCWs) have a high risk of acquiring tuberculosis infection (TBI). However, annual testing is resource-consuming. We aimed to develop a predictive model to identify HCWs best targeted for TBI screening. Methodology: We conducted a secondary analysis of previously published results of 708 HCWs working in primary care services in five Brazilian State capitals who underwent two TBI tests: tuberculin skin test and Quantiferon®-TB Gold in-tube. We used a classification and regression tree (CART) model to predict HCWs with negative results for both tests. The performance of the model was evaluated using the receiver operating characteristics (ROC) curve and the area under the curve (AUC), cross-validated using the same dataset. Results: Among the 708 HCWs, 247 (34.9%) had negative results for both tests. CART identified that physicians and community health agents were about twice as likely to be uninfected (probability = 0.60) as registered nurses or nursing aides (probability = 0.28) when working less than 5.5 years in the primary care setting. In cross-validation, the predictive accuracy was 68% [95% confidence interval (95%CI): 65 – 71], AUC was 62% (95%CI 58 – 66), specificity was 78% (95%CI 74 – 81), and sensitivity was 44% (95%CI 38 – 50). Conclusion: Despite the low predictive power of this model, CART allowed the identification of subgroups with a higher probability of having both tests negative. The inclusion of new information related to TBI risk may contribute to the construction of a model with greater predictive power using the same CART technique.


INTRODUCTION
Tuberculosis (TB) is a global epidemic that caused an estimated 1.2 million deaths in 2019 1 . However, this disease is not only curable but also preventable through treatment of TB infection (TBI). Healthcare workers (HCWs) are at high risk of TBI because of occupational exposure 2,3 , and recent TBI is one of the risk factors for progression to active disease. Thus, HCWs without evidence of previous TBI should be tested annually for conversion with one of the available tests, such as the tuberculin skin test (TST) or the interferon-gamma release assays (IGRA), and treated if they convert 4 . Those with positive results should be carefully followed up, but no re-testing or treatment is recommended.

We hypothesized that a CART model for predicting negative results for both TST and IGRA would have good overall performance in terms of accuracy, sensitivity, specificity and area under the curve (AUC).

STUDY DESIGN, SETTING AND SOURCE OF DATA
We analysed factors potentially associated with the risk of having negative results for both tests for TBI in a database from a previously conducted cross-sectional observational multicenter study 7,8 . The database contains information on sociodemographic characteristics, health facility, and work conditions collected from face-to-face interviews conducted between June 2011 and September 2013 7,8 , as well as BCG status (through observation of scar) and Quantiferon® TB Gold in-tube (QFT) and TST (PPD RT23 - Tuberculin PPD Evans 2 TU) results from HCWs working in primary care services in five Brazilian State capitals, those with the highest TB incidence rates at the time of data collection: Cuiabá (TB incidence of 52/100,000), Manaus (71/100,000), Salvador (60/100,000), Porto Alegre (87/100,000) and Vitória (40/100,000) 9 .
Since the 1990s, primary care in Brazil has been progressively shifted to the Family Health Strategy model, in which residents of the adjacent area are actively cared for by family health teams composed of one medical doctor (MD), one registered nurse (RN), one to two technical nurse assistants (TNA) and six to ten community health agents (CHA) who pay regular home visits, regardless of the demand for care 10 .
Some cities have rapidly adopted the Family Health Strategy model, with decentralized TB care, whereas others still use a mixed approach with TB care centralized in specialized services. For the original study, primary care units were selected by simple random sampling. The selected units were classified into three categories: "traditional" primary health care units, Family Health Strategy units, and "traditional" primary health care units with CHA program 11 . The number of health units was defined based on the number of professionals in each unit and stratified by type of service organization in Brazil.
All HCWs from the selected units were invited to participate in the study, and those who signed the informed consent were eligible. The description of the sampling procedure is available in the article by Prado et al. 7 . Exclusion criteria were known human immunodeficiency virus (HIV) infection, a positive rapid HIV test, past or current active TB, any positive test for TBI in the past, and pregnancy. HCWs who did not return for TST reading were also excluded from this analysis. The inclusion criterion was being an MD, RN, TNA or CHA.
The original study was approved by the ethics committee of the Federal University of Espírito Santo on March 3, 2010 (number: 007/10). The currently analysed database is anonymous.

Outcome
Because there is no reference standard for TBI diagnosis, we considered any positive test as evidence of TBI 4 . Thus, our outcome was set as having negative results for both TST and QFT, as a surrogate for absence of TBI. For TST, a cut-off value of 5 mm was used, according to Brazilian guidelines 12 .
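The outcome definition can be written down as a small rule. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes that a TST induration equal to or above the 5 mm cut-off counts as positive, which the text does not state explicitly.

```python
def both_tests_negative(tst_induration_mm: float,
                        qft_positive: bool,
                        tst_cutoff_mm: float = 5.0) -> bool:
    """Return True when both TST and QFT are negative (the study outcome).

    Assumption: an induration >= cut-off is read as a positive TST.
    """
    tst_positive = tst_induration_mm >= tst_cutoff_mm
    # Any positive test is taken as evidence of TBI, so both must be negative.
    return (not tst_positive) and (not qft_positive)
```

Under this rule, an HCW with a 4 mm induration and a negative QFT would be coded as having the outcome, while a single positive test of either type would not.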

Predictive variables
After review of the literature 7,13-20 , the following predictive variables were selected from the database: TB incidence in the city, classified as intermediate (TB incidence < 50/100,000) versus high (TB incidence ≥ 50/100,000); type of TB care provided in the city (centralized or decentralized); type of health clinics (traditional, traditional with CHA program or family health clinics); working in a primary care unit with specialized services including TB services; working in specialized TB services; air flow in the unit (no open door or window versus at least one open door or window); professional category (MD, CHA, TNA and RN); years served in a primary health care unit; work in a highly TB-exposed setting (necropsy room, radiotherapy and respiratory disease wards); home visits to nursing home, asylum or prison; assistance of patients with active TB; use of N95 masks (always versus not always/never); TB training; household contact of active pulmonary TB; morbidities (diabetes mellitus, chronic cardiovascular disease, rheumatic disease, respiratory disease, chronic kidney disease or chronic liver disease); tobacco use (smokers versus non- or ex-smokers); age (years) and alcohol use (no or yes).

CLASSIFICATION AND REGRESSION TREES ANALYSES
Supervised learning methods can be used as a strategy for the prediction of test results. In supervised learning, input variables (X) and an output variable (Y) enter an algorithm that learns the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that the output (Y) can be predicted from new input data (X) 21 . Supervised machine learning algorithms include CART.
CART models were developed using an algorithm first introduced by Breiman et al. 22 . These models offer clear interpretation by relating continuous or categorical predictive variables to the outcome of interest based on optimal splitting criteria from an automated algorithm. CART is a non-parametric method that builds a binary classification system (tree) through recursive partitioning, so that the data set is successively split into increasingly homogeneous subgroups 6 . First, the variable that optimally separates the outcome groups is selected, and a binary split is made. Then, within each of the two resulting subgroups, further variables are selected, and a second level of binary splits is made. Variables can be used more than once within a model. Variable splits are made recursively until stopping criteria are reached, and a terminal node is defined with a prediction for the specific subset of data in this node 5,6 .
The trees should be read from top to bottom in order to obtain a prediction for a specified outcome. Starting at the top of a tree, branches corresponding to observed features are followed until a terminal node is reached, where the fraction of patients contained in each outcome group is displayed. These fractions may be used to assess the probability that a patient falls within each outcome category 5 .
CART models were constructed using all HCWs, followed by 10-fold cross-validation on the same complete dataset. The R package rpart was used to develop the CART models 23 . For cross-validation, the minimum number of observations in terminal nodes was 15.
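A single step of recursive partitioning can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Gini impurity as the splitting criterion (the text does not state which criterion was used), handles only one numeric predictor, and uses hypothetical toy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold_split(values, labels):
    """Find the binary split `value < t` that minimises weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (threshold, weighted_impurity); threshold is None if no split helps.
    """
    n = len(labels)
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Hypothetical toy data: years in primary care and outcome (1 = both tests negative).
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
uninfected = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(best_threshold_split(years, uninfected))  # (5.5, 0.0)
```

A full tree would apply the same search to every candidate variable, keep the best split, and then repeat the procedure within each resulting subgroup until a stopping rule is met.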
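The division of the dataset into 10 subsets can be illustrated with a short sketch. The helper name, the random seed and the interleaved assignment are assumptions for illustration; rpart performs its own internal partitioning, which is not reproduced here.

```python
import random

def k_fold_indices(n_obs, k=10, seed=0):
    """Randomly assign observation indices 0..n_obs-1 to k roughly equal folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Interleaved slicing keeps fold sizes within one observation of each other.
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(708, k=10)
for fold_number, held_out in enumerate(folds):
    training = [i for j, fold in enumerate(folds) if j != fold_number for i in fold]
    # A tree would be grown on `training` and evaluated on `held_out` here.
    assert len(training) + len(held_out) == 708
```

Each observation is held out exactly once, so every HCW receives a prediction from a tree that was grown without that HCW's data.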

DATA ANALYSES
Information was encoded and stored anonymously in an Excel for Windows ® database; data analyses were performed using RStudio software 24 and Stata 13 software 25 . The variance inflation factor (VIF) was used to evaluate the presence of collinearity between the predictive variables (R package usdm). A variable was classified as not collinear if its VIF was less than 10. The association of the predictive variables with the outcome was calculated in bivariate analyses as the crude odds ratio (OR) with 95% confidence intervals (95%CI) and p-values.
The performance of the model was evaluated using the receiver operating characteristics (ROC) curve and the AUC, cross-validated using 10 subsets of the same dataset. The accuracy, sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) were also calculated. For this, we generated a discrete variable in which each terminal node of CART received an increasing score according to the probability of occurrence of the outcome of interest. This score was equal for nodes 7 and 10. The cut-off point used to calculate discriminatory capacity was selected in order to maximize the sensitivity and specificity (probability of outcome threshold = 0.35).
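For readers less familiar with these quantities, the sketch below shows how a crude OR with a 95%CI and a VIF can be computed. It assumes the Woolf (log-based) interval, which the text does not specify, and uses the standard identity VIF = 1 / (1 - R²); the function names and the 2×2 counts are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Woolf (log-scale) confidence interval from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    Assumption: all four cells are non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def vif_from_r2(r_squared):
    """VIF of a predictor, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical counts, for illustration only.
print(odds_ratio_ci(20, 10, 10, 20))
print(vif_from_r2(0.5))  # 2.0
```

With this identity, the VIF threshold of 10 corresponds to a predictor for which about 90% of the variance is explained by the remaining predictors.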
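The performance measures named above can be sketched in a few lines. This is an illustration under stated assumptions: an observation is predicted to have the outcome when its node probability is at or above the 0.35 threshold (the direction of the inequality is not given in the text), the AUC is computed with the pairwise (Mann-Whitney) formulation, and the six observations are hypothetical.

```python
def confusion_metrics(probabilities, outcomes, threshold=0.35):
    """Sensitivity, specificity, PPV, NPV and accuracy at a probability threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, outcomes):
        predicted = p >= threshold  # assumption: ">=" counts as predicted outcome
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def auc(probabilities, outcomes):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in zip(probabilities, outcomes) if y == 1]
    neg = [p for p, y in zip(probabilities, outcomes) if y == 0]
    score = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
                for pp in pos for pn in neg)
    return score / (len(pos) * len(neg))

# Hypothetical node probabilities and observed outcomes (1 = both tests negative).
probs = [0.60, 0.60, 0.59, 0.28, 0.28, 0.27]
observed = [1, 1, 0, 0, 1, 0]
print(confusion_metrics(probs, observed))
print(auc(probs, observed))
```

Because a tree assigns the same probability to every observation in a terminal node, ties are common, which is why the pairwise AUC gives half credit to tied pairs.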

RESULTS
Out of 740 enrolled HCWs, 708 were included. The reasons for exclusion were: 22 (3%) HCWs did not return for TST reading, 7 (1%) had active TB or were under TB treatment, 1 (0.1%) was HIV positive (positive rapid test), and 2 (0.3%) refused to have blood drawn (Supplementary Figure 1). A BCG scar was observed in 87.6%. The mean age was 41.4 [standard deviation (SD) = 9.9] years, and 633 (89.4%) were female (Supplementary Figure 1); mean time of work was 9.5 (SD = 6) years. Among the HCWs included, 247 (34.9%) were negative for both tests and 461 (65.1%) had at least one positive test, with 57.3% presenting a positive TST and a negative QFT. We did not identify collinearity between the predictive variables.
Among  Table 1) and being older (OR = 0.97, 95%CI 0.95 – 0.98) were associated with a lower chance of presenting negative results for both TST and QFT.
The CART model (Figure 1) also identified the professional category as the most important predictor of negative test results. The following sets of features were associated with a higher probability of having negative results for both tests:
• MD or CHA working for less than 5.5 years in primary care (node 11, probability = 0.60);
• MD or CHA working for more than 5.5 years in primary care in a city with decentralized assistance to patients with TB and with any morbidity (node 10, probability = 0.60);
• MD or CHA working for more than 14.5 years in primary care in a city with centralized assistance to patients with TB (node 7, probability = 0.59).
Conversely, the following features were associated with a lower probability of having the outcome (both tests negative):
• MD or CHA working for more than 5.5 but less than 14.5 years in primary care in a city with centralized assistance to patients with TB (node 6, probability = 0.27);
• TNA or RN (node 2, probability = 0.28).
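The node descriptions above can be read as a sequence of nested rules. The sketch below encodes them as a lookup function; it is a reconstruction from the text, not the fitted rpart object. The order of the splits is inferred from the node descriptions, node 9 is assumed to be MD or CHA with more than 5.5 years in a decentralized city and no morbidity, and its probability is left as None because it is not reported in the text. The function and argument names are hypothetical.

```python
def predict_node(category, years_primary_care, tb_care, any_morbidity):
    """Return (terminal node, probability of both tests negative) for one HCW.

    category: "MD", "CHA", "TNA" or "RN"
    tb_care: "centralized" or "decentralized" (type of TB care in the city)
    """
    if category in ("TNA", "RN"):
        return 2, 0.28
    if category not in ("MD", "CHA"):
        raise ValueError("category must be MD, CHA, TNA or RN")
    if years_primary_care < 5.5:
        return 11, 0.60
    if tb_care == "decentralized":
        if any_morbidity:
            return 10, 0.60
        return 9, None  # assumption: probability for node 9 is not given in the text
    if years_primary_care < 14.5:
        return 6, 0.27
    return 7, 0.59

print(predict_node("CHA", 3, "centralized", False))  # (11, 0.6)
```

For example, a CHA with 3 years in primary care falls in node 11, whereas an RN is assigned to node 2 regardless of the other variables.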
Performance measures and the associated confidence intervals for the CART model are presented in Table 1 for a cut-off equal to 0.35. The sensitivity was 44%, with a predictive [...]

Figure 1 legend: Terminal nodes containing predictions for new observations include 2, 6 and 9 (which predict the risk for at least one positive test) and 7, 10 and 11 (which predict the risk for negative tests). To obtain a prediction, one starts at the top of the tree and follows the arrow corresponding to data for the new observation until a terminal node is reached.

DISCUSSION
In this study, we found a high prevalence of TBI among HCWs, with only 34.9% being negative for both TST and QFT. This finding indicates that the strategy of predicting those without evidence of TBI, who are at risk for conversion, and focusing testing efforts on them may contribute to a better allocation of human and supply resources. The use of CART could be an alternative to this end. However, the current model, which used the available variables, had a low predictive power. Nevertheless, CART was still useful for selecting subgroups that are most likely to have negative results for both tests, and are thus at risk for conversion and worth testing. Most importantly, the current exercise pointed out a method that could be useful if further variables were available.
CART identified that MDs and CHAs were about twice as likely to be uninfected as TNAs or RNs when working less than 5.5 years in the primary care setting, with high specificity (78%) despite a PPV of 52% because of the low prevalence of this condition in a high-TB-burden country. Among those with more than 5.5 years of work, existing morbidities and work in a city with decentralized assistance to TB patients also had high specificity.
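The dependence of the PPV on prevalence follows from Bayes' rule. The sketch below is an arithmetic check only, using the rounded values reported in this article (sensitivity 44%, specificity 78%, and 34.9% of HCWs negative for both tests); the function name is hypothetical, and small differences from the published estimates are expected because of rounding and cross-validation.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(outcome | predicted outcome), from Bayes' rule."""
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)

# Rounded values reported in the text; "prevalence" is the share with both tests negative.
ppv = positive_predictive_value(0.44, 0.78, 0.349)
print(round(ppv, 2))  # 0.52
```

The result is consistent with the reported PPV of 52%: with only about one third of HCWs negative for both tests, even a specificity of 78% leaves nearly as many false positives as true positives.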
Interestingly, CART identified mostly occupational characteristics for TBI, as opposed to the bivariate analysis, which also identified individual characteristics. In a systematic review of 85 studies from low- and middle-income countries published from 2005 to 2017, occupational category and years of work had already been reported to be independent risk factors for both prevalent and incident TBI 26 . However, in our study, CART identified subgroups with distinct characteristics at higher risk. RNs and TNAs had a lower probability of negative test results regardless of any other variable, while among MDs and CHAs, years of work in primary health care influenced this probability, which could double in those having worked for less than 5.5 years (node 11). Thus, unlike bivariate or multivariate analyses, CART points to sets of characteristics that define identifiable subgroups, allowing a specific strategy to be proposed for each of these groups.
Time of work was also clearly relevant in other subgroups (nodes 6, 7 and 11). One difficult-to-explain finding was the high probability of negative results among HCWs working in cities in which TB care is centralized (node 7). This finding should be interpreted with caution since the node contains only 17 observations.
Our study has a few limitations. First, the cross-sectional design does not allow the prediction of the risk for conversion (incidence of TBI), which would be more informative than the risk for absence of prevalent TBI 27 . Second, there is no reference test for detecting TBI; thus, the estimated TBI prevalence might be affected by the performance of TST and QFT, and persons with discordant results (one positive and one negative test) might be uninfected. In addition, the cross-validation was performed with the same dataset used to build the CART model, and external data should be used to further validate it 28 . Third, the overall accuracy of the CART model was low because the three nodes with the highest pretest probability (nodes 7, 10 and 11) have only 127 observations. Finally, given the possibility of cross-reaction between TST and BCG 31 , the result of this test may have been affected by previous BCG vaccination, since 87.6% of HCWs had a BCG scar.
Despite these limitations, we here demonstrate the possibility of using CART to develop a simple and intuitive predictive model for absence of TBI in HCWs considering a strict criterion, i.e., negative results for both QFT and TST. CART identified specific subgroups that should be prioritized for targeted TBI testing, such as those with less time of work or those with existing morbidities working outside specialised TB services. The inclusion of new information related to TBI risk among HCWs at this level of care may contribute to the construction of a model with greater predictive power.