Predictive performance of BI-RADS magnetic resonance imaging descriptors in the context of suspicious (category 4) findings*

Objective To determine the positive predictive value (PPV) and likelihood ratio for magnetic resonance imaging (MRI) characteristics of category 4 lesions, as described in the Breast Imaging Reporting and Data System (BI-RADS®) lexicon, as well as to test the predictive performance of the descriptors using multivariate analysis and the area under the curve derived from a receiver operating characteristic (ROC) curve. Materials and Methods This was a double-blind review study of 121 suspicious findings from 98 women examined between 2009 and 2013. The terminology was based on the 2013 edition of the BI-RADS. Results Of the 121 suspicious findings, 53 (43.8%) were proven to be malignant lesions, with no significant difference between mass and non-mass enhancement (p = 0.846). The PPVs were highest for masses with a spiculated margin (71%) and round shape (63%), whereas segmental distribution achieved a high PPV (80%) for non-mass enhancement. Kinetic analyses performed poorly, except for type 3 curves applied to masses (PPV of 73%). Logistic regression models were significant for both patterns, although the results were better for masses, particularly when kinetic assessments were included (p = 0.015; pseudo R2 = 0.48; area under the curve = 90%). Conclusion Some BI-RADS MRI descriptors have high PPV and good predictive performance-as demonstrated by ROC curve and multivariate analysis-when applied to BI-RADS category 4 findings. This may allow future stratification of this category.


INTRODUCTION
Mainly because of its high sensitivity, magnetic resonance imaging (MRI) has progressively attained a prominent position in the diagnosis of breast cancer and screening of high-risk women (1,2) . That triggered the widespread dissemination of the method and brought challenges to referring physicians, particularly breast care specialists and oncologists; old treatment paradigms had to be reassessed in light of the (relatively) new technique, leading to a fair amount of uncertainty (3,4) .
One frequent claim concerns the proportionately low specificity of breast MRI when compared with mammography and ultrasound (5,6) . This argument, albeit fallaciousgiven that the slightly lower specificity of breast MRI is partly credited to its unparalleled sensitivity-is frequently coupled with questions regarding the high number of false-positive results reported (7,8) . These potential limitations would increase the numbers of unnecessary operations and aggressive procedures applied to any suspicious abnormality (3) . As a consequence, the capability of MRI to distinguish between benign and malignant lesions with accuracy has always been under scrutiny within the medical community (9) .
In an effort to address some of these matters, the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS ® ) included MRI in its two latest editions (10,11) . The concepts of standardized terminology (lexicon) and general assessment categories were adjusted to the particularities of MRI, similar to what had previously been done for mammography and ultrasonography. Nevertheless, stratification guidelines for MRI category 4 findingswhich have estimated cancer likelihoods ranging from > 2% to < 95% (11) -were not issued, in contrast to what is already the norm for the other imaging methods (12,13) . In order to achieve this feat, it is paramount to examine the predictive values of individual descriptors in this particular context.
The aim of this study is to establish the positive predictive values (PPVs) and positive likelihood ratios (PLRs) for BI-RADS descriptors applied to category 4 abnormalities. We also identified the most cancer-related features and probed them in a multivariate model.

MATERIALS AND METHODS
This retrospective cross-sectional study stems from a graduate (sensu stricto) project sponsored by an academic medical institution and a regional private referral clinic in women's healthcare. The independent review board of the medical school approved the study (Report no. 518.466) and waived the requirement for written informed consent.

Study design and population
Between November 2009 and December 2013, 1973 breast MRI studies were performed at our private practice. Of those 1973 studies, 238 (12.06%) revealed one or more suspicious findings (BI-RADS 4 lesions). It is our standard practice to schedule visits with premenopausal women between the 5th and 14th days of their menstrual cycle. This protocol is bypassed only when the requesting physician considers the situation urgent.
Our information technology team restored and anonymized 158 MRI studies to an independent image databank linked to a restricted version of the electronic medical record database of the institution (80 records could not be accessed due to random data corruption secondary to defective media). Exclusion criteria were: Records related to repeated investigations of the same patient without new findings were excluded (only the first one was included in the analyses), as were those related to small lesions (less than 5 mm), those including reports of pronounced image artifacts (such as patient movement and field inhomogeneity), and those in which there was no histopathologic outcome or adequate site correspondence between the MRI finding(s) and the pathologic description. We decided to include 13 records related to lesions with diagnostic cytology only, because they were conclusive and had been monitored for at least two years. We also included one record related to a patient with two suspicious areas of enhancement that disappeared during follow-up and were not more aggressively investigated (Figure 1). We excluded a total of 60 records-42 due to lack of follow-up, 16 because of image artifacts, and two because A B the lesions were considered foci. Therefore, the final study sample comprised 98 patients, with ages ranging from 28 to 88 years (mean of 51.5 years), among whom 121 findings were classified as BI-RADS 4 lesions.

MRI protocol
Images were acquired in a single 1.

Image analysis and data collection
The selected examinations were stored in an offline Advantage Windows workstation, version 4.4, with the Functool Software Package (GE Healthcare) for post-processing. Two examiners with at least 1000 breast MRI readings to their credit, blinded to the clinical data and pathologic outcomes, independently reviewed the images and described suspicious findings strictly according to the BI-RADS MRI lexicon. No studies were reclassified under a different BI-RADS category after review. Subsequently, the reviewers reanalyzed divergent descriptions and achieved a consensus in all cases. The same procedure was implemented for the semi-quantitative kinetic assessment, with regions of interest (containing at least four pixels) positioned over the most suspicious area of enhancement, and the kinetic curve generated was classified as type 1, 2 or 3, corresponding to the delayed phases (persistent, plateau, and washout, respectively).
Mass and non-mass enhancement (NME) were equally described in terms of the internal enhancement pattern, T2 signal intensity, and type of curve, mostly based on the ter-minology of the 2013 (5th) edition of the ACR BI-RADS (11) . Masses were termed specifically for shape and margin, while symmetry (when applicable) and distribution were defined for NME, because we opted to keep symmetry as an NME finding, according to the 2003 edition of the ACR BI-RADS (10) .

Pathologic outcome
A pathologist specialized in breast diseases, with more than 10 years of experience, reevaluated and dichotomized the available reports as positive or negative for malignancy. By doing so, lesions considered to be indeterminate or highrisk in nature (atypical findings, lobular neoplasia, complex sclerosis, or papillary lesions) were categorized as non-malignant. In cases with mixed histological features, the most aggressive pattern was used as the grouping indicator. The 13 cases investigated only by cytology were also validated, because they were compatible with imaging findings and showed benign evolution during follow-up.

Statistical analysis
Lexicon descriptors were categorized and correlated with the outcome (malignant or non-malignant). We considered positive cancer diagnoses as true positives for individual descriptors and, based on that assumption, PPVs and PLRs were calculated with 95% confidence intervals (95% CIs).
The association between independent descriptors and cancer frequencies were explored using the chi-square test of independence and, when fewer than five occurrences were expected, by Fisher's exact test. We adopted a 5% level of significance (p < 0.05) for two-tailed tests. Predictor variables were assessed by univariate logistic regression, and those with a p < 0.20 were inserted into separate multivariate predictive models for masses and NME. Each covariate had a minimum of six least frequent outcomes, and, because there was more than one finding for some participants (average of 1.24 findings), the intracluster correlation coefficient was employed. In addition, should any significant predictor demonstrate relevant collinearity-defined as variance inflation factor above 3.5-it would be discarded. Odds ratios (ORs) and 95% CIs were calculated for significant predictors. The fit and explained variability of the models were assessed by the Hosmer-Lemeshow test and pseudo R 2 . We also generated receiver operating characteristic (ROC) curve for the best model and derived its area under the curve (AUC). For those computations, we used Stata statistical software, version 12.0 (StataCorp LP; College Station, TX, USA).
The pathologic material originated from 21 (17.4%) mammography or ultrasound-guided core biopsies, 3 (2.5%) mammography-guided mammotome excisions, 1 (0.8%) MRI-guided mammotome excision, 12 (9.9%) fine-needleaspirations, and 1 (0.8%) nipple discharge study (the last two with additional 2-year imaging follow-up). Of the 121 suspicious MRI findings, the majority-81 (66.9%)-were defined on the basis of samples obtained during surgical procedures (excisional biopsy or treatment). No pathologic confirmation was obtained for 2 (1.7%) of the suspicious MRI findings: both were NMEs observed in a single participant and progressively disappeared during MRI follow-up (therefore being considered benign functional abnormalities). We did not find a significant difference between surgical and nonsurgical procedures in terms of the cancer yield (p = 0.084 from Fisher's exact test).

Cancer likelihood based on morphology and T2 signal intensity
The main enhancement patterns demonstrated similar frequencies of malignancy, 24 (42.9%) of the 56 masses and 29 (44.6%) of the 65 NMEs being categorized as positive (p = 0.846 from the chi-square test), as shown in Tables 1  and 2. However, 14 (87.5%) of the 16 ductal carcinomas in situ presented as NMEs, whereas the majority-22 (59.5%)of the 37 invasive cancers appeared as masses.
The individual mass descriptors with the highest PPVs were spiculated margin (71%) and round shape (63%), both of which had equally high PLRs (Table 1). In the NME group, segmental distribution had a high PPV (80%), there was one case of multiple regions of enhancement, and there were three cases of clumped internal pattern, all four cases being categorized as positive for malignancy ( Table 2).
In the univariate analyses of descriptors for masses, T2 signal intensity did not achieve the cutoff significance level to be introduced into the multivariate logistic regression (p = 0.252). The model including shape, margin, and internal enhancement as predictor variables could significantly distinguish between malignant and non-malignant lesions (p = 0.038; pseudo R 2 = 0.24), round shape independently reaching significance (OR: 12.91; 95% CI: 2.33-71.45; p = 0.003) and rim enhancement reaching marginal significance (OR: 7.15; 95% CI: 0.96-53.19; p = 0.055). NME-related terms performed more poorly, only distribution and internal enhancement being eligible for multivariate modeling. The logistic regression analysis was significant (p = 0.031; pseudo R 2 = 0.16) mainly because multiple regions of enhancement and clumped internal pattern were seen in a small number of all positive cases.

Cancer likelihood based on enhancement kinetics
Semi-quantitative kinetic analyses displayed different levels of performance when considered in association with the main patterns of enhancement. Washout (type 3) curves did have a high PPV (73%) but only when associated with masses; as for NME, none of the kinetic curves were substantially linked to malignancy (Table 3). When the kinetic analyses were inserted into the predictive models, the one for NME showed no significant improvement. Masses, however, displayed better model fit, with increased significance and explained variability (p = 0.015; pseudo R 2 = 0.48), with four significant adjusted descriptors: round shape, type 3 curve, heterogeneous enhancement, and rim enhancement ( Table 4). The ROC curve for this model showed an AUC of 90% (Figure 2).

DISCUSSION
The capability of MRI to differentiate between malignant and non-malignant findings has occasionally been called into question because of the overlapping characteristics between the two and the variable predictive values (14)(15)(16) . Suspicious (category 4) lesions are even more problematic, because they present, by definition, an unacceptably wide * Data in parentheses are percentages of the total number of non-mass enhancement (n = 65), except on the first row, where they are percentages of the total number of lesions (n = 121). † Expressed as %. -, Values that could not be calculated due to a lack of the finding in the study sample. * Data in parentheses are percentages of the total number of lesions by main enhancement pattern: mass and non-mass enhancement. † Expressed as %. range of malignancy likelihoods (11,17) . In this study, we displayed positive predictive measures for each BI-RADS descriptor and applied simple multivariate logistic regression models to test their performance in this context. The observed results indicate that stratification of category 4 findings based on lexicon-established terminology might also possible for MRI. There are a large number of works about predictive values of breast MRI characteristics (15,18,19) . However, to our knowledge, this is the first study to focus exclusively on BI-RADS 4 lesions. Most studies of this topic have investigated categories 4 and 5 together, and some have also included categories 0 through 3 (14,15) . The reader should bear this in mind if comparing our results with those of other authors.
The prevalence of malignancy in our sample (43.8%) is within the range specified by the BI-RADS as one of the MRI screening benchmarks (11) . The BI-RADS stipulates that the biopsy yield of malignancy (PPV 3 )-comprising categories 4 and 5-should be within the 20-50% range. Our finding is also consistent with most of the reported values, which range from 20% to 60% (20,21) . Nevertheless, to account for pretest probability, we provided the corresponding likelihood ratios.
We detected almost equivalent frequencies of malignancy between masses and NMEs, similarly to Liberman et al. (18) . However, the vast majority of intraductal carcinomas (87.5%) appeared as NME, in accordance with most of the published results, in which reported frequencies range from 53.8% to 90% (18,22,23) . Nevertheless, among all of the non-mass findings, ductal carcinoma in situ was in the minority: only 14 out of 29 (48.3%) had no invasive component. One plausible explanation for this is that when more than one pathologic feature was found in a single lesion, we focused only on the most aggressive feature.
The single mass-specific descriptor with the highest predictive measures was spiculated margin (PPV of 71%), because this term is often considered highly suggestive of malignancy. Unexpectedly, round shape was the second most predictive descriptor (PPV of 63%). Round shape is variably associated with cancers, particularly of the triple-negative subtype, depending on the imaging method studied. Liberman et al. (24) noted a 42% frequency of malignancy as a function of that descriptor applied to mammography. In contrast, Rahbar et al. (25) found that only 6% of the round masses identified by ultrasonography were malignant. The other patternspecific terms in our sample, including irregular margin, had poorer predictive performances, with PPVs below 60%.
There were two non-mass descriptors with PPVs of 100%-clumped internal pattern and multiple regions of enhancement-the latter representing an invasive multicentric lobular carcinoma (Figure 3). Although the first term is commonly linked to higher cancer frequencies (14,26) , the small number of both findings in our sample precludes further conclusions. Segmental distribution also had substantial predictive power (PPV of 80%), although not enough to be included as a category 5 descriptor, in contrast with the findings of Kuhl et al. (27) .
Logistic regression analyses showed good performance for masses, especially when kinetic curves were considered, as demonstrated by the ROC curve (AUC of 90%), although not for NME. These findings are somewhat similar to those reported by Yamaguchi et al. (28) , who, like us, found that only a few of the adjusted variables were significantly predictive. To account for correlated BI-RADS descriptors, as shown by Benndorf et al. (29) , we ran model diagnostics and excluded predictors with high degrees of collinearity.
Our group elected to define the kinetic signal intensity graphic only by the composite description of the early and delayed enhancement phases, characterized as type 1, 2, or 3 curves. By taking this approach, analogous to that proposed by Kuhl et al. (30) , we avoided the independent appraisal of the two distinct phases, which departs from our clinical routine. We observed a likelihood of malignancy for type 3 curves higher than that previously reported (15,30) , although only when associated with masses.
This study has certain limitations. We excluded a considerable number of examinations, for various causes. How- Figure 2. Receiver operating characteristic curve generated from the logistic regression model for nodules. Shape, margin, internal enhancement, and kinetic curve assessment were the independent variables. The performance of the model was considered highly satisfactory because it achieved an area under the curve of 90%. ever, most were excluded because of digital storage issues. Although we do not identify any systematic bias, because the data corruption was apparently random, we acknowledge that there is a potential for unknown bias. For some of the descriptors, the data were insufficient to generate reliable PPVs and PLRs with manageable 95% CIs. In addition, the PPVs presented here are valid only when considering the stratification of BI-RADS 4 findings (4A, 4B, and 4C) and should not be generalized to other categories. Furthermore, some terms-"dark internal septations", "symmetric", "diffuse", and "clustered ring"-were not utilized in this sample. We also chose to keep 15 findings (12.4%) that had no histopathologic validation. That was done in order to minimize selection bias and avoid an increase in the number of positive outcomes, which would have occurred had we kept only the findings submitted to aggressive investigation. The small number of MRI-guided procedures was an expected shortcoming, despite our facility being the only center in the region that is equipped for such procedures. We also had some correlated observations, as some patients presented more than one finding. However, we corrected them statistically by running multivariate models with cluster-robust standard errors.

CONCLUSION
We have shown that PPVs of certain BI-RADS descriptors can be used to discriminate malignant outcomes in the particular context of MRI category 4 abnormalities. The results presented here suggest that stratification of these lesions into low (4A), moderate (4B), and high (4C) suspicion subgroups is feasible and might be achieved in future editions of the ACR BI-RADS as larger studies are published.