Print version ISSN 1807-5932
Clinics vol.67 no.2 São Paulo 2012
Interobserver concordance in the BI-RADS classification of breast ultrasound exams
Maria Julia G. CalasI,II; Renan M.V.R. AlmeidaI; Bianca GutfilenII; Wagner C.A. PereiraI
I Programa de Engenharia Biomédica COPPE, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
II Departamento de Radiologia, Faculdade de Medicina, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
INTRODUCTION
Breast ultrasound is an important complement to the clinical/mammographic investigation of breast lesions. This operator-dependent method entails real-time image detection and analysis and requires extensive training and experience in identifying and differentiating between benign and malignant lesions (1-4).
Lesion contour and shape are considered to be the main features that allow differentiation between benign and malignant lesions, the former with high sensitivity and the latter with high specificity. Many authors believe that combined ultrasound methods may yield greater accuracy (5-10). However, using morphological characteristics for lesion differentiation demands a high rate of interobserver agreement, an issue that has been extensively examined for mammography but has received less attention for ultrasound. Interobserver agreement is thus a matter of strong concern in clinical radiological practice (3-11).
To better characterize the interobserver agreement in breast ultrasound, this study examined a group of 14 breast imagers who used the Breast Imaging Reporting and Data System (BI-RADS) ultrasound classification on 40 breast lesions. The study was exclusively concerned with lesion categorization agreement among the observers according to the BI-RADS lexicon. The accuracy of the observers was not directly assessed through comparisons with the final lesion histology.
MATERIALS AND METHODS
This study used 40 B-mode echographic images of lesions obtained from 40 patients who were examined at a private institution and who subsequently underwent surgery as indicated by their referring physicians.
The study was approved by an Institutional Ethics Committee, and all of the patients provided written informed consent. All of the examinations were performed by one radiologist using Logic 5 ultrasound equipment (GE Medical Systems, Inc., Milwaukee, WI, USA) with a 12 MHz transducer. Short- and long-axis orthogonal images were recorded for each patient according to the American College of Radiology (ACR) standards (12). The image evaluation criteria were based on six BI-RADS-US categories (12): incomplete (0), negative (1), benign (2), probably benign (3), suspicious (4), and highly suggestive of malignancy (5). The surgical histopathological data were also obtained.
Fourteen breast-imaging radiologists participated in the study. They worked in different institutions but performed similar numbers of ultrasound examinations, mammography examinations, and biopsy procedures and had 4 to 23 years of breast radiology experience (<5 years, n = 2; 5-10 years, n = 8; >10 years, n = 4). The retrospective review was performed on hard copies of the digitized sonographic images. Each observer received a compact disc with images from 40 lesions and a form with the ultrasound morphological criteria and BI-RADS classification. They were instructed to classify the lesions according to this system and were given 30 days to return the material. While no specific training was provided before the study, all of the readers had been using the BI-RADS lexicon since 2005. The observers had no access to clinical or histopathological information from the patients, and all of them complied with the instructions provided. To ensure patient anonymity, the names were removed from the images and materials, and each patient was identified by a code.
The observers' BI-RADS analyses were classified according to their level of concordance: a) total, the same BI-RADS category was assigned; b) partial, different categories were assigned but grouped into negative (2 or 3) and positive (4 or 5) categories so that the biopsy recommendations were the same; and c) disagreement, different categories were assigned (at least for one observer) and produced recommendations for different management plans (biopsy or follow-up). We considered category 0 (incomplete) to be in partial agreement with both the negative (2 or 3) and positive (4 or 5) categories, in the sense that the patient would have to undergo further studies to define the final classification.
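The three concordance levels described above can be sketched in code as follows. This is only an illustration of the stated rules, not the authors' implementation; the function name `concordance_level` is hypothetical, and because the text does not say how category 1 (negative) would be grouped, the sketch leaves it outside both groups.

```python
NEGATIVE = {2, 3}   # grouped as negative: follow-up recommended
POSITIVE = {4, 5}   # grouped as positive: biopsy recommended
INCOMPLETE = 0      # further studies needed before a final category


def concordance_level(ratings):
    """Return 'total', 'partial', or 'disagreement' for one lesion.

    ratings: BI-RADS categories assigned to the lesion, one per observer.
    Assumption: category 1 is not covered by the grouping in the text,
    so any mix that includes it is reported as disagreement.
    """
    categories = set(ratings)
    if len(categories) == 1:
        return "total"  # every observer chose the same category
    # Category 0 counts as partial agreement with either group.
    decided = categories - {INCOMPLETE}
    if decided <= NEGATIVE or decided <= POSITIVE:
        return "partial"  # same management recommendation
    return "disagreement"  # biopsy vs. follow-up recommendations differ


if __name__ == "__main__":
    print(concordance_level([3] * 14))                 # all observers agree
    print(concordance_level([4] * 9 + [5] * 4 + [0]))  # same recommendation
    print(concordance_level([3] * 13 + [4]))           # one observer differs
```

Under these rules a single dissenting observer is enough to move a lesion into the disagreement group, which is why the per-case observer counts matter when the disagreement cases are interpreted.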
The proportions of discordant classifications according to the experience time categories (<5 years, 5-10 years, and >10 years) were analyzed using Chi-square tests.
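As a rough illustration of this kind of comparison, the sketch below computes a Pearson Chi-square test for discordant versus concordant classifications across three experience groups. It is not the authors' analysis (which was run in R); the function name `chi_square_3x2` is hypothetical, and the counts in the usage example are invented for illustration only and are not study data. With three groups and two outcomes the test has two degrees of freedom, for which the p-value has the closed form exp(-x/2).

```python
import math


def chi_square_3x2(discordant, totals):
    """Pearson Chi-square test on a 3 x 2 table (groups x outcome).

    discordant: number of discordant classifications in each of 3 groups.
    totals:     number of classifications made by each group.
    Returns (statistic, p_value) with 2 degrees of freedom.
    """
    if len(discordant) != 3 or len(totals) != 3:
        raise ValueError("exactly three groups are expected")
    overall_rate = sum(discordant) / sum(totals)
    if overall_rate in (0.0, 1.0):
        raise ValueError("both outcomes must occur at least once")
    statistic = 0.0
    for d, n in zip(discordant, totals):
        expected_discordant = n * overall_rate
        expected_concordant = n * (1.0 - overall_rate)
        statistic += (d - expected_discordant) ** 2 / expected_discordant
        statistic += ((n - d) - expected_concordant) ** 2 / expected_concordant
    # Survival function of the Chi-square distribution with df = 2.
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    stat, p = chi_square_3x2(discordant=[9, 38, 24], totals=[80, 320, 160])
    print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

A large p-value from such a test would be read, as in the study, as no evidence that the discordance rate differs among the experience groups.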
The modified Fleiss' kappa index was used to analyze concordance because the data were grouped into six categories (13). The index values were interpreted as follows: poor (k<0), slight (k = 0.00-0.20), fair (k = 0.21-0.40), moderate (k = 0.41-0.60), substantial (k = 0.61-0.80), and almost perfect (k = 0.81-1.00). The analyses were stratified by the BI-RADS categorization and by the data grouping described above. The statistical analysis was performed using the R Project for Statistical Computing software (14).
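For readers unfamiliar with the index, the following is a minimal sketch of the standard Fleiss' kappa for several raters. The text does not specify the modification that was used, so this should be read as the textbook formula under that assumption, not as the computation performed in the study; the function name `fleiss_kappa` and the toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings_per_subject):
    """Standard Fleiss' kappa for a fixed number of raters per subject.

    ratings_per_subject: list of lists; each inner list holds the
    categories assigned to one subject (lesion) by all raters.
    """
    n_subjects = len(ratings_per_subject)
    n_raters = len(ratings_per_subject[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(r) != n_raters for r in ratings_per_subject):
        raise ValueError("every subject needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_subject:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Proportion of rater pairs that agree on this subject.
        pairs_agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs_agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_subjects
    total_ratings = n_subjects * n_raters
    # Agreement expected by chance from the overall category frequencies.
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy example with 3 raters and 2 lesions (hypothetical ratings).
    print(fleiss_kappa([[2, 2, 4], [2, 4, 4]]))
```

Recomputing the index after mapping categories 2 and 3 to one label and 4 and 5 to another is one way to obtain a grouped value of the kind described above; how category 0 should be mapped in that case is not stated in the text.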
RESULTS
The average age of the 40 subjects was 50.7 years, ranging from 16 to 88 years. The lesion sizes ranged from 6 to 27.0 mm (mean diameter, 15.4 mm); 19 were benign, and 21 were malignant. The concordance analysis identified three cases of total agreement, 13 of partial agreement, and 24 cases of total disagreement (Table 1).
In the three cases of total agreement among all the reviewers, two fibroadenomas were classified as BI-RADS 3, and one carcinoma was classified as BI-RADS 4 (Table 1). In the 13 cases of partial agreement, 10 carcinomas were assessed by all of the reviewers as BI-RADS 4 or 5, with a recommendation for tissue sampling. In one of these cases, an observer classified a carcinoma as BI-RADS 0; this classification was considered to be partially concordant because this category demanded further studies. One additional carcinoma in the partial agreement group was incorrectly classified as benign, and two fibroadenomas were classified as BI-RADS 2 or 3 by 13 observers (with one observer choosing BI-RADS 0, which was considered to be partial agreement) (Table 1).
In the 24 cases of disagreement (Table 2), the histopathological analyses confirmed that 15 cases were benign and nine were malignant lesions. In five of the 15 benign cases, only one observer disagreed with the other 13. A single observer disagreed in three of the nine malignant cases. However, nine observers disagreed on a benign hematoma case (longest lesion axis = 18.2 mm), and eight observers disagreed on a carcinoma case (longest lesion axis = 26.9 mm). The proportions of discordant classifications were not significantly different by the experience time categories (11%, 12%, 15%; p = 0.62).
The kappa value for the original BI-RADS categories was 0.389 (fair agreement). This value was 0.612 when the categories were grouped as previously described, indicating substantial agreement.
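The interpretation scale given in the methods can be applied to these two values with a small lookup, sketched below. The function name `interpret_kappa` is hypothetical, and treating each band's upper bound as inclusive (so that values between the printed bands, such as 0.205, fall into the lower band) is an assumption.

```python
def interpret_kappa(k):
    """Map a kappa value to the agreement labels used in the methods."""
    if k < 0:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    for value in (0.389, 0.612):
        print(value, interpret_kappa(value))
```

With this scale, 0.389 maps to fair and 0.612 to substantial, matching the labels reported above.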
DISCUSSION
Most inter- and intra-observer BI-RADS concordance studies have examined mammography because BI-RADS has been used for mammography since 1993. Recent studies of interobserver agreement in BI-RADS ultrasound assessments have yielded kappa values ranging from 0.28 to 0.83, which indicates that the assessment of morphological lesion characteristics is subjective (6-9,11,15-24).
One limitation of this study is the small number of cases (40) compared to other studies, which have had 55 to 267 cases (6-9,11,15-25). These cases did not consist of a random sample from the relevant female population. However, the cases were not selected according to pathological characteristics; therefore, no direct selection bias was apparent.
No previously published study has used 14 observers, although one used 10 radiologists (with only 10 patients) (8). Additionally, most interobserver studies have used static image diagnosis (6-9,11,15-25). The exceptions include Berg et al. (8) and Bosch et al. (9), both of which were real-time analyses.
A second limitation of this study is the retrospective analysis of photographic records rather than real-time examination, which would better reflect the actual clinical situation. However, no images were rejected by the observers. Another possible limitation is that we did not examine the possible correlations between clinical information and mammographic findings. The importance of these correlations may be seen in Skaane et al. (21), who concluded that knowledge of previous mammography results is important for properly using BI-RADS in ultrasound. They measured kappa indices of 0.58 (range 0.52-0.66) for mammography, 0.48 (range 0.37-0.61) for ultrasound, and 0.71 (range 0.63-0.79) for both methods combined.
Berg et al. (8) measured a kappa value of 0.52 for the BI-RADS categorizations of 11 radiologists, which was comparable to the results of mammography agreement studies. After grouping BI-RADS categories 1, 2, and 4A together and categories 4B, 4C, and 5 together, Berg et al. (8) obtained a kappa value of 0.56. When the categories were dichotomized as BI-RADS 1, 2, 3 vs. 4A, 4B, 4C, 5, the kappa value was 0.48. These results differ from our current finding of an increase in the kappa value (from 0.389 to 0.612) after category grouping.
Using BI-RADS for mammography and ultrasound, Lazarus et al. (23) have identified a high concordance for highly suspicious lesions (k = 0.56, BI-RADS 5). Similar results were obtained in our study: 11 of the 16 (complete or partial) agreement cases were classified as BI-RADS 4 or 5.
Baker and Soo (3) analyzed 152 photographic records from 86 hospitals; in 23 cases (15.1% of the records), they noted a disagreement in interpretation. These disagreements were defined as classification differences that resulted in treatment changes, similar to the definition of disagreement used in this study. Their discrepancies included four false-negative cases, 14 false-positive cases, three cases that were described as cysts but were found to be solid masses in biopsies, and two cases of differences between the sonographic and mammographic findings. We identified 24 cases of disagreement in our study, of which 15 were benign and nine were malignant lesions.
Eight of the 14 observers in our study classified a case of medullary carcinoma as benign. This result is similar to one reported in Rahbar et al. (19); the observers agreed on the criteria leading to a benign classification, even in one case of medullary carcinoma. These misclassifications are understandable because this type of carcinoma is characterized by a partially circumscribed contour and a discrete posterior acoustic enhancement that can be confused with a complicated cyst.
Shimamoto et al. (15) evaluated 54 lesions (30 benign and 24 malignant) and reported accuracies ranging from 53.7% to 61.1% in the junior observers group and from 64.8% to 72.2% in the senior group. The authors suggested that agreement was more dependent on case difficulty than on observer experience. Although our study was not designed to evaluate the accuracy of the individual observers or to correlate that accuracy with variables such as experience and lesion size, the lack of significant differences among our experience categories suggests similar results. At this point, it is important to note that the BI-RADS lexicon has been used by the observers in our study since 2005, and this familiarity may explain why experience was not statistically significant. Perhaps experience would have been more of an influence if the study had included lesion detection.
Del Frate et al. (24) found that interobserver variation depended on the size of the lesion, with a better concordance (k = 0.71-0.83) for lesions >7 mm. In Abdullah et al. (25), five breast radiologists retrospectively evaluated 267 breast masses (113 benign and 154 malignant) using the BI-RADS lexicon. The reviewers had no access to any other patient data. The interobserver BI-RADS agreement was assessed with the Aickin revised kappa statistic and varied considerably (k = 0.30). This result is similar to the value (k = 0.28) reported by Lazarus et al. (23) and is slightly below those reported in this study; however, it is lower than the value (k = 0.53) reported by Lee et al. (11). This inconsistency is probably related to the subdivision of BI-RADS category 4 (i.e., 4A, 4B, 4C), which reduces the frequency of agreement. This consideration was also discussed by Lee et al. (11), who noted a low percentage of 4B responses (4.8% vs. 19.4% in Abdullah et al.) (25).
The most recent study, published by Lai et al. (26), used a methodology similar to ours. It evaluated 30 breast lesions that underwent surgical resection and used 12 observers with different amounts of experience in breast ultrasound with BI-RADS. For experienced observers, the kappa values of categories 3, 4 and 5 were 0.72, 0.28 and 0.60, respectively. The authors concluded that diagnostic agreement decreases as the breast imaging experience of the radiologist decreases. Our study found that experience is not directly related to agreement. This difference is perhaps explained by the most-experienced group of professionals in Lai et al. having more than three years of experience, while the least-experienced professionals in our study had less than five years.
Several studies have proposed using diagnostic methodologies based on image parameter estimation to improve the consistency of image interpretation. These techniques aim to quantify the morphological characteristics of tumors, such as shape and texture, and to use the results for differentiating between benignancy and malignancy. These complex procedures, including computer-aided diagnosis (CAD) systems that may reduce discrepancies between observers and thus improve ultrasound accuracy, continue to be investigated (27-30).
In practice, BI-RADS categorization is defined by a combination of the mammographic and sonographic features, but this generalization did not hold in our study. Although the sample used here included only 40 lesions, this study allowed us to identify critical issues that deserve attention and further inquiry. Our kappa value for the BI-RADS classification (0.389, fair) indicates the need for standardization. Our results also indicate the need for a more meticulous version of BI-RADS, the need for real-time quantitative lesion analysis to reduce observer variation and the need to improve the accuracy of ultrasound examinations.
AUTHOR CONTRIBUTIONS
The four authors form an interdisciplinary research group (all from the Federal University of Rio de Janeiro) investigating the clinical and engineering aspects of radiology and have worked together since 2006 (with some publications in congresses and in a peer-reviewed journal). The corresponding author (Maria Julia G. Calas, MD) is a PhD student, and the subject of this manuscript was taken from her thesis; therefore, she was responsible for most of the specific details described. Nevertheless, each of the four authors contributed to all aspects of the study, from design to manuscript preparation, and all are familiar with the contents. Calas MJG conceived and designed the study, prepared and edited the manuscript, and was also responsible for the study integrity guarantee, literature research, clinical and experimental studies and data analysis. Gutfilen B was responsible for the manuscript preparation, study integrity guarantee, experimental studies and data analysis. Pereira WCA conceived and designed the study, prepared and edited the manuscript, and was also responsible for the experimental studies and data analysis. Almeida RMVR prepared the manuscript, and was also responsible for the study integrity guarantee and statistical analysis.
REFERENCES
1. Berg WA, Blume JD, Cormack JB, Mendelson EB, Lehrer D, Böhm-Vélez M, et al. Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer. JAMA. 2008;299(18):2151-63, http://dx.doi.org/10.1001/jama.299.18.2151.
2. Karssemeijer N, Otten JDM, Verbeek ALM, Groenewoud JH, Koning HJ, Hendriks JHCL, et al. Computer-aided detection versus independent double reading of masses on mammograms. Radiology. 2003;227:192-200, http://dx.doi.org/10.1148/radiol.2271011962.
4. Chen SC, Cheung YC, Su CH, Chen MF, Hwang TL, Hsueh S. Analysis of sonographic features for the differentiation of benign and malignant breast tumors of different sizes. Ultrasound Obstet Gynecol. 2004;23:188-93, http://dx.doi.org/10.1002/uog.930.
5. Paulinelli RR, Freitas-Júnior R, Moreira MAR, Moraes VA, Bernardes Júnior JRM, Vidal CSR, et al. Risk of malignancy in solid breast nodules according to their sonographic features. J Ultrasound Med. 2005;24:635-41.
6. Arger PH, Sehgal CM, Conant EF, Zuckerman J, Rowling SE, Patton JA. Interreader variability and predictive value of US descriptions of solid breast masses: pilot study. Acad Radiol. 2001;8(4):335-42, http://dx.doi.org/10.1016/S1076-6332(03)80503-2.
7. Baker JA, Kornguth PJ, Soo MS, Walsh R, Mengoni P. Sonography of solid breast lesions: observer variability of lesion description and assessment. AJR. 1999;172(6):1621-25.
8. Berg WA, Blume JD, Cormack JB, Mendelson EB. Operator dependence of physician-performed whole-breast US: lesion detection and characterization. Radiology. 2006;241(2):355-65, http://dx.doi.org/10.1148/radiol.2412051710.
9. Bosch AM, Kessels AGH, Beets GL, Vranken KLCG, Borstlap AC, Von Meyenfeldt MF, et al. Interexamination variation of whole breast ultrasound. Br J Radiol. 2003;75:328-31, http://dx.doi.org/10.1259/bjr/17252624.
10. Heinig J, Witteler R, Schmitz R, Kiesel L, Steinhard J. Accuracy of classification of breast ultrasound findings based on criteria used for BI-RADS. Ultrasound Obstet Gynecol. 2008;32:573-58, http://dx.doi.org/10.1002/uog.5191.
11. Lee H-J, Kim E-K, Kim MJ, Hyun Youk JI, Young Lee JI, Kang DR, et al. Observer variability of Breast Imaging Reporting and Data System (BI-RADS) for breast ultrasound. Eur J Radiol. 2008;65(2):293-98, http://dx.doi.org/10.1016/j.ejrad.2007.04.008.
12. American College of Radiology. Illustrated Breast Imaging Reporting and Data System Atlas (BI-RADS®): Ultrasound. 4th ed. Reston, VA: American College of Radiology (ACR); 2003.
15. Shimamoto K, Sawaki A, Ikede M, Satake H, Naganawa S, Tadokoro M, et al. Interobserver agreement in sonographic diagnosis of breast tumors. Eur J Ultrasound. 1998;8:25-33, http://dx.doi.org/10.1016/S0929-8266(98)00047-0.
16. Calas MJG, Almeida RMVR, Gutfilen B, Pereira WCA. Intraobserver interpretation of breast ultrasonography following the BI-RADS classification. Eur J Radiol. 2010;74(3):525-28, http://dx.doi.org/10.1016/j.ejrad.2009.04.015.
17. Costantini M, Belli P, Ierardi C, Franceschini G, La Torre G, Bonomo L. Solid breast mass characterization: use of the sonographic BI-RADS classification. Radiol Med. 2007;112(6):877-94, http://dx.doi.org/10.1007/s11547-007-0189-6.
18. Hong AS, Rosen EL, Soo MS, Baker JA. BI-RADS for sonography: positive and negative predictive values of sonographic features. AJR. 2005;184:1260-65.
19. Rahbar G, Sie AC, Hansen GC, Prince JS, Melany ML, Reynolds HE, et al. Benign versus malignant solid breast masses: US differentiation. Radiology. 1999;213(3):889-94.
20. Stavros AT, Thickman D, Rapp CL, Dennis MA, Parker SH, Sisney GA. Solid breast nodules: use of sonography to distinguish between benign and malignant lesions. Radiology. 1995;196(1):123-34.
21. Skaane P, Engedal K, Skjennald A. Interobserver variation in the interpretation of breast imaging. Comparison of mammography, ultrasonography, and both combined in the interpretation of palpable noncalcified breast masses. Acta Radiol. 1997;38:497-502.
22. Skaane P, Engedal K. Analysis of sonographic features in the differentiation of fibroadenoma and invasive ductal carcinoma. AJR. 1998;170:109-14.
23. Lazarus E, Mainiero MB, Schepps B, Koelliker SL, Livingston LS. BI-RADS Lexicon for US and Mammography: interobserver variability and positive predictive value. Radiology. 2006;239(2):385-91, http://dx.doi.org/10.1148/radiol.2392042127.
24. Del Frate C, Bestagno A, Cerniato R, Soldano F, Isola M, Puglisi F, et al. Sonographic criteria for differentiation of benign and malignant solid breast lesions: size is of value. Radiol Med. 2006;111:783-96, http://dx.doi.org/10.1007/s11547-006-0072-x.
25. Abdullah N, Mesurolle B, El-Khoury M, Kao E. Breast Imaging Reporting and Data System Lexicon for US: interobserver agreement for assessment of breast masses. Radiology. 2009;252(3):665-72, http://dx.doi.org/10.1148/radiol.2523080670.
26. Lai XJ, Zhu QL, Jiang YX, Dai Q, Xia Y, Liu H, Zhang J, You SS, Wang HY. Inter-observer variability in Breast Imaging Reporting and Data System (BI-RADS) ultrasound final assessments. Eur J Radiol. 2011, http://dx.doi.org/10.1016/j.ejrad.2011.04.069.
27. Drukker K, Giger ML, Vyborny CJ, Mendelson EB. Computerized detection and classification of cancer on breast ultrasound. Acad Radiol. 2004;11:526-535, http://dx.doi.org/10.1016/S1076-6332(03)00723-2.
29. Alvarenga AV, Pereira WCA, Infantosi AFC, Azevedo CM. Complexity curve and grey level co-occurrence matrix in the texture evaluation of breast tumour on ultrasound images. Medical Physics. 2007;34(2):379-87, http://dx.doi.org/10.1118/1.2401039.
30. Alvarenga AV, Pereira WCA, Infantosi AFC, Azevedo CM. Assessing the performance of morphological parameters in distinguishing breast tumors on ultrasound images. Medical Engineering & Physics. 2010;32(1):49-56, http://dx.doi.org/10.1016/j.medengphy.2009.10.007.
No potential conflict of interest was reported
Tel.: +55 21 2562 8629