Interobserver Agreement of Gleason Score and Modified Gleason Score in Needle Biopsy and in Surgical Specimen of Prostate Cancer

Introduction: Gleason score, which has a high interobserver variability, is used to classify prostate cancer. The most recent consensus valued the tertiary Gleason pattern and recommended its use in the final score of needle biopsies (modified Gleason score). This pattern is considered to be of high prognostic value in surgical specimens. This study emphasized the evaluation of the modified score agreement in needle biopsies and in surgical specimen, as well as the interobserver variability of this score. Materials and Methods: Three pathologists evaluated the slides of needle biopsies and surgical specimens of 110 patients, reporting primary, secondary and tertiary Gleason patterns and after that, traditional and modified Gleason scores were calculated. Kappa test (K) assessed the interobserver agreement and the agreement between the traditional and modified scores of the biopsy and of the surgical specimen. Results: Interobserver agreement in the biopsy was K = 0.36 and K = 0.35, and in the surgical specimen it was K = 0.46 and K = 0.36, for the traditional and modified scores, respectively. The tertiary Gleason grade was found in 8%, 0% and 2% of the biopsies and in 8%, 0% and 13% of the surgical specimens, according to observers 1, 2 and 3, respectively. When evaluating the agreement of the traditional and modified Gleason scores in needle biopsy with both scores of the surgical specimen, a similar agreement was found through Kappa. Conclusion: Contrary to what was expected, the modified Gleason score was not superior in the agreement between the biopsy score and the specimen, or in interobserver reproducibility, in this study.


INTRODUCTION
Prostate cancer tends to be morphologically heterogeneous (1), showing several patterns of differentiation, which are classified by the Gleason system (2). Prostate needle biopsy provides random samples, which might not represent the neoplasia in all its heterogeneity, generally downgrading the tumor (3)(4)(5). On clinical follow-up, a worse prognosis was found in patients who had small proportions of Gleason pattern 4 and 5 tumor, which are not reflected in the Gleason score (6)(7)(8)(9)(10). From this observation, the concept of the modified Gleason score was created, incorporating these small, most aggressive patterns into the patient's score; it is used in some prognostic nomograms (11,12) (Figure-1).
Several studies deal with the interobserver agreement of the Gleason score, with widely differing results (13). Other studies deal with the agreement of the modified Gleason score in slides (14). Recently, Helpap reported a better association between needle biopsy and surgical specimen using the modified Gleason score (11). The Gleason histopathological classification shows a high level of subjectivity. Despite its undeniable clinical importance as a diagnostic method, the Gleason score, and more precisely the modified score, needs to be evaluated for reliability. Taking this into account, we evaluated the interobserver agreement and the association between needle biopsy and surgical specimen adopting the Gleason and modified Gleason scores.

MATERIALS AND METHODS
One hundred and ten patients with prostate cancer, without any previous treatment, who would be referred for radical prostatectomy agreed to participate in the research. They signed the consent form and sent their sextant needle biopsies, coming from different laboratories, to be reevaluated. Those biopsies had about two cores per sextant, with a mean total of 12 cores (range 6 to 24 cores). The surgical specimens were processed in the same laboratory, by partial sampling, producing about nine slides per surgical specimen (range 7 to 20), evaluating apex, distal third, mid third, proximal third, bladder neck, and right and left seminal vesicles. Thus, the surgical specimen was not processed as a whole. All material was stained with hematoxylin-eosin. All the available slides of the needle biopsies and of the surgical specimens, with or without cancer, were evaluated by the observers.
Three pathologists belonging to different services of Pathological Anatomy examined the slides of the needle and surgical samples of these patients. They were blinded to the clinical data and to the pairing between needle biopsy and surgical specimen. They filled in a protocol in which they classified the primary, secondary and most aggressive Gleason patterns of each examined area of both specimens. At the end of this task, the Gleason score (the sum of the primary and secondary patterns) was calculated for each sextant separately (12). The score of the specimen was the highest score found among the evaluated sextants; therefore, the global score was not calculated (4,5,7,15). From the most aggressive Gleason pattern, the tertiary pattern was determined whenever possible. The modified Gleason score was calculated as the sum of the primary and tertiary patterns (12). Similarly, the highest modified score of the examined slides was adopted as the modified Gleason score of the specimen. The primary Gleason pattern was defined as the most frequent Gleason pattern of the sample. The secondary Gleason pattern was the second most frequent pattern, necessarily occupying more than 5% of the tumor area (12). When the secondary pattern was less than 5%, the primary pattern was repeated. The tertiary Gleason pattern corresponded to the third Gleason pattern, necessarily more aggressive than the secondary pattern (12). To avoid confusion in terminology, the Gleason score was called the traditional Gleason score, clearly differentiating it from the modified Gleason score.
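The scoring rules described above can be illustrated with a minimal Python sketch. It is not the protocol software used in the study: the function names, the input format (a mapping from Gleason pattern to its fraction of tumor area) and the tie-breaking rule are assumptions made for illustration. The sketch also assumes that any remaining pattern more aggressive than the secondary one counts as tertiary, and that the modified score equals the traditional score when no tertiary pattern is present.

```python
from typing import Dict, Iterable, Tuple

SECONDARY_MIN_FRACTION = 0.05  # secondary pattern must exceed 5% of the tumor area


def gleason_scores(area_by_pattern: Dict[int, float]) -> Tuple[int, int]:
    """Return (traditional, modified) Gleason scores for one sample.

    area_by_pattern maps a Gleason pattern (1-5) to its fraction of tumor area.
    """
    present = {p: a for p, a in area_by_pattern.items() if a > 0}
    if not present:
        raise ValueError("no tumor pattern present")
    # Rank patterns by extent; ties go to the higher pattern (an assumption).
    ranked = sorted(present, key=lambda p: (present[p], p), reverse=True)
    primary = ranked[0]
    # If the second pattern does not exceed 5%, the primary pattern is repeated.
    secondary = primary
    if len(ranked) > 1 and present[ranked[1]] > SECONDARY_MIN_FRACTION:
        secondary = ranked[1]
    # Tertiary: most aggressive remaining pattern, only if worse than the secondary.
    others = [p for p in present if p not in (primary, secondary)]
    tertiary = max(others) if others and max(others) > secondary else None
    traditional = primary + secondary
    modified = primary + tertiary if tertiary is not None else traditional
    return traditional, modified


def highest_scores(samples: Iterable[Dict[int, float]]) -> Tuple[int, int]:
    """Specimen-level scores: the highest score found among the samples."""
    scores = [gleason_scores(s) for s in samples]
    return max(t for t, _ in scores), max(m for _, m in scores)
```

For a sample with 60% pattern 3, 38% pattern 4 and 2% pattern 5, the sketch gives a traditional score of 3 + 4 = 7 and a modified score of 3 + 5 = 8.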
Data were collected in a database and statistically evaluated with Stata version 9.1 (StataCorp, 4905 Lakeway Dr, College Station, USA). The Kappa (K) and weighted Kappa tests were used to evaluate the interobserver agreement and the agreement between the Gleason scores of the two specimens. The agreement was interpreted from the Kappa value by the intervals: K < 0, poor; K = 0-0.2, slight; K = 0.2-0.4, fair; K = 0.4-0.6, moderate; K = 0.6-0.8, substantial; and K = 0.8-1.0, almost perfect (13). For statistical inference, in general, a significance level of 5% was adopted and, consequently, a confidence level of 95% was used.
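As an illustration of the statistic, the following Python sketch computes Cohen's Kappa for two observers, with optional linear weights, and maps a value onto the interpretation intervals listed above. The analysis in this study was run in Stata; this code is an independent sketch, and the function names, the pairwise two-rater form and the handling of interval boundaries (upper bounds inclusive) are assumptions.

```python
from collections import Counter


def linear_weight(i, j, k):
    """Linear agreement weight between category indices i and j out of k."""
    return 1.0 - abs(i - j) / (k - 1)


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's Kappa for two raters; categories must be listed in rank order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k, n = len(categories), len(rater_a)
    if weighted and k < 2:
        raise ValueError("weighted Kappa needs at least two categories")
    index = {c: i for i, c in enumerate(categories)}

    def weight(i, j):
        if weighted:
            return linear_weight(i, j, k)
        return 1.0 if i == j else 0.0

    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)
    # Observed (weighted) agreement and agreement expected by chance.
    observed = sum(weight(i, j) * c for (i, j), c in joint.items()) / n
    expected = sum(weight(i, j) * marg_a[i] * marg_b[j]
                   for i in range(k) for j in range(k)) / n ** 2
    if expected == 1.0:
        raise ValueError("Kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


def interpret(kappa):
    """Label a Kappa value using the intervals adopted in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
                         (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

With four ordered categories, `linear_weight` gives about 0.67 for a disagreement of one category and about 0.33 for a disagreement of two categories.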
The three pathologists specialized at the same university, even though they now work in different hospitals and laboratories. Experience in surgical pathology and the weekly number of prostate examinations, the criteria adopted by Taille (13), allow us to classify observers 1 and 2 as experienced and observer 3 as less experienced.

RESULTS
Some slides, considered unsatisfactory, were excluded from the research.
In the biopsies, Gleason pattern 3 predominated as the primary pattern, ranging from 66% to 86%, and as the secondary pattern, ranging from 63% to 71% among the observers. Similarly, in the surgical specimens, Gleason pattern 3 was the most frequent primary pattern, found in 75% to 81%, and the most frequent secondary pattern, found in 60% to 69% of the observations. Thus, Gleason grade 3 clearly predominated as the primary and secondary grade in both specimens. Gleason score 6 was also predominant in the needle biopsy. In the surgical specimen there was a similar proportion of Gleason scores 6 and 7. Table-1 shows the distribution of traditional and modified Gleason scores. Interobserver agreement in needle biopsy as regards the primary Gleason grade was fair to moderate, according to Kappa. In the surgical specimen, the agreement was moderate to substantial. For the secondary Gleason pattern the observers diverged: the agreement was generally low, occasionally fair. For the most aggressive Gleason pattern, it was fair to moderate (Table-2).
Interobserver agreement of the traditional Gleason score in the needle samples was fair, with exact agreement in 60% to 68% and agreement within +/- 1 Gleason score unit in 91% to 98%. In the surgical specimens the agreement was fair to moderate, with exact agreement in 66% to 71% and agreement within one unit in 96% to 99%. The modified Gleason score presented similar agreement in both specimens, being fair to moderate. Exact agreement in the biopsy was 58% to 69% and agreement within +/- 1 Gleason score unit was 86% to 97%. In the surgical specimen the exact agreement was 60% to 64% and agreement within one unit ranged from 94% to 95%. By adopting weighted Kappa, values similar to Kappa (not weighted) were found (Table-3).
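The two percentage measures reported above, exact agreement and agreement within one Gleason score unit, can be sketched in a few lines of Python. The function name and the example values in the note below are hypothetical and are not taken from the study data.

```python
def agreement_rates(scores_a, scores_b, tolerance=1):
    """Return (exact, within_tolerance) agreement as fractions of paired cases."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    # Exact agreement: both observers assigned the same score.
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Tolerant agreement: scores differ by at most `tolerance` units.
    within = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b)) / n
    return exact, within
```

For two hypothetical observers scoring five cases as 6, 6, 7, 8, 7 and 6, 7, 7, 6, 7, exact agreement is 3 of 5 cases and agreement within one unit is 4 of 5 cases.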
The tertiary Gleason pattern was diagnosed in 8%, 0% and 2% of the biopsies and in 8%, 0% and 13% of the surgical specimens according to observers 1, 2 and 3, respectively. Thus, the traditional and modified Gleason scores, according to observer 1, were the same in 92% of both specimens. Observer 2 did not consider any pattern as tertiary, giving 100% concordance between the two Gleason scores. For observer 3, the two scores were the same in 98% of the needle biopsies and 87% of the surgical specimens.
The traditional and modified Gleason scores were used to evaluate the association between the scores of the two specimens for each observer. For observer 1, adopting the traditional score in both needle biopsy and surgical specimen, K = 0.24 was found. Adopting the modified score in the biopsy and the traditional one in the specimen, K = 0.21 was found. The same result was found when using the modified score in both the needle biopsy and the surgical specimen. Observer 2 did not find any difference in the association of scores between specimens (K = 0.26). When observer 3 used the traditional score in the needle biopsy and in the surgical specimen, Kappa was 0.18, and when using the modified score in the biopsy and the traditional one in the specimen, Kappa was 0.17. Adopting the traditional Gleason score in both specimens, less downgrading in needle biopsy was found than by adopting the modified score in both samples.
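Downgrading and overgrading of the biopsy relative to the surgical specimen can be counted as in the following Python sketch. The function name and the returned keys are hypothetical, and the example in the note below uses made-up scores rather than study data.

```python
def grading_shift(biopsy_scores, surgical_scores):
    """Count biopsies scored lower (down), equal (same) or higher (over)
    than the paired surgical specimen."""
    if len(biopsy_scores) != len(surgical_scores):
        raise ValueError("biopsy and surgical score lists must be paired")
    pairs = list(zip(biopsy_scores, surgical_scores))
    return {
        "down": sum(b < s for b, s in pairs),  # biopsy downgraded the tumor
        "same": sum(b == s for b, s in pairs),
        "over": sum(b > s for b, s in pairs),  # biopsy overgraded the tumor
    }
```

For hypothetical biopsy scores 6, 6, 7, 8 paired with surgical scores 7, 6, 7, 7, one biopsy is downgraded, two agree and one is overgraded.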

COMMENTS
The sample used reflects a group of patients referred for radical prostatectomy, in other words, young patients with localized disease and generally low Gleason scores. The three observers, also young, had similar academic and professional backgrounds and learned the Gleason system during medical residency in the same institution. Therefore, good agreement among them would be expected. Higher agreement on the primary Gleason pattern was found in the surgical specimen than in the needle biopsy. When smaller areas are observed, more attention would be expected to be devoted to each specific area, and higher agreement would be expected. On the other hand, since the surgical specimen offers a larger tissue extension, the suspected areas with borderline patterns could be better examined, resulting in higher agreement. The low agreement on the secondary pattern reflects the difficulties in diagnosing it, which, besides involving the identification of Gleason patterns, demands determination of tumor volume. As a rule, the secondary Gleason pattern is the one that occupies more than 5% of the tumor area and has a smaller extension than the primary pattern. Determining the tumor extension is not necessary for the diagnosis of the most aggressive Gleason pattern; recognition of the worst pattern is sufficient. Glaessgen found weak agreement as regards the diagnosis of the most aggressive patterns and considered that the difficulty in diagnosing them was greater than in determining their volume (14). Experience did not influence the agreement much, because agreement was not higher between the more experienced observers, which contradicts some authors (15,16).
Interobserver agreement of the traditional Gleason score was slightly higher in the surgical specimen than in the needle biopsy. By adopting the modified Gleason score, the agreement was similar in needle biopsy and surgical specimen. In general, adopting weighted Kappa, the agreement values were a little higher, but without altering the previous relations. It is interesting to notice that the modified Gleason score did not show any superiority over the traditional score, as Glaessgen reported (14). Evaluating the agreement in relation to the patterns, it is higher in the primary pattern and in the most aggressive one (which is closely related to the tertiary pattern) and very low in the secondary pattern. The modified score would therefore be expected to obtain higher agreement, but this did not happen. This might be due to the small number of tertiary patterns diagnosed, as a result of which the two scores were similar. However, this number is similar to the one found in Griffiths' study, where the diagnostic proportion of tertiary Gleason pattern was 6% for general pathologists and 9% for uropathologists, showing weak agreement in relation to the tertiary pattern (17). This pattern, in general, refers to patterns 4 or 5, which can present borderline structures, making the diagnosis more difficult (3). Generally, the studies regarding the tertiary pattern use it in prognostic evaluation, in surgical specimens (8-10). Mosse, when evaluating the prognosis of patients with tertiary pattern 5, found a worse prognosis in those with Gleason score 6 or 7 in the surgical specimens (8). It is known that, statistically, those scores are the most frequent ones.
Considering that prostate cancer is heterogeneous and multicentric (1), it is assumed that the biopsy, which samples a small portion of it, might not represent it efficiently (3)(4)(5). Traditionally it is believed that the Gleason score in needle biopsy tends to downgrade the surgical specimen, because a less differentiated pattern may not have been sampled in the biopsy (4,12). That was observed by observers 1 and 2. Taking the downgrading concept as a starting point, some authors suggest the use of the modified Gleason score, which would better reflect the real tumor characteristics because it values the small, most aggressive patterns (6). The International Society of Urological Pathology (ISUP) consensus on Gleason grading recommends the inclusion of the tertiary pattern (modified Gleason score) in needle biopsies (12). In surgical specimens, however, it is still recommended to mention the tertiary pattern, whenever it is present, without including it in the score (Gleason score) (12). Considering that the needle biopsy downgrades the score, it was expected that the modified Gleason score would have better agreement with the traditional Gleason score in the surgical specimen. However, this could not be demonstrated. The modified score in needle biopsy, compared to the traditional and modified scores in the specimen, presented Kappa values equal to or even slightly lower than those obtained when adopting the traditional score in the biopsy. A better representation of the specimen by the biopsy was not demonstrated when adopting the modified Gleason score. This, as previously mentioned, might be due to the low rate of diagnosis of the tertiary pattern. Helpap, on the contrary, evaluating slides of 368 patients, found an improvement in the exact agreement between the two specimens using the modified Gleason score instead of the traditional score, ranging from 58% to 78% (8). However, he neither used the Kappa test to evaluate the real agreement nor reported the diagnostic proportion of the tertiary pattern.

CONCLUSIONS
In this study, the modified Gleason score did not prove to be superior in reproducibility to the traditional Gleason score, either in the needle biopsy or in the surgical specimen. Contrary to what was expected, the use of the modified score in the biopsy was not superior to the traditional score in its agreement with the Gleason scores of the specimen. Within the aim of the study, the modified Gleason score was not superior to the traditional one. These conclusions might be due to the methodology used, as well as to the observers involved. Isolated morphological analysis is based on criteria of low reproducibility. It is necessary to reevaluate the association between the two Gleason scores, using different samples with a higher proportion of tertiary pattern.

EDITORIAL COMMENT
At a consensus conference organized in 2005 by the International Society of Urological Pathology (ISUP), the Gleason grading system underwent its first systematic revision (1). The purpose of the meeting was to standardize both the perception of histological patterns and how the grade information is compiled and reported. One of the decisions of the ISUP working group was that high-grade tumor of any quantity on needle biopsy should be included in the Gleason score. The ISUP recommendations contribute to a general upward shift of Gleason scores, and it may be necessary to repeat some previous studies on grading of prostate cancer. Helpap et al. recently compared conventional and modified Gleason grading in radical prostatectomy specimens and preoperative biopsies and reported on the distribution of modified Gleason score and its correlation with other prognostic factors such as age, stage and serum PSA (2)(3)(4). Few studies have been performed on interobserver reproducibility of this new variant of Gleason grading.
In a study by Glaessgen et al., the reproducibility of modified Gleason grading among four genitourinary pathologists was analyzed using a set of 69 consecutive radical prostatectomy specimens (5). Mean weighted kappa for conventional and modified Gleason score were 0.56 (range 0.52-0.66) and 0.58 (range 0.49-0.74), respectively. This study was carried out before the ISUP consensus meeting was held and only addressed the effect of inclusion of tertiary patterns of higher grade in the Gleason score. Hence, recent changes in pattern recognition were not taken into account. Furthermore, the ISUP recommendations to include tertiary higher patterns in the score pertained to needle biopsies, while the study by Glaessgen et al. was done on radical prostatectomy specimens only (5).
Veloso et al., in this paper, present a similar study on the reproducibility of a modified Gleason grading, now done on both needle biopsies and radical prostatectomy specimens.Again, only the effect of inclusion of tertiary higher patterns was studied.In needle biopsies a weighted kappa of 0.36 was reached both with conventional and modified Gleason grading.
In radical prostatectomy specimens, the weighted kappa was 0.46 and 0.36, respectively. This interobserver agreement was slightly lower than that of previous studies. For example, in a biopsy study on conventional Gleason score by Glaessgen et al., a weighted kappa of 0.48 to 0.55 (mean 0.51) was reached among 4 genitourinary pathologists using a consecutive series of needle biopsies from 69 men (279 glass slides) (6). Allsbrook et al. circulated 46 needle biopsies containing prostatic carcinoma among 10 genitourinary pathologists (6). The weighted kappa for Gleason score ranged from 0.56 to 0.70. However, the biopsies of this series were selected rather than consecutive, which may lead to a better reproducibility.
From studies performed so far, it seems that the interobserver reproducibility of the Gleason grading remains essentially the same with modified Gleason grading and results are probably more influenced by the study design.
Revision of a grading system may be necessary when we gain new knowledge of the biology of cancer. However, it must also be remembered that a revision has consequences in terms of modified prognostic impact of a certain grade and also warrants new studies to verify the value of the novel grading system (7). Whether modified Gleason grading of needle biopsies is superior as a predictor of prognosis remains to be seen.

EDITORIAL COMMENT
This paper by Dr Veloso et al. deals with interobserver agreement of Gleason score and modified Gleason score (1) in needle biopsy and in surgical specimen of prostate cancer. This group of authors found that the modified Gleason score was not superior in the agreement between the biopsy score and the specimen, or in interobserver reproducibility.
The Gleason grading system is a powerful tool to prognosticate and aid in the treatment of men with prostate cancer. The needle biopsy Gleason score correlates with virtually all other pathologic parameters, including tumor volume and margin status in radical prostatectomy specimens, serum PSA levels and many molecular markers. The Gleason score assigned to the tumor at radical prostatectomy is the most powerful predictor of progression following radical prostatectomy. However, there exist significant deficiencies in the practice of this grading system. Not only do problems exist among practicing pathologists, but there is also a relative lack of interobserver reproducibility among experts.

Correlation
There have been several studies addressing the correlation between Gleason scores in needle biopsies and corresponding radical prostatectomy specimens. Although earlier studies used the thicker (14-gauge) needle biopsies (2,3), more recent series are based on thin-core (18-gauge) needles used in conjunction with biopsy guns attached to transrectal ultrasound. Sextant or other modes of systematic sampling are typically performed in the more current series. In a recent compilation of data on 3,789 patients from 18 studies, exact correlation of Gleason scores was found in 43% of cases and correlation plus or minus one Gleason score unit in 77% of cases (4). Under-grading of carcinoma in needle biopsy is the most common problem, occurring in 42% of all reviewed cases. Importantly, over-grading of carcinoma in needle biopsies may also occur, but this was only found in 15% of cases. In general, adverse findings on needle biopsy accurately predict adverse findings in the radical prostatectomy specimen, whereas favorable findings on the needle biopsy do not necessarily predict favorable findings in the radical prostatectomy specimens, in large part due to sampling error.

Sampling error
Perhaps the most important factor is sampling error, which relates to the small amount of tissue removed by thin-core needle biopsies. The average 20-mm, 18-gauge core samples approximately 0.04% of the average gland volume (40 cc). The most common type of sampling error occurs when there is a higher grade component present within the radical prostatectomy specimen, which is not sampled on needle biopsy (5). This typically occurs when a needle biopsy tumor is graded as Gleason score 3 + 3 = 6. In the radical prostatectomy, there exists a Gleason pattern 4, which was not sampled on the biopsy, resulting in a prostatectomy Gleason 3 + 4 = 7.
In some instances, under-grading results from an attempt to grade very tiny areas of carcinoma, so-called minimal or limited adenocarcinoma (6). Scores of minimal adenocarcinoma in needle biopsies show a reasonably strong correlation with radical prostatectomy scores, but the Gleason scores do not have the same power to predict extra-prostatic extension and positive margin status as they do in non-minimal carcinomas (6).
Over-grading can result from sampling error in cases where the high-grade pattern is selectively represented in needle biopsy. It may only represent a very minor element in the radical prostatectomy specimen. Even the same cancer focus may have different grades depending on the area sampled.
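The order of magnitude of the sampled fraction can be checked by treating the core as a cylinder, as in the Python sketch below. The core length and gland volume come from the text; the core diameter of 1 mm is an assumption chosen only for illustration, since the text specifies the needle gauge but not the tissue core width. The function name is hypothetical.

```python
import math


def sampled_fraction(core_length_mm, core_diameter_mm, gland_volume_cc):
    """Fraction of the gland volume contained in one cylindrical core."""
    radius_mm = core_diameter_mm / 2.0
    core_volume_mm3 = math.pi * radius_mm ** 2 * core_length_mm
    gland_volume_mm3 = gland_volume_cc * 1000.0  # 1 cc = 1000 cubic millimeters
    return core_volume_mm3 / gland_volume_mm3


# 20-mm core, assumed 1-mm core diameter, 40-cc gland
fraction = sampled_fraction(20.0, 1.0, 40.0)
print(f"{100 * fraction:.2f}% of the gland per core")
```

Under this assumed diameter, a single core contains about 16 cubic millimeters of tissue, or roughly 0.04% of a 40-cc gland.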

Borderline cases
The other source of discrepancy between biopsy and radical prostatectomy is borderline cases. In the description of the Gleason grading system, there are some cases that are right at the interface between two different patterns, where there will be interobserver variability and possibly even intra-observer variability (7).

Pathology error
Pathology error is most frequently seen when pathologists assign a Gleason score of ≤ 4 on a needle biopsy which in fact was Gleason score 5-6. Many pathologists under-grade needle biopsies by confusing quantitative changes with qualitative changes. When there is a limited focus of small glands of cancer on needle biopsy, by definition this is a Gleason pattern 3. Gleason pattern 3 consists of small glands with an infiltrative pattern. Biopsying truly low-grade adenocarcinoma of the prostate would not result in just a few neoplastic glands but rather would be more extensive, as low-grade adenocarcinoma grows as nodules of closely packed glands rather than infiltrating in and amongst normal glands.
Under-grading may result from difficulty in recognizing an infiltrative growth pattern or failing to recognize the presence of small areas of gland fusion (7).

Pathologists' education and experience
The pathologists' experience in grading thin-core needle biopsies can also influence overall correlation with radical prostatectomy results. With experience, pathologists recognize grading pitfalls; in particular, the fact that Gleason scores of 4 and lower are almost non-existent in the needle biopsy setting. Furthermore, small areas of fusion in the presence of a predominantly grade 3 background are recognized and will yield a Gleason score of 7, which often correlates well with radical prostatectomy results (8).

Intra-observer and interobserver variability
Reproducibility studies can be categorized as intra-observer and interobserver. For investigations of intra-observer agreement of Gleason grades, exact agreement was reported in 43% to 78% of cases (8,9), and agreement within plus or minus one Gleason score unit was reported in 72% to 87% of cases. Gleason wrote that he duplicated exactly his previous histologic scores approximately 50% of the time. Highly variable levels of interobserver agreement on Gleason scores have also been reported, with a range of 36% to 81% for exact agreement and 69% to 86% of observers within plus or minus one Gleason score unit. Improvements in Gleason grading reproducibility can be achieved by recognizing problematic areas and educating physicians via meetings, courses, website tutorials, and publications that specifically focus on the Gleason grading system (10).
score by adding the most common and highest grade patterns. In the above example, this would result in a Gleason score of 3 + 5 = 8. This study by Veloso et al. found that the interobserver reproducibility for the modified biopsy Gleason score was not superior to the routine Gleason score and was also not more accurate in predicting radical prostatectomy Gleason score. The major limitation of their study, as they acknowledge, is the limited number of cases with a tertiary pattern on biopsy: 0%, 2% and 8% for the three observers, out of a total of 110 patients. With such small numbers, it would be impossible to show any differences between the routine and modified Gleason score. In a recent paper on 2,370 men with prostate cancer, Patel et al. also found that Gleason score 7 with tertiary pattern 5 was uncommon, occurring in 1.5% of cases (1). However, they documented that Gleason score 7 tumor on biopsy with tertiary pattern 5 has the same prognosis as Gleason score 8 tumor when treated by radiotherapy or radical prostatectomy. These findings are in concert with several studies that have documented the same adverse prognostic significance of tertiary pattern 5 in radical prostatectomy specimens. The growing body of evidence suggests that Gleason score 3 + 4 with tertiary pattern 5, whether on biopsy or radical prostatectomy, should be considered as Gleason score 8.

G = percentage of traditional Gleason score; Mod G = percentage of modified Gleason score.

Table 2 - Interobserver agreement as regards primary, secondary and the most aggressive Gleason grade in needle biopsies and surgical specimens, considering each patient as an independent event.
G1 = primary Gleason; G2 = secondary Gleason; most aggressive G = the most aggressive Gleason grade; weighted Kappa = with linear weight (disagreement by 1 category = 0.67 and disagreement by 2 categories = 0.33). Confidence level of 95% was used.

Table 3 - Interobserver agreement as regards Gleason score (traditional) and modified Gleason score in needle biopsies and surgical specimens, considering each patient as an independent event.

Table 4 - Agreement between traditional and modified Gleason scores in needle biopsies and surgical specimens according to observers, considering each patient as an independent event. Number of needle biopsies whose Gleason score was downgraded or upgraded in relation to the surgical specimen.
Biopsy G = Gleason score in the needle biopsy; Surgical G = Gleason score in the surgical specimen; Biopsy Gmod = modified Gleason score in the needle biopsy; Surgical Gmod = modified Gleason score in the surgical specimen; Down = downgrading of the needle biopsy; Over = overgrading of the biopsy; Confidence level of 95% was used.