Relationships between Bloom’s taxonomy, judges’ estimation of item difficulty and psychometric properties of items from a progress test: a prospective observational study

ABSTRACT BACKGROUND: Progress tests are longitudinal assessments of students’ knowledge based on successive tests. Calibration of the test difficulty is challenging, especially because of the tendency of item-writers to overestimate students’ performance. The relationships between the levels of Bloom’s taxonomy, the ability of test judges to predict the difficulty of test items and the real psychometric properties of test items have been insufficiently studied. OBJECTIVE: To investigate the psychometric properties of items according to their classification in Bloom’s taxonomy and judges’ estimates, through an adaptation of the Angoff method. DESIGN AND SETTING: Prospective observational study using secondary data from students’ performance in a progress test applied to ten medical schools, mainly in the state of São Paulo, Brazil. METHODS: We compared the expected and real difficulty of items used in a progress test. The items were classified according to Bloom’s taxonomy. Psychometric properties were assessed based on their taxonomy and fields of knowledge. RESULTS: There was a 54% match between the panel of experts’ expectations and the real difficulty of items. Items that were expected to be easy had mean difficulty that was significantly lower than that of items that were expected to be medium (P < 0.05) or difficult (P < 0.01). Items with high-level taxonomy had higher discrimination indices than low-level items (P = 0.026). We did not find any significant differences between the fields in terms of difficulty and discrimination. CONCLUSIONS: Our study demonstrated that items with high-level taxonomy performed better in discrimination indices and that a panel of experts may develop coherent reasoning regarding the difficulty of items.


INTRODUCTION
Assembling a knowledge test can be a challenging task, especially with regard to calibrating the difficulty of the test. Although many studies have addressed how useful experts' opinions can be, their predictions of item difficulty often differ from the difficulty that students perceive.
This uncertainty relates to the multiple factors involved in the cognitive process that is necessary for answering a question and to the tendency of item-writers to overestimate students' performance. 1,2 Questions can require lower or higher levels of cognitive processing, depending on whether students have to recall, minimally understand or apply their knowledge.
Although studies have investigated experts' predictions and the requirements for cognitively processing the items, little attention has been paid to the combination of these two factors.
Knowing whether there are relationships between the type of cognitive processing that the item requires, experts' predictions and the difficulty of the items may help experts to predict the difficulty of knowledge tests better.
Bloom's taxonomy of educational objectives was designed to classify the learning objectives, skills and abilities that are expected from learners at the end of an educational program. 3,4 Educational objectives may range from memorization of knowledge to creation of new knowledge in an increasingly complex and hierarchical fashion. 3,5 Within this framework, cognitive processing is represented as a cumulative hierarchy that is made up of lower and higher levels of acquired knowledge. There are two low levels, which relate to remembering and minimally understanding the knowledge. There are two intermediate levels (third and fourth levels), which relate to applying the knowledge to a new situation and making connections between ideas (analyses). There are two high levels, which relate to justifying decisions (evaluations) and creation of new knowledge. In theory, mastery of lower levels is required in order to attain higher levels.
Questions that assess higher levels of complexity of knowledge are difficult to produce, and there is a debate regarding whether multiple-choice questions have the capacity to assess higher levels of complexity, i.e. situations of creation of new knowledge. 6 More importantly, higher-order cognitive processing has been shown to improve students' knowledge retention, compared with low-order cognitive processing. Additionally, medical practice requires the use of higher-order cognitive processing more than lower-order processing. Although there is a trend within medicine towards assessing students at higher levels of cognitive processing, little attention has been paid to Bloom's taxonomy when setting pass/fail scores.
Setting pass/fail scores is a major concern in educational assessment. 7,8 There are two main categories of procedures for setting standards: norm-referenced (relative) and criterion-referenced (absolute). Relative methods take the results from the test into account to set the standards. They help rank the examinees but may lead to a large variation in the cutoff scores and are poorly accepted in some cultures. Absolute methods are widely used worldwide, but they face several criticisms because they lead to large variation in failure rates and do not account for differences in difficulty between exams. 9,10 One example of a criterion-referenced method that is often used within medical education is the Angoff method. In this method, the judges of the examination estimate the percentage of borderline examinees who will respond correctly to the test items. The judges' estimates are then averaged for each item, and the cutoff is set as the sum of the averages. 11
Progress tests have been used in Brazilian schools for more than fifteen years. 12-14 They have been gaining greater attention over the last five years because of the Brazilian Association of Medical Education's efforts to improve the quality of medical students' evaluations throughout the country. 15 Therefore, progress tests provide a good opportunity for studying the psychometric properties of assessment items.
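The Angoff calculation described in this section (average the judges' estimates for each item, then sum the averages) can be illustrated with a minimal sketch. The function name `angoff_cutoff` and the example estimates are hypothetical and are not taken from the study; estimates are expressed as proportions between 0 and 1.

```python
from statistics import mean


def angoff_cutoff(estimates_by_item):
    """Return an Angoff cutoff score.

    estimates_by_item holds one list per item; each list contains every
    judge's estimated proportion (0 to 1) of borderline examinees who
    would answer that item correctly.
    """
    # Average the judges' estimates per item, then sum the item averages.
    return sum(mean(item_estimates) for item_estimates in estimates_by_item)


# Hypothetical estimates from three judges for four items (not study data).
example_estimates = [
    [0.50, 0.75, 0.25],
    [0.50, 0.50, 0.50],
    [1.00, 0.75, 0.50],
    [0.25, 0.25, 0.25],
]
cutoff = angoff_cutoff(example_estimates)
print(f"Cutoff: {cutoff:.2f} correct answers out of {len(example_estimates)}")
```

Under these assumed estimates, the cutoff is expressed in number of items, so dividing it by the number of items gives the pass mark as a proportion.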

OBJECTIVE
Although some studies have analyzed the application of Bloom's taxonomy to test items 16 and the utility of Angoff methods for standard setting, 17,18 the relationship between these two has not been extensively examined. In the current study, we investigated the relationships between the exam judges' estimates (through an adaptation of the Angoff method) and the classification of the difficulty and discrimination levels of items, using Bloom's taxonomy in a progress test setting.

Study design
For this prospective observational study, data from the 2018 progress test from a consortium of ten Brazilian medical schools, mainly in the state of São Paulo, Brazil, were examined. Our examination of the progress test was designed to assess the knowledge that final-year medical students should have, in order to provide feedback to medical students and institutions. 15 All the students at these ten schools underwent the same test once a year, on the same day, at the same time.
The students had four hours to complete the test, and after two hours had elapsed, they could use the question booklet of the test for self-study purposes. Written feedback with commentary and bibliographic references for each item was provided a few days after the test.
A blueprint for the progress test was developed by the consortium, consisting of six fields of knowledge: basic science, internal medicine, pediatrics, surgery, obstetrics and gynecology, and public health. Every year, the coordinators of the progress test create a set of orders for items that address the blueprint.
Each school is represented at the meetings by an academic staff member. This representative is responsible for the exchange of information between their school and the others, as well as for delivering the orders to their colleagues, who will be responsible for writing the required items. A single order from the coordinators might therefore consist of up to ten written items.
Afterwards, several specialists from the consortium schools hold a meeting to select the items that will make up the final exam: 20 items for each field, thus totaling 120 multiple-choice items, each presenting four alternative responses. Any unused items are stored in a database.

Bloom's taxonomy classification of the items
The items were classified in accordance with the levels of cognitive domains that were proposed by Bloom, as revised by Anderson and Krathwohl. 5 Here, items focusing on remem-

Angoff adaptation
In this study, the panel of experts was asked to set the expected difficulty for each item selected. The difficulty would be estimated by considering the performance of a sixth-year medical student. In the original use of the Angoff method, the expected percentage of correct answers among the examined population was ascertained. 19 Here, we asked the experts to classify the items as follows: difficult (expectation that more than 80% of the answers would be incorrect), medium (expectation that 40% to 80% would be incorrect), and easy (expectation that less than 40% would be incorrect). The expected level of difficulty of the items was developed based on an agreement that was reached after a discussion among the judges.

Statistical analysis
A specialized institution marked the tests and performed psychometric analysis on the items by focusing on their difficulty, the discrimination index and biserial correlation. This last aspect will not be discussed further in the present study. For the purpose of the present study, test responses that appeared to consist of guessing or constant answers (i.e. proportion of correct answers < 25%) were excluded from the analysis. We only used the data from the sixth-year students at the ten medical schools.
As described above, items with a difficulty index greater than 0.8 were considered difficult, items with indices lower than 0.4 were considered easy and items with indices between 0.4 and 0.8 were considered medium.
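The classification above can be sketched in a few lines. This sketch assumes that the difficulty index is the proportion of incorrect answers to an item, which is how the thresholds in this section and in the Angoff adaptation read; the function names and the example responses are hypothetical.

```python
def difficulty_index(responses):
    """Proportion of incorrect answers for one item.

    responses is a list of booleans, True meaning a correct answer.
    Assumption: a higher index means a more difficult item.
    """
    if not responses:
        raise ValueError("at least one response is required")
    incorrect = sum(1 for correct in responses if not correct)
    return incorrect / len(responses)


def difficulty_category(index):
    """Apply the cutoffs stated in the text: > 0.8 difficult, < 0.4 easy."""
    if index > 0.8:
        return "difficult"
    if index < 0.4:
        return "easy"
    return "medium"


# Hypothetical item answered by ten students, three of them incorrectly.
responses = [True] * 7 + [False] * 3
index = difficulty_index(responses)
category = difficulty_category(index)
print(index, category)
```

Items falling exactly on 0.4 or 0.8 are treated as medium here, following the wording "between 0.4 and 0.8".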
The normality of the data was tested using the Shapiro-Wilk test. The differences in mean values were tested using single-factor analysis of variance (ANOVA) followed by the Tukey post-test for the parametric data; or using the Kruskal-Wallis test followed by the Dunn test for the nonparametric data. Correlations between the different data were assessed using the Spearman correlation test.
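To make two of the statistics named above concrete, the following sketch computes the F statistic of a single-factor ANOVA and Spearman's rho using only the Python standard library. It is an illustration, not the procedure run in SPSS or BioEstat: it does not compute P-values, normality tests or post-tests, and all function names and data are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic of a single-factor ANOVA for a list of groups of values."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    group_means = [mean(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, group_means) for value in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x, mean_y = mean(rank_x), mean(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / (spread_x * spread_y) ** 0.5


# Hypothetical difficulty indices grouped by expected category (not study data).
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# Hypothetical taxonomy levels paired with difficulty indices.
rho = spearman_rho([1, 2, 3, 4], [0.10, 0.20, 0.30, 0.45])
print(f_value, rho)
```

Ranking with averaged ties matters here because taxonomy levels take only a few distinct values, so many items would share the same rank.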
We set the statistical significance level at a P-value of 0.05. 20 The statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS), version 24.0, and the BioEstat software, version 5.0.

Ethical considerations
Since we dealt with secondary data and no student was identified, ethics committee approval was not necessary.

RESULTS
A total of 4,596 students participated in the test (94.1% of the total population), of whom 4,563 were included in the general psychometric analysis. Of these, 771 students were in their sixth year (Table 1). One item relating to obstetrics and gynecology was invalidated due to inconsistent answers, and therefore, 119 items were analyzed.
More than 50% of the items relating to internal medicine, pediatrics, surgery and obstetrics and gynecology were classified as presenting high-level taxonomy, whereas most of the items relating to basic sciences and public health were classified as presenting low-level taxonomy. The distribution of the items was significantly different between the fields (P < 0.001), such that public health presented higher frequency of items with low-level taxonomy, compared with internal medicine, pediatrics, surgery and obstetrics and gynecology. In addition, the distribution of items was statistically different between pediatrics and basic sciences (Figure 1). Table 2 presents the distribution of items according to their taxonomy among the fields of knowledge.

Item difficulty
The panel of experts judged 62 items as easy, 41 as medium and 16 as difficult. Based on the analysis of the real difficulty of the items, 79 items were easy, 82 were medium and only one item was difficult (Figure 2). For 65 items (54%), the expected difficulty was the same as the real difficulty; 13 items (11%) were underestimated (i.e. they were more difficult than expected); and 41 items (34%) were overestimated (i.e. they were easier than expected). The rates of concordance between expected and real difficulty were 60% for basic sciences, pediatrics and public health; 50% for internal medicine and surgery; and 47% for obstetrics and gynecology. Analysis of the real difficulty of the items according to the levels of difficulty set by the panel of experts showed mean difficulties of 0.28, 0.37 and 0.49 for items considered easy, medium and difficult, respectively. These differences were statistically significant (F = 8.604; P < 0.01): the items that were considered easy presented mean difficulty significantly lower than that of the items considered medium (P < 0.05) and the items considered difficult (P < 0.01).
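The concordance figures reported above can be reproduced from the counts in the text. The sketch below labels an item by comparing its expected and real categories, following the definitions of underestimated and overestimated given in this section, and then converts the reported counts (65, 13 and 41 of 119 items) into percentages. The names `CATEGORY_ORDER` and `compare_expectation` are hypothetical.

```python
CATEGORY_ORDER = {"easy": 0, "medium": 1, "difficult": 2}


def compare_expectation(expected, real):
    """Label one item by comparing the judges' expectation with reality."""
    if CATEGORY_ORDER[expected] == CATEGORY_ORDER[real]:
        return "match"
    if CATEGORY_ORDER[real] > CATEGORY_ORDER[expected]:
        # The item was more difficult than expected.
        return "underestimated"
    # The item was easier than expected.
    return "overestimated"


# Counts as reported in the text for the 119 analyzed items.
counts = {"match": 65, "underestimated": 13, "overestimated": 41}
total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} items ({share:.1f}%)")
```

Computed to one decimal place, the shares are about 54.6%, 10.9% and 34.5%, which correspond to the 54%, 11% and 34% given in the text.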
Obstetrics and gynecology and basic sciences were the categories with the highest mean difficulty, followed by internal medicine, surgery, pediatrics and public health (Table 2). We did not find any significant differences between the fields of knowledge (F = 0.323; P = 0.898), although there was a trend towards public health items being easier.
The mean difficulties of the items classified as having low, intermediate and high-level taxonomies were 0.29, 0.34 and 0.36, respectively. We did not find any significant differences between the levels of taxonomy regarding difficulty (F = 0.993; P = 0.374), and we did not find any correlation between the taxonomy of the items and their difficulty (rho = 0.172; P = 0.06).

Item discrimination
The mean discrimination indices were 0.38 for obstetrics and gynecology; 0.32 for pediatrics, surgery and internal medicine; 0.31 for public health; and 0.27 for basic sciences (Table 2).
Although obstetrics and gynecology demonstrated a trend towards greater discrimination, we did not find any significant differences between the fields of knowledge (H = 8.734; P = 0.12).
Comparison of discrimination between the items accord-

DISCUSSION
This study sought to use progress tests to investigate the relationships between the difficulties and discrimination and the judges' estimates of exam items, through an adaptation of the Angoff method; and to classify them using Bloom's taxonomy. Items with higher-level taxonomy had higher discrimination indices than those with lower-level taxonomy. We also found that items that were expected to be easy were indeed easier than items that were expected to be difficult.
At the end of medical school, students are expected to demonstrate high-order cognitive processes. For example, students in the initial years of training perform better in questions with lower-level taxonomy, whereas students in their final years perform better in relation to items with higher-level taxonomy. 21 In our test, items with higher-level taxonomy predominated, which was expected because the test was designed to include vignette-based items. In addition, items with higher-level taxonomy had better discrimination indices than items with lower-level taxonomy. These data emphasize the need to include items with high-level taxonomy when developing tests, in order to achieve better discrimination. In this regard, case-based questions might be more suitable for higher-order cognitive processing 12 and consequently might be more appropriate for tests that are designed to assess the knowledge of final-year students.
[Figure note: Public health was significantly different from internal medicine, pediatrics, surgery and obstetrics and gynecology. Pediatrics was also significantly different from basic sciences. Overall, P < 0.0001.]
Interestingly, in the field of public health, the indices of discrimination and difficulty tended to be lower. This can possibly be explained in terms of the predominance of lower-order cognitive processes that are involved in the items from this subject.
These findings may relate to the characteristics of this field: students are required to have sufficient knowledge of legislation and conceptual frameworks.
Although the test was easier than estimated by the judges, the panel of experts demonstrated coherent reasoning in classifying the difficulty of the items. In addition, the group analysis indicated that the items that were expected to be easy presented lower mean difficulty indices than the items that were expected to be medium or difficult; while items that were expected to be medium had lower mean difficulty indices than items that were expected to be difficult (although these differences were not statistically significant).
Similarly, Kibble and Johnson found coherence between the intended and actual difficulty of the items, with a successful estimation rate of 48%. 26 Conversely, they did not find any correlation between the taxonomy of the test items and their difficulty and discrimination indices. These authors placed doubt on the useful-
Currently, item response theory is used to compose exams using previously tested items. 28,29 Despite the advantages of this method, it has limited usefulness with regard to newly written items. Our data suggest that classification of items using Bloom's taxonomy (which can be performed prior to application of the exam) can help select the items with better discrimination performance. Lastly, future research could provide correction formulas based on the judges' expectations, in order to better predict the real difficulty of the items.

CONCLUSION
In conclusion, the items with higher-level taxonomy provided better discrimination of the students' performance; and the panel of experts demonstrated that they coherently deduced the difficulty of the exam items.