Interobserver Reliability of the International Classification of Primary Care

OBJECTIVE: The International Classification of Primary Care was developed as an attempt to overcome the limitations of the International Statistical Classification of Diseases and Related Health Problems, 10th revision, when used for primary health care. The aim of the study was to evaluate the interobserver reliability of the International Classification of Primary Care when coding reasons for health-related interruption of daily activities.
METHODS: Data analyzed pertained to 801 subjects from Phase 2 of the Pró-Saúde Study, involving the employees of a Rio de Janeiro university who reported having been prevented from carrying out any of their usual activities (work, study, or leisure) for health-related reasons in the two weeks prior to data collection. Health problems reported in response to an open question were separately coded by two classifiers. Interobserver reliability with respect to number of health problems was calculated by weighted kappa; for the remaining analyses (chapters and full codes), crude kappa coefficients were used.
RESULTS: A total of 1,641 health problems were coded by the first classifier, and 1,629 by the second. Interobserver reliability with respect to the number of health problems coded was substantial (weighted kappa=0.
CONCLUSIONS: The results suggest that the International Classification of Primary Care is adequate for the coding of health-related reasons for interruption of daily activities.


INTRODUCTION
Self-reported morbidity surveys are a common tool in health evaluations, especially because they allow access to the morbidity profile of the general population. The population that seeks and obtains health care, usually the object of outpatient and hospital-based studies, may show a profile distinct from that of the general population. 6 One of the issues to be considered when conducting a morbidity survey is the way in which data are collected and coded. Collection can be based either on lists of frequent health problems and symptoms or on open questions. Studies using an open question without a list of problems or symptoms require a coding strategy. The 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10) is the classification system internationally adopted for coding diagnoses and elaborating health statistics.a The International Classification of Primary Care (ICPC) 15 is an attempt to complement ICD in the context of primary care, and is characterized by the inclusion of patients' complaints and social problems. 3
In order to ensure comparability between its codes and those of ICD-10, ICPC underwent a revision that became known as ICPC-2. This new version has a biaxial structure: the first axis is divided into chapters that comprise the organic, psychological, and social systems to which the report refers (general and nonspecific, blood, digestive, eye, ear, among others); the second axis presents components referring to the type of report (signs and symptoms, procedures, diagnoses, and diseases). The code for a reason for a medical appointment is composed of one letter, which represents the chapter, and two digits, which represent components. Because it was created for primary health care, and because it considers the patient's discourse as it is enunciated, one can expect that ICPC will perform well with open questions contained in health surveys; however, this classification system is recent, and has not been substantially explored in Brazil. In a review of the literature,b a single study evaluating the use of this classification in Brazil 7 was identified. Moreover, only one other study used ICPC-2 to code self-reported morbidity in questionnaires, but this study did not provide an evaluation of reliability. 14
The aim of the present study was to evaluate the interobserver reliability of independent ICPC-2 coding of answers to an open question regarding the reason for interruption of usual activities.
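As an illustration of the code structure just described (one chapter letter followed by two digits), the following Python sketch splits a code into its two parts and compares two codes at either level. The names `IcpcCode`, `parse_icpc`, `same_chapter`, and `same_full_code` are hypothetical, and the pattern accepts any letter rather than checking it against the actual list of ICPC-2 chapters.

```python
import re
from typing import NamedTuple


class IcpcCode(NamedTuple):
    chapter: str  # one letter, e.g. "P"
    number: str   # two digits, e.g. "01"


# Assumption: any single letter is accepted as a chapter; a real validator
# would check the letter against the published ICPC-2 chapter list.
_PATTERN = re.compile(r"([A-Z])(\d{2})")


def parse_icpc(code: str) -> IcpcCode:
    """Split a code such as 'P01' into chapter 'P' and number '01'."""
    match = _PATTERN.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"expected one letter plus two digits, got {code!r}")
    return IcpcCode(match.group(1), match.group(2))


def same_chapter(a: str, b: str) -> bool:
    """Agreement at chapter level: only the letter has to match."""
    return parse_icpc(a).chapter == parse_icpc(b).chapter


def same_full_code(a: str, b: str) -> bool:
    """Agreement at full-code level: letter and digits have to match."""
    return parse_icpc(a) == parse_icpc(b)
```

With these helpers, `same_chapter("P01", "P02")` is true while `same_full_code("P01", "P02")` is false, which mirrors the distinction between chapter-level and full-code agreement.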

METHODS
The data analyzed in the present study are part of a larger project, the Pró-Saúde Study, which has been described in a prior publication (Faerstein et al 2). That study involves a cohort of technical-administrative employees from a university in Rio de Janeiro state, and is aimed at investigating the association between social determinants and a variety of health-related outcomes. So far, data collection for the Pró-Saúde Study has been carried out in three phases, in 1999, 2001, and 2006/07. In the present study we evaluated employees who: a) responded to the self-administered questionnaire in Phase 2 (2001); b) provided a valid response to the question on interruption of usual activities due to health reasons; and c) reported having been prevented from carrying out any of their daily activities due to health reasons in the two weeks prior to the interview. Of the total 812 subjects who reported interruptions of their usual activities for health reasons, 11 did not provide the reason for interruption, and were thus excluded from the analysis. Therefore, the population of the present study comprised 801 employees.
a Jamoulle M, Roland M. The WONCA Classification Committee, 1972-1997, 25
The question used to evaluate the interruption of usual activities due to health reasons was the following: "In the last two weeks, which health problem or problems did you have that prevented you from carrying out any of your usual activities (for example, working, studying, leisure activities, or house chores)?" ["Nas duas últimas semanas, qual foi ou quais foram esses problemas de saúde que você teve ou tem que o(a) impediram de realizar alguma dessas suas atividades habituais (por exemplo, trabalho, estudo, lazer ou tarefas domésticas)?"]
Coding was carried out by two classifiers: one specialized in disease classification, especially ICD-10, but without prior experience with ICPC-2; and another without any prior classifying experience. Before beginning the actual classification process, both classifiers went through a training program consisting of reading the classification followed by discussion sessions on the logic of the system. Later, as a test, the two classifiers coded the reported reasons for 29 medical appointments at a primary care unit, as well as 59 questionnaires from the Pró-Saúde Study that were not included in the main sample. A meeting was then held in order to arrive at a consensus coding, with the participation of a third classifier. This meeting also served to establish directives for coding data pertaining to this study population.b
Since any given complaint could be interpreted using more than one code, the number of reasons coded differed between classifiers. We therefore carried out a reliability analysis between the two classifiers regarding the number of reasons coded, using the weighted kappa test. 13 In addition, we analyzed the reliability of ICPC-2 at three levels: according to chapter (for instance, if one classifier coded a reason as P01 and the other as P02, there was agreement in terms of the chapter); according to full code within a given chapter (analyzing chapter P separately, classifiers should agree as to the full code, e.g., P01 and P01); and according to global full code (classifiers should agree as to the full code, but reliability was calculated for the set of all codes, i.e., without taking the chapter into consideration). The analysis of chapters and full codes was subsequently stratified according to sex, schooling, and occupation. Stratification was carried out because subjects, depending on their characteristics, tend to express themselves in different ways, in greater or lesser detail, and using different linguistic peculiarities. Since self-administered questionnaires do not allow for further clarification, the quality of the report may make coding more difficult. In all cases, reliability between the two classifiers was estimated using the kappa statistic. 1 Confidence intervals (95%CI) for the kappa statistic were calculated using the kapci routine, 10 developed for Stata software. We used the classification proposed by Shrout 11 for interpreting kappa values, as follows: kappa < 0.10, virtually no reliability; 0.10 to 0.40, slight reliability; 0.41 to 0.60, fair reliability; 0.61 to 0.80, moderate reliability; and 0.81 to 1.0, substantial reliability. This classification has recently been employed by a number of authors, 12 and represents a further development of the classification proposed by Landis & Koch. 4
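The crude kappa statistic, its use at chapter versus full-code level, and the interpretation bands described above can be sketched in Python as follows. This is a minimal illustration and not the Stata routine used in the study; the function names and the example codes are hypothetical, and values falling between two published bands (for example 0.405) are assigned to the lower band as an assumption.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Crude (unweighted) kappa for two equally long lists of labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    # Proportion of items on which both classifiers gave the same label.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each classifier's label frequencies.
    m1, m2 = Counter(first), Counter(second)
    expected = sum(m1[label] * m2[label] for label in m1) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def shrout_label(kappa):
    """Interpretation bands as listed in the text (gaps go to the lower band)."""
    if kappa < 0.10:
        return "virtually no reliability"
    if kappa <= 0.40:
        return "slight reliability"
    if kappa <= 0.60:
        return "fair reliability"
    if kappa <= 0.80:
        return "moderate reliability"
    return "substantial reliability"


# Invented codes from two classifiers for six reported reasons.
first = ["P01", "P76", "N01", "R74", "K85", "N01"]
second = ["P02", "P76", "N01", "R74", "K86", "N01"]

full_kappa = cohen_kappa(first, second)
# Chapter-level analysis keeps only the leading letter of each code.
chapter_kappa = cohen_kappa([c[0] for c in first], [c[0] for c in second])

print(round(full_kappa, 2), round(chapter_kappa, 2))  # 0.6 1.0
print(shrout_label(chapter_kappa))  # substantial reliability
```

In this invented example the two classifiers always agree on the chapter but disagree on two full codes, so chapter-level kappa is higher than full-code kappa, the same pattern reported for the study data.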

RESULTS
The first classifier coded a total of 1,641 reasons, and the second a total of 1,621 reasons, with both classifiers identifying a median of two reasons per subject. Analysis of the number of reasons coded per subject showed a crude agreement of 82.4%, with crude kappa = 0.78 (95%CI: 0.77;0.78) and weighted kappa = 0.94 (95%CI: 0.93;0.94), indicating substantial reliability.
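To illustrate why the weighted coefficient can exceed the crude one for counts, here is a minimal Python sketch of weighted kappa. The source does not state which weighting scheme was used, so linear disagreement weights (|i - j|) are assumed, with quadratic weights available through the hypothetical `power` parameter; the counts in the example are invented.

```python
from collections import Counter


def weighted_kappa(first, second, power=1):
    """Weighted kappa for two lists of integer ratings.

    Disagreement between ratings i and j is weighted by |i - j| ** power
    (power=1 gives linear weights, power=2 quadratic weights).
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    # Mean weighted disagreement actually observed.
    observed = sum(abs(a - b) ** power for a, b in zip(first, second)) / n
    # Mean weighted disagreement expected if the two raters were independent.
    m1, m2 = Counter(first), Counter(second)
    expected = sum(
        abs(a - b) ** power * m1[a] * m2[b] for a in m1 for b in m2
    ) / n ** 2
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one value")
    return 1 - observed / expected


# Invented numbers of reasons coded per subject by each classifier.
n_first = [1, 2, 3, 1, 2, 3]
n_second = [1, 2, 3, 1, 2, 2]

print(round(weighted_kappa(n_first, n_second), 2))  # 0.8
```

A near miss such as 3 versus 2 reasons is penalized less than a gap of 3 versus 1, so weighted kappa gives partial credit that the crude coefficient does not.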
Table 1 shows that reliability estimates for chapter codes were substantial, both globally and within each stratum. When the full code was considered, estimated reliability was moderate to substantial. The small difference related to occupation in the analysis of full codes ceased to exist when only chapter codes were considered. Reliability was also similar for both sexes. However, there were differences related to schooling, with reliability being lower for subjects with less than high school than for the remainder. Nevertheless, reliability was still considered moderate among the former.
Table 2 presents the full code agreement level within each chapter of ICPC-2. Reasons included in some of the chapters were infrequent, which hindered the evaluation of reliability for these chapters. Substantial reliability was seen for chapters N (nervous system), R (respiratory tract), and P (psychological). Events tended to be concentrated in a handful of codes, such as headache (178 of 248) in the nervous system chapter; acute upper respiratory tract infection (35 of 183), chronic/acute sinusitis (28 of 183), and flu (52 of 183) in the respiratory tract chapter; and feelings of anxiety/nervousness/tension (71 of 263), acute reaction to stress (31 of 263), and depressive disorders (99 of 263) in the psychological chapter. Chapter K (circulatory system) showed slight reliability. In-depth analysis of discrepancies in chapter K showed that 19 of 41 discrepancies were coded by one classifier as K85 (elevated blood pressure) and by the other as K86 (uncomplicated hypertension), two codes with similar meaning. Had both classifiers attributed the same code to such reasons, crude agreement would have risen to 74.4%, with a crude kappa of 0.7 (95%CI: 0.63;0.70), considered moderate.
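The sensitivity check described above, in which K85 and K86 are treated as a single code, can be sketched as follows. The data are invented for illustration and do not reproduce the figures reported for the study; `MERGED`, `collapse`, and `crude_agreement` are hypothetical names.

```python
def crude_agreement(first, second):
    """Proportion of reasons given the same code by both classifiers."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


# Codes regarded as equivalent for this check: K86 is recoded as K85.
MERGED = {"K86": "K85"}


def collapse(codes, mapping=MERGED):
    """Replace each code by its merged equivalent, if it has one."""
    return [mapping.get(code, code) for code in codes]


# Invented chapter K codes from two classifiers.
first = ["K85", "K85", "K86", "K74", "K85"]
second = ["K86", "K85", "K86", "K74", "K86"]

before = crude_agreement(first, second)
after = crude_agreement(collapse(first), collapse(second))
print(before, after)  # 0.6 1.0
```

The same recoding step could be applied before computing kappa, which is how a merged-code coefficient such as the one reported above would be obtained.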

DISCUSSION
The present study detected substantial reliability between two classifiers for number of reasons and chapter, and moderate to substantial reliability with respect to full codes. Such high reliability was unexpected, given that one of the classifiers had no prior experience in morbidity coding.
The lower full code reliability of reasons reported by subjects with elementary schooling may indicate that classifiers had difficulty in interpreting the language used by this group. On the other hand, the classification could be made to encompass a wider range of expressions, thus broadening its applicability.
Reliability at the chapter level was similar for health professionals and for other employees, being actually slightly higher among the latter. Health professionals were expected to have greater ability to provide information on their health-related problems, and interobserver reliability was thus expected to be higher for this group. However, this was not the case. A possible explanation for this result is that more detailed reports may actually lead to greater difficulty in classification. For instance, "disc herniation in the cervical spine with canal compromise" ("hérnias discais na coluna cervical com comprometimento do canal") was coded by one classifier as "vertebral syndrome," and as "vertebral syndrome with radiating pain" by the other.
The concentration of reasons into a handful of more frequent codes may have contributed to the higher reliability of these chapters. The slight reliability of coding for the circulatory system chapter may be attributed to the existence of different codes for classifying very similar health problems. The distinction between elevated blood pressure and uncomplicated hypertension may be relevant in other contexts, but is not in the case of primary care or health surveys, given that the patient or respondent may not have a clear idea of the difference between these two terms, and would be likely to use either one indiscriminately. Nevertheless, given that this is an important aspect of coding, it will require special emphasis when classifiers are trained.
The only similar study found in the literature was carried out by Letrilliart et al. 9 These authors studied the reliability of the codes attributed by general practitioners trained in using ICPC-2, who classified health problems directly from the patient's discourse, and by epidemiologists, who based their classification on medical information extracted from a database. The authors found a weighted kappa coefficient of 0.65 (95%CI: 0.52;0.77) for reliability in the number of reasons coded. Crude agreement at the code level (considering chapter code only) was 69.2% (83 of 120) and crude kappa was 0.84 (95%CI: 0.78;0.91).
Even though the comparison of kappa statistics based on different study populations is questionable (given that this measure is influenced by the prevalence of the phenomenon at hand), these values are lower than those estimated in the present study. One may speculate that these discrepancies can be explained, at least partially, by the use of different sources for capturing the data (primary vs. secondary) as well as different professionals (clinicians vs. epidemiologists).
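The dependence of kappa on prevalence can be seen in a small numerical sketch. Both hypothetical 2x2 tables below have 90% crude agreement between two raters, yet kappa differs because chance agreement rises when one category dominates. The function name and the counts are assumptions made for illustration.

```python
def kappa_2x2(both_yes, only_first, only_second, both_no):
    """Crude kappa from the four cells of a two-rater yes/no table."""
    n = both_yes + only_first + only_second + both_no
    observed = (both_yes + both_no) / n
    # Marginal proportion of "yes" for each rater.
    first_yes = (both_yes + only_first) / n
    second_yes = (both_yes + only_second) / n
    # Chance agreement: both say yes or both say no, assuming independence.
    expected = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    return (observed - expected) / (1 - expected)


# Same crude agreement (90 of 100) in both tables.
balanced = kappa_2x2(45, 5, 5, 45)  # condition present in about half
skewed = kappa_2x2(85, 5, 5, 5)     # condition present in about 90%

print(round(balanced, 2), round(skewed, 2))  # 0.8 0.44
```

Identical observed agreement therefore yields a much lower kappa in the skewed table, which is why coefficients from populations with different code distributions should be compared with caution.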
In the study by Van der Heyden et al, 14 ICPC-2 was applied to an open question included in a self-administered questionnaire. These authors reported certain problems with using this classification, including lack of specificity of responses, responses inadequate to the question, cases in which the code to be employed was not clear, and high time demand for coding responses. Lack of specificity was also common in the present study. This may constitute a drawback in applying ICPC-2 to self-administered questionnaires, given that further clarification of obscure points in the response is not possible. In the case of primary care, or when questionnaires are completed by the interviewer, there is the possibility of asking further questions in case of unspecific responses. In the present study, there were no cases of responses that were inadequate to the question, nor cases in which the code to be used was not clear.
We found that classification became faster as classifiers became more experienced with the process. However, coding may have taken longer than necessary due to certain problems with ICPC-2, including the lack of expressions in the index such as "allergy," "muscular distension," "nausea," "flu," "hernia," and "cold," to list only a few. The classification of procedures is also flawed, including terms such as "excision" and "exeresis," but lacking the word "surgery." A classification whose use is not restricted to physicians 7 should carry a wider range of options, so that individuals without prior knowledge of such procedures will also be able to work with this system.
In conclusion, ICPC-2 showed good interobserver reliability for coding health reasons for the interruption of usual activities. The fact that the population of the present study comprised the staff of a public university means that it differs from the general population in several aspects. Therefore, the present results should only be generalized to populations with a similar profile to that of the Pró-Saúde Study. However, analyses according to sex, schooling, and occupation show similar performance across different strata, suggesting that ICPC-2 may also perform adequately in other contexts.

Table 1. Agreement between classifiers with respect to full code and chapter when classifying reported reasons for medical appointments, according to respondent sex, schooling, and occupation, using ICPC-2. Pró-Saúde Study, Rio de Janeiro, Southeastern Brazil, 2001.

Table 2. Interobserver agreement with respect to full code when classifying reported reasons for medical appointments according to chapter, using ICPC-2. Pró-Saúde Study, Rio de Janeiro, Southeastern Brazil, 2001.