Brazilian WHOQOL-OLD module version: a Rasch analysis of a new instrument

OBJECTIVE: To evaluate the Brazilian version of the WHOQOL-OLD Module and to test potential changes to the instrument to increase its psychometric adequacy. METHODS: A total of 424 older adults living in a city in Southern Brazil completed the WHOQOL-OLD instrument in 2005. Rasch analysis, as implemented in the RUMM2020 software, was used to explore the psychometric performance of the scale. Item-trait interaction, threshold disorders, presence of differential item functioning and item fit were analyzed. RESULTS: Two of the six domains (“death and dying” and “sensory abilities”) showed inadequate item-trait interactions. Rescoring the response scale and deleting the most misperforming items led to scale improvement. The evaluation of domains and items individually showed that the “intimacy” domain performs well, in contrast to the findings obtained with the classical approach. In addition, the “sensory abilities” domain does not derive an interval measure in its current format. CONCLUSIONS: Unidimensionality and local independence were seen in all domains. Changes in the response scale and deletion of problematic items improved the scale’s performance.

The world has been experiencing a profound and irreversible demographic shift as older people are living longer and healthier lives than ever before. 24 The most dramatic increases in the proportion of older people are evident in the most advanced age groups (people over 80 years old), with an almost fivefold increase from 69 million in 2000 to a projected 377 million in 2050. 24 The World Health Organization (WHO) has described this demographic shift as both a major societal achievement and a challenge. 25 Increased longevity has been experienced in the developed and the developing world alike, but whereas developed countries grew rich before they grew old, developing countries are growing old before they have grown rich. 25 This shift in the age pyramid, driven by a growing elderly population, demands further research specifically addressing the aging process. One important area to be assessed is quality of life. Although there are several studies on this issue, systematic reviews have pointed out that the instruments most frequently used in these investigations are not sufficiently comprehensive and/or are not validated for application in older adult populations. 4,11

INTRODUCTION
The WHO Quality of Life Group has recently developed the WHOQOL-OLD Module. 16 Developed through a simultaneous transcultural methodology, this instrument is designed to be suitable for cross-cultural comparisons. In addition, it was developed specifically to assess the quality of life of the elderly, thus ensuring that important areas concerning old age are covered by the instrument. Its comprehensiveness is supported by an intensive initial qualitative phase. 7,10 The WHOQOL-OLD Module represents an additional tool, alongside the WHOQOL-100 or WHOQOL-BREF, and it is a useful alternative in the investigation of quality of life in older adults, including relevant aspects not covered by instruments originally designed for non-elderly populations.
The validation of the Brazilian version of the WHOQOL-OLD Module is reported in detail elsewhere. 8 Briefly, it involved a classical psychometric approach to analyze internal consistency, discriminant validity, criterion validity, concurrent validity and test-retest reliability. The findings indicated suitable psychometric properties for this version.
Rasch measurement theory is a modern psychometric approach to the development and validation of instruments. It has emerged as a powerful tool for examining instrument performance in depth, allowing both the instrument as a whole and individual items to be assessed. In addition, the Rasch model is helpful for providing potential solutions for misperforming instruments. It is suggested that combining traditional and modern psychometric approaches is a valuable strategy to enhance the power of validation processes. 20 Furthermore, the use of the Rasch measurement model for the development and application of quality of life instruments has been increasingly stressed. 16,19,21 The present study aimed at evaluating the Brazilian version of the WHOQOL-OLD Module using a modern psychometric approach and testing potential changes to the instrument in order to increase its psychometric adequacy.

METHODS
The data collected for the original classical validation 8 was also analyzed in this study. A minimum sample of 300 subjects stratified by gender (50% women and 50% men), age (60-69 years, 70-79 years and 80 or older) and self-perceived health status (50% considering themselves healthy and 50% unhealthy) was selected at a university hospital, in nursing homes and in the community, according to the WHOQOL-OLD project. Convenience sampling was used. The stratification process provided minimum subsamples that allowed for the assessment of the instrument under different conditions. Inclusion criteria were age 60 or above and clinical ability to understand and respond to the instruments administered. Subjects were required to answer the question "In general, do you consider yourself healthy or unhealthy?" and were later stratified as healthy or unhealthy exclusively according to their subjective self-perception, regardless of their actual objective health status. This methodology is based on the theoretical background for quality of life instruments developed by the WHO, in which the quality of life construct is seen as multidimensional and essentially subjective. 23
Subjects completed a sociodemographic information form, the WHOQOL-OLD Module and the Geriatric Depression Scale 15-item version. 18 The WHOQOL-BREF instrument was also part of the assessment, and its psychometric performance is reported elsewhere. 4 The sociodemographic information form included questions about gender, age, educational level, marital status, subjective self-perception of health status, and consumption of alcohol, tobacco and illegal substances. The data obtained from this questionnaire was utilized for demographic description, as well as for differential item functioning (DIF) analysis.
The WHOQOL-OLD is a 24-item self-report instrument divided into six domains (sensory abilities, autonomy, past-present-future activities, social participation, death and dying, and intimacy). Each domain provides an individual score. In addition, an overall score is calculated from the set of 24 items. Answers are based on a 5-point Likert response scale. 16 It is validated in Brazilian Portuguese, and this version presents good classical psychometric performance. 8
Data was examined by way of the Rasch model using the RUMM2020 software. 3 Linacre states that the ideal sample size varies according to the scale targeting. For a well-targeted scale (40-60% endorsement rates on dichotomous items), a sample size of 108 would give 99% confidence of person estimation within ±0.5 logits. For scales that are not well targeted, though, the minimum sample size for satisfactory estimation would be 243 subjects. 13
The Rasch model is understood as a template which puts into operation the axioms for additive conjoint measurement. 14 This theory presents a set of methods to determine whether a variable has an additive structure and is therefore amenable to measurement on an interval scale. 17 Originally developed for dichotomous scales, the Rasch model is also applicable to polytomous data. 1 Basically, the Rasch model assumes that the probability of a given subject endorsing an item is a function of the relative distance between the item location and the person location on a common linear scale. 15 In the case of a scale measuring depression, for example, the probability that a person endorses an item is a logistic function of the difference between the subject's ability (level of depression) and the level of depression expressed by the item. The following equation illustrates this statement:
$$\ln\left(\frac{P_n}{1 - P_n}\right) = \theta - b$$
where ln is the natural logarithm, P_n is the probability that person n endorses the item, θ is the person's level of ability and b is the level of ability expressed by the item.
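The logistic relationship described above can be illustrated with a minimal Python sketch for the dichotomous case. The function names and the numeric values are illustrative assumptions, not part of the study or of the RUMM2020 software.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability that a person with ability `theta` endorses a
    dichotomous item located at `b` (both expressed in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p), i.e. the left-hand side
    of the Rasch equation."""
    return math.log(p / (1.0 - p))


# Hypothetical values: a person 1 logit above the item location.
p = rasch_probability(theta=1.5, b=0.5)
# The log-odds recovers the distance theta - b between person and item.
distance = log_odds(p)
```

When the person and item locations coincide, the probability of endorsement is 0.5, and taking the log-odds of the modelled probability returns the person-item distance in logits.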
If the data fits the Rasch model, then both the person's ability and the item difficulty are placed on a common metric (the log-odds unit scale, or logit), which allows a linear transformation of the raw scale. Thus, when the data fits the model and the assumption of local independence is met, the scale is suitable for valid parametric approaches. 14 Since Rasch analysis is strongly dependent on unidimensionality, each of the six WHOQOL-OLD domains was tested individually as a separate scale. 15 Apart from unidimensionality, local independence is also considered a Rasch assumption. Items are required not to depend on each other, so that the probability of endorsing one item is not associated with the response to any other item in the scale. Local independence should be examined for each scale to be analyzed by the Rasch model. If the Rasch assumptions are satisfied and the scale fits the expected model, then the performance of the instrument is also stable and does not depend on the sample being assessed, or on characteristics such as gender or age. This property is called specific objectivity. 21
First, overall fit statistics were examined. Item-trait interaction was analyzed using the chi-square test, which indicates that the invariance property holds if the p-value is not significant (thus indicating similarity between the expected and observed models). The standardized distributions of items and persons were examined by way of a diagram.
Furthermore, individual item statistics were analyzed for residuals and chi-square statistics. Again, if a given item fits the model, low residuals (within ±2.5) and non-significant chi-square statistics are expected. Bonferroni correction was applied to control for multiple testing. Threshold disorders were also examined using threshold maps and category probability curves for each individual item.
An estimate of internal consistency was also obtained through the person separation index (PSI), which is comparable to Cronbach's alpha coefficient. Items were examined for DIF. The presence of DIF indicates that a subgroup (e.g., males or young adults) responds to an item in a consistently different way, despite having the same amount of the latent trait. Both uniform DIF (when the difference is constant through the whole range of the item curve) and non-uniform DIF (when the difference occurs only at a certain level of the attribute) were checked.
Finally, modifications were tested when fit statistics indicated misfit. Item rescoring and deletion were carried out in order to achieve the best item structure possible.
All respondents were informed about the objectives of the study and the confidentiality of the data obtained. Subjects signed an informed consent form approved by the Research Ethics Committee of the university hospital where the study was carried out.
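The item-level fit criteria just described (a residual within ±2.5 and a chi-square p-value that is non-significant after Bonferroni correction) can be sketched as a simple check. The nominal significance level of 0.05 and the function names are assumptions made for illustration only; the text does not state the nominal alpha used.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def item_fits(residual: float, p_value: float, n_items: int,
              alpha: float = 0.05, residual_limit: float = 2.5) -> bool:
    """True when the item shows a low fit residual and a chi-square
    p-value that is not significant at the corrected level.
    `alpha=0.05` is an assumed nominal level, not taken from the study."""
    low_residual = abs(residual) <= residual_limit
    not_significant = p_value >= bonferroni_alpha(alpha, n_items)
    return low_residual and not_significant


# Hypothetical four-item domain: the corrected level is 0.05 / 4.
corrected = bonferroni_alpha(0.05, 4)
```

With four items in a domain, for example, a chi-square p-value of 0.03 would not count as significant under this rule, because it exceeds the corrected level of 0.0125.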
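As a rough sketch of how a person separation index can be computed, the block below uses a common textbook formulation: the proportion of the observed variance in person estimates that is not attributable to measurement error. This formulation, the function name and the example values are assumptions; the exact computation performed by RUMM2020 may differ in detail.

```python
import statistics


def person_separation_index(estimates: list[float],
                            standard_errors: list[float]) -> float:
    """PSI = (observed variance - mean squared standard error)
             / observed variance.
    `estimates` are person locations in logits and `standard_errors`
    are their standard errors, in the same order."""
    if len(estimates) != len(standard_errors):
        raise ValueError("estimates and standard_errors must align")
    observed_variance = statistics.variance(estimates)
    if observed_variance == 0:
        raise ValueError("person estimates have no variance")
    error_variance = statistics.fmean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance


# Hypothetical person locations and standard errors (logits).
psi = person_separation_index([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

Like Cronbach's alpha, the index approaches 1 when measurement error is small relative to the spread of persons along the scale.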

RESULTS
The sample comprised 424 subjects and its characteristics are described in Table 1. The Geriatric Depression Scale means and standard deviation (SD) indicate that the sample is predominantly non-depressed. In addition, around two thirds of the subjects perceived themselves as being healthy, despite their objective health condition. Subjective self-perception is known to be related to depression levels. Thus, the high rate of "healthy" subjects may be considered an indirect effect of low depression levels in the sample.
As for the Rasch analysis results, the check for missing values showed that only items 1 and 3 had missing values, and at extremely low rates (between 0.2% and 0.4%). The distributions of responses across the five points did not show major problems. These findings corroborate the high responsiveness of the WHOQOL-OLD in a Brazilian sample. It is likely that the close assistance research staff offered to subjects during data collection is related to the unexpectedly low number of missing values. Table 2 shows item contents, missing values, medians and distributions.
The item-trait interaction was analyzed for the six domains individually through chi-square statistics. This test aims at checking whether the observed model (i.e., the data collected) fits the expected model (based on a probabilistic adaptation of the Guttman scale). 2 Thus, as Kline states, it is primarily a test of "badness-of-fit," since statistically significant results (p-values below the critical value, after Bonferroni correction) indicate that the observed model is different from the expected one. 12 The "death and dying" domain had an inadequate result (domain χ²=51.72, p=0.00012). The "sensory abilities" domain also showed high chi-square results (domain χ²=101.10, p<0.0001).
Items 4, 5, 9 and 20 showed reversed thresholds. A threshold is the point at which a subject is equally likely (probability of 0.50) to respond in a given response category or in the adjacent one. Threshold disorders thus suggest that the response scale does not discriminate efficiently between two ability levels, so that a subject with more ability could respond in the same category as one with lower ability. In other words, the response scale would not be working adequately to order subjects with distinct levels of ability. These items were examined and rescored according to the point of the disorder. Figure 1 illustrates the category probability curves of item 5 in its original 5-point form and after rescoring. The original form presents reversed thresholds (i.e., category number 2 is not the most probable response at any point). After rescoring, categories are well distributed. The RUMM2020 software 3 automatically renames the categories in order to assign the value 0 to the first category. In the instrument, however, the categories range from 1 to 5.
The distributions of persons and item thresholds are illustrated in Figure 2. Persons' locations are placed on the top half of the chart. The mean person location was 0.719 logits (SD=0.744). This is slightly above the mean item location (which would be zero logits). The threshold distribution is located on the bottom half of the chart. The scale's peak of information (if taken as a 24-item set) is located between 0 and -1 logits. However, thresholds adequately cover the whole range of ability, which ensures that the scale is able to provide information at all levels.
DIF was assessed by gender (male and female) and age (60 to 79 years and 80 or older). Item bias indicates that item performance is not homogeneous: the item performs differently for different subjects even after controlling for the level of the underlying construct measured by the test. 6 As a result, scores obtained from an item with DIF are not comparable across populations. Items were analyzed for uniform and non-uniform DIF. Briefly, the former refers to a constant difference in functioning across the entire spectrum of the construct, while the latter indicates that DIF is seen only in a certain part of the curve. 5 Uniform DIF items can either be excluded from the scale or, alternatively, be used to create two different scales (in which case the item would have distinct weights in each). 22 Item 3 ("sensory abilities" domain) showed uniform DIF for age. No DIF was found for other items.
The first step in the scale modification was rescoring response categories. Besides solving the threshold disorders, rescoring improved the item-trait interaction for the "sensory abilities" domain (original χ²=142.44; after rescoring χ²=93.32). This improvement was not sufficient to fit this domain to the expected model. Item 3 showed differential functioning, as well as misfit in the chi-square test and in the residuals. These three statistics suggest that item 3 is not performing according to the expected Rasch model. Thus, item 3 was deleted and the domain was re-examined. The item-trait interaction showed further improvement (χ² changed from 93.32 to 59.28). However, the model still misfits after the deletion.
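To make the notion of thresholds more concrete, the sketch below computes category probabilities for a polytomous item under a partial-credit formulation of the Rasch model and checks whether a set of thresholds is in increasing order. The formulation, the function names and the threshold values are assumptions for illustration; they are not estimates from this study.

```python
import math


def category_probabilities(theta: float,
                           thresholds: list[float]) -> list[float]:
    """Probability of each response category 0..m for a person at
    `theta`, given m thresholds (all in logits). Category k has weight
    exp(sum over the first k thresholds of (theta - threshold))."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds: list[float]) -> bool:
    """True when each threshold lies above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


# Hypothetical thresholds for a 5-category item (4 thresholds).
ordered = [-1.5, -0.2, 0.4, 1.8]
disordered = [-1.0, 0.6, 0.1, 1.5]   # second and third are reversed

# At a threshold, the two adjacent categories are equally likely.
probs_at_first = category_probabilities(ordered[0], ordered)
```

At a person location equal to the first threshold, categories 0 and 1 receive the same probability, which corresponds to the 0.50 conditional probability mentioned above. A reversed pair of thresholds, as in the `disordered` example, is what signals a candidate for rescoring.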
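Rescoring of this kind is applied at the analysis stage by collapsing adjacent response categories. The sketch below shows the general mechanism only; the specific mapping (merging original categories 2 and 3) is a hypothetical example, since the text does not state which categories were merged for each item.

```python
from typing import Optional

# Hypothetical mapping from the instrument's 1-5 categories to a
# collapsed 0-3 scale in which original categories 2 and 3 are merged.
# The first category is scored 0, following the software convention.
COLLAPSE_2_AND_3 = {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}


def rescore(responses: list[Optional[int]],
            mapping: dict[int, int]) -> list[Optional[int]]:
    """Apply a category mapping to one item's responses.
    Missing responses (None) are left as missing."""
    return [None if r is None else mapping[r] for r in responses]


# Example responses to a single item, including one missing value.
rescored = rescore([1, 2, 3, None, 4, 5], COLLAPSE_2_AND_3)
```

Because the mapping is applied to data that have already been collected, the questionnaire itself does not need to change.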
The "death and dying" domain also showed item-trait interaction misfit in its original format (χ²=60.03). Rescoring item 20 resulted in improvement of the model (χ²=51.72). At this stage, values were still significant, indicating persistent misfit. Deletion of item 18 (which presented high chi-square results) resulted in an adjusted structure. Table 3 describes the fit statistics for the refined WHOQOL-OLD version.

DISCUSSION
The WHOQOL-OLD Module was developed through a simultaneous transcultural methodology, which is able to include different cultural contexts from the first steps of instrument construction. 9 This is regarded as a major characteristic of the WHOQOL-OLD. 16 In addition to the theoretical design, it is also crucial that a new international measure is adequately validated. This ensures that the original strengths of the instrument remain in the new version in a different language. The validation of a scale or instrument is a longitudinal process and ideally should involve testing in distinct contexts.
The combination of different psychometric approaches for the validation or development of a new measure is supported in the literature. In particular, it has been argued that the Rasch measurement model is able to add important input, since it puts into operation the axioms for additive conjoint measurement. 14 Using both traditional and Rasch analyses seems to be a useful strategy and provides relevant insight regarding scale performance. 16,20 The findings of the present study are in line with the results previously reported through classical psychometric theory. 8 The "sensory abilities" domain showed inadequate performance in multiple linear regression analyses in previous studies. The Rasch analysis corroborated the misfit of this domain. The "intimacy" domain, however, showed misperformance in the classical psychometric approach (multiple linear regression), but not in the Rasch analysis. This discrepancy indicates that the domain itself functions well as a set, and that the items show satisfactory performance. It is suggested that the previous findings are due to limitations of multiple linear regression, particularly the choice of a suitable dependent variable.
Rescoring and item deletion did not result in adequate improvement in the "sensory abilities" domain. Interestingly, item rescoring and deletion significantly improved the performance of the "death and dying" domain. After these changes, the domain statistics fit the Rasch model. These potential changes should not produce crucial modifications in the scale format, since they can be made during the statistical analysis phase and not necessarily at the data collection stage. Replications of these findings in different samples are needed to confirm the results.