Assessment of the measurement properties of quality of life questionnaires in Brazilian women with breast cancer

BACKGROUND: There are several questionnaires available to assess quality of life in breast cancer; however, the choice of the best questionnaire often does not take into account the adequacy of these questionnaires' measurement properties. OBJECTIVE: To test the measurement properties of two generic quality of life questionnaires and one quality of life questionnaire specific for women with breast cancer. METHOD: We assessed 106 women after surgery for breast cancer. The assessment included application of the SF-36, WHOQOL-bref, and FACT-B+4 questionnaires as well as the Global Perceived Effect and Pain Numerical Rating scales. The participants were interviewed on three occasions to investigate internal consistency, floor and ceiling effects, construct validity, reproducibility, and responsiveness. RESULTS: Most of the instruments' domains showed adequate internal consistency (Cronbach's alpha varying from 0.66 to 0.91). Reliability varied from poor to substantial (ICC2,1 between 0.39 and 0.87) and agreement varied from negative to very good. The SF-36 presented doubtful agreement and showed floor and ceiling effects in three domains. The domains of the generic questionnaires presented moderate to good correlation with the FACT-B+4 (Pearson varying from 0.31 to 0.69). The internal responsiveness varied from small to large (ES varying from -0.26 to 0.98) and external responsiveness was found in only some of the instruments' domains. CONCLUSIONS: Most of the measurement properties tested for the WHOQOL-bref and FACT-B+4 were adequate as was their ability to assess quality of life in women with breast cancer. The SF-36 showed inadequacy in agreement and floor and ceiling effects and should not be used in women with breast cancer.


Introduction
Breast cancer is a significant public health issue in Brazil, and it is considered the second most common cause of death among women 1 . After surgical treatment, the patient experiences severe physical and motor consequences that negatively influence the clinical condition. Some examples of these changes are limitation of the upper limb movements, pain and functional impairment, paresthesia, postural asymmetries, fibrosis of the glenohumeral joint, and lymphedema [2][3][4][5] . Some studies show the correlation between the treatment of breast cancer and functional impairment and demonstrate that the measurement of quality of life related to health becomes important to understand how the functional impairment interferes, in general, in the daily activities of the women diagnosed with breast cancer [6][7][8][9] .
Quality of life (QoL) assessment relies mainly on questionnaires, most of which have been created in English and are aimed toward English-speaking populations [10][11][12] . The number of instruments available to assess QoL in cancer patients has increased and today there are several breast cancer-specific questionnaires in the literature 12,13 . The Functional Assessment of Cancer Therapy - Breast plus Arm Morbidity (FACT-B+4) is a QoL questionnaire specific for women with breast cancer. The FACT-B+4 has already been tested in the Brazilian population and showed appropriate internal consistency, reproducibility, and construct validity 14 compared with other specific QoL questionnaires.
Additionally, generic questionnaires can be proposed for this assessment. The Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36) and World Health Organization Quality of Life - bref (WHOQOL-bref) questionnaires have been used to assess general QoL in Latin America [15][16][17][18][19][20] . However, the measurement properties of most instruments have not always been tested in the language and target population in which they are applied. To date, no published studies have completely tested the measurement properties of QoL assessment questionnaires in Brazilian-Portuguese and applied them to women with breast cancer.
Considering the choice of the most appropriate questionnaire for women with breast cancer, the aim of the present study is to test the measurement properties of the SF-36 and WHOQOL-bref compared to the FACT-B+4. The secondary objectives are to determine the preference and acceptance of the QoL questionnaire and assess its ease of comprehension. The hypothesis of this study is that the generic questionnaires available for general clinical purposes will be acceptable and will have good clinimetric results for the population of women with breast cancer when compared to the FACT-B+4.

Method
Sample
The study included 106 women who underwent breast cancer surgery, constituting a convenience sample that was assessed between 27 March and 28 November 2012. The inclusion criteria were: women aged 18 years or more with a primary diagnosis of breast cancer at any stage of the disease, who had undergone breast cancer surgery in the last 5 years, had been discharged from hospital (to avoid immediate postoperative adaptations and consequent influence on the QoL), had received or were currently receiving treatment with radiotherapy, chemotherapy, and/or hormone therapy, and were recruited at Hospital do Câncer AC Camargo - Fundação Antônio Prudente, in the city of São Paulo, SP, Brazil. The exclusion criteria were: breast cancer as a secondary diagnosis and inability to read, write or speak fluently in Portuguese.
The women who agreed to participate signed an informed consent form prior to data collection. The study was approved by the Research Ethics Committee of Universidade Cidade de São Paulo (UNICID), São Paulo, SP, Brazil (protocol 13616825) and by the Human Research Ethics Committee of Fundação Antônio Prudente - Hospital do Câncer AC Camargo, São Paulo, SP, Brazil (protocol 1627/11).

Assessment sheet
An assessment sheet was used to gather sociodemographic data, clinical data, and the clinical characteristics of the cancer. Some data were obtained directly from the patient's electronic medical records.

Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36)
The SF-36 21 26, 28 and 30), and one extra question (question 2) not included in the total score. The score for each dimension varies from 0 to 100, with zero being the worst possible health condition and 100 being the best possible health condition 22 . The score was calculated according to the scoring rules of the RAND 36-Item Health Survey 1.0, in two phases: 1) all of the items were scored on a scale of 0 to 100; and 2) the mean of the items of each dimension was calculated to create the eight scores of the scale. Unanswered questions were not included in the calculation, so the score for each dimension represents the mean of all answered items 23 .

World Health Organization Quality of Life - bref (WHOQOL-bref)
The WHOQOL-bref questionnaire is an abbreviated version of the WHOQOL-100 24 that has been adapted to Brazilian-Portuguese 25 . It contains 26 questions, including 2 general questions, and the remaining 24 questions representing each of the 24 aspects of the original instrument. It is divided into four domains: physical health (questions 3, 4, 10 and 15 to 18), psychological (questions 5, 6, 7, 11, 19 and 26), social relationships (questions 20 to 22), and environment (questions 8, 9, 12 to 14 and 23 to 25). The WHOQOL-bref scores are calculated according to an algorithm 26 that considers the number of answered questions in each of the domains and standardizes the scores of all domains from zero to 100, with zero being the worst possible health condition and 100 being the best health condition. The algorithm inverts the score values for questions 3, 4, and 26 to calculate the final score 25,27,28 .
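The domain scoring described above can be illustrated with a minimal sketch. It assumes that each question is answered on a 1 to 5 scale and that the domain mean is first expressed on a 4 to 20 range and then rescaled to 0 to 100, which is our understanding of the published scoring convention rather than a transcription of the algorithm cited in the text. The function name and the `min_answered` threshold are hypothetical.

```python
# Domain composition and reversed questions as listed in the text.
DOMAINS = {
    "physical health": [3, 4, 10, 15, 16, 17, 18],
    "psychological": [5, 6, 7, 11, 19, 26],
    "social relationships": [20, 21, 22],
    "environment": [8, 9, 12, 13, 14, 23, 24, 25],
}
REVERSED = {3, 4, 26}


def whoqol_domain_score(answers, domain, min_answered=1):
    """Return the 0-100 score of one domain, or None if too few answers.

    answers: dict mapping question number -> response on an assumed 1-5
             scale; unanswered questions are absent from the dict.
    min_answered: hypothetical threshold, not taken from the source.
    """
    values = []
    for question in DOMAINS[domain]:
        if question in answers:
            response = answers[question]
            # Reversed questions are inverted so that higher is always better.
            values.append(6 - response if question in REVERSED else response)
    if not values or len(values) < min_answered:
        return None
    mean_item = sum(values) / len(values)  # mean of answered questions only
    raw = 4 * mean_item                    # raw domain score, 4 to 20
    return (raw - 4) * 100 / 16            # rescaled to 0 to 100
```

For instance, a respondent who chooses the middle option for every question obtains 50 in each domain, and one unanswered question in a domain leaves the score defined by the remaining answers.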

Functional Assessment of Cancer Therapy - Breast plus Arm Morbidity (FACT-B+4)
The breast cancer-specific questionnaire FACT-B+4 consists of 36 questions, 27 of which refer to overall QoL and 9 to specific problems of patients with breast cancer 29 . In 2001, a four-question subscale was added to the FACT-B questionnaire to assess arm morbidity in patients submitted to breast surgery 30 . The FACT-B+4 has been adapted into Brazilian-Portuguese 31 . It is divided into six scales with independent scores: physical well-being ranging from 0 to 28 (questions GP1 to GP7), social/family well-being ranging from 0 to 28 (questions GS1 to GS7), emotional well-being ranging from 0 to 24 (questions GE1 to GE6), functional well-being ranging from 0 to 28 (questions GF1 to GF7), breast cancer subscale ranging from 0 to 36 (questions B1 to B9) and arm subscale ranging from 0 to 20 (questions B3 and B10 to B13). The answers are presented on a five-point Likert scale. The score is calculated separately for each scale by adding up the points for each question. The values for some questions (GP1 to GP7, GE1, GE3 to GE6, B1 to B3, B5 to B8, B10 to B13) are inverted in the calculation of the final score. When there were any unanswered questions, the mean of the answered questions was considered for that scale. The results are added to obtain the final total score ranging from 0 to 164. The higher the score is, the better the patient's QoL 29,30 .
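The scoring rules above can be sketched as follows. The sketch assumes that each answer is coded from 0 to 4 (consistent with the five-point scale and the reported score ranges) and that a scale with unanswered questions is scored as the mean of the answered questions multiplied by the number of questions in the scale. Function and variable names are hypothetical.

```python
def item_ids(prefix, first, last):
    """Build question identifiers such as GP1..GP7."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


# Scale composition as listed in the text (B3 belongs to two subscales).
SCALES = {
    "physical well-being": item_ids("GP", 1, 7),
    "social/family well-being": item_ids("GS", 1, 7),
    "emotional well-being": item_ids("GE", 1, 6),
    "functional well-being": item_ids("GF", 1, 7),
    "breast cancer subscale": item_ids("B", 1, 9),
    "arm subscale": ["B3"] + item_ids("B", 10, 13),
}

# Questions whose values are inverted, as listed in the text.
REVERSED = (
    set(item_ids("GP", 1, 7))
    | {"GE1"}
    | set(item_ids("GE", 3, 6))
    | set(item_ids("B", 1, 3))
    | set(item_ids("B", 5, 8))
    | set(item_ids("B", 10, 13))
)


def scale_score(answers, scale):
    """Score one scale; answers maps question id -> response coded 0-4."""
    items = SCALES[scale]
    scored = [
        4 - answers[item] if item in REVERSED else answers[item]
        for item in items
        if item in answers
    ]
    if not scored:
        return None
    # Mean of the answered questions, scaled to the full number of questions.
    return sum(scored) / len(scored) * len(items)


def total_score(answers):
    """Sum of the six scale scores (0 to 164), or None if a scale is empty."""
    scores = [scale_score(answers, scale) for scale in SCALES]
    if any(score is None for score in scores):
        return None
    return sum(scores)
```

As a check on the ranges given in the text, the maximum scale scores add up to 28 + 28 + 24 + 28 + 36 + 20 = 164.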

Global Perceived Effect scale (GPE)
For this research, the GPE scale 32 was adapted to assess the patient's level of perception of recovery since the day of diagnosis with breast cancer. The guiding question was "Compared to when you received your diagnosis, how would you describe your quality of life these days?". It is an 11-point numerical scale (-5 to 5), with -5 being vastly worse; 0 being no change; and 5 being complete recovery. The higher the score is, the better the recovery from the condition 32 .

Pain Numerical Rating scale (PNR)
The five-point adapted PNR scale 33 was used to verify the patient's degree of understanding regarding the QoL questionnaires. The guiding question is: "Did you understand what was asked in the questionnaire?" The minimum value is 0, meaning "I did not understand anything", and the maximum value is 5, meaning "I understood perfectly and did not have any questions" 33 .

Procedures
The researcher collected the participants' sociodemographic and clinical data and administered the QoL questionnaires at baseline. After that, the participants were informed of the subsequent days when the questionnaires would be administered over the phone, i.e. 48 hours and 30 days after the first session. The 48-hour interval between the first and second session was established to avoid significant changes in the patient's QoL, thus allowing the evaluation of the test-retest reproducibility of the questionnaire. The 30-day interval between the first and third session was established to allow sufficient time for changes in QoL and thus test the responsiveness of the questionnaires 34 .

Statistical analysis
The assessments of the measurement properties, described in detail in Table 1, were conducted according to procedures recommended by Maher et al. 11 and Terwee et al. 34 .

Results
A total of 111 eligible women were invited to take part in the study: 5 women declined to answer the questionnaires and 106 women agreed to participate. Of the 106 participants, 99 responded to the second assessment session after 48 hours and 94 responded to the third assessment session after 30 days. These dropouts were caused by side effects of chemotherapy, pneumonia associated with hospital stay, low immunity, infection, necrosis of surgical wound, and second surgery. Table 2 shows the clinical and demographic characteristics, and Table 3 shows the scores for the QoL questionnaires applied in the three assessment sessions. The postoperative period ranged from 3 days to 4 years.
Regarding acceptance and preference for the questionnaire that best represented QoL, 53.8% of the participants chose the FACT-B+4 (Table 2). Regarding ease of comprehension of the questionnaires, the means were similar (Table 3).
In the assessment of the internal consistency, Cronbach's alpha for all of the instruments was adequate, with the exception of: the social functioning dimension of the SF-36; the social relationships domain of the WHOQOL-bref, with the highest value of Cronbach's alpha if item deleted reached when question 21 was deleted; and the emotional well-being scale and breast cancer subscale of the FACT-B+4, with no change when using Cronbach's alpha if item deleted (Table 4).
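As an illustration of the internal consistency analysis, the following minimal sketch computes Cronbach's alpha and Cronbach's alpha if item deleted for a small set of responses. The responses and function names are hypothetical and are not data from this study; only the adequacy range (0.70 to less than 0.95) comes from the criteria adopted here.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: list of item columns, each a list with one score per respondent."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # total per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_if_item_deleted(items):
    """Alpha recomputed after leaving out each item in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


def is_adequate(alpha):
    """Adequacy criterion adopted in this study: 0.70 <= alpha < 0.95."""
    return 0.70 <= alpha < 0.95


# Hypothetical responses of five respondents to a three-item domain.
item_a = [4, 5, 3, 4, 2]
item_b = [3, 4, 4, 3, 3]
item_c = [4, 4, 3, 5, 2]
domain = [item_a, item_b, item_c]

alpha_all = cronbach_alpha(domain)            # about 0.68
alpha_deleted = alpha_if_item_deleted(domain)  # about [-0.13, 0.89, 0.40]
```

In this made-up example, alpha rises from about 0.68 to about 0.89 when the second item is removed, which is the kind of pattern that the "alpha if item deleted" analysis is meant to reveal.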

Table 1. Measurement properties tested.

Internal consistency
The homogeneity of the items of the questionnaire was tested using Cronbach's alpha 11,35 and Cronbach's alpha if an item is deleted. The Cronbach's alpha values are considered adequate when equal to or greater than 0.70 and less than 0.95 11,35 .

Reproducibility
The term reproducibility incorporates two measurement properties: reliability and agreement. Reliability was tested using Type 2,1 Intraclass Correlation Coefficient (ICC 2,1 ) with 95% confidence intervals (CIs). An ICC of less than 0.40 represents poor reliability; between 0.40 and 0.75, moderate reliability; between 0.75 and 0.90, substantial reliability; and greater than 0.90, excellent reliability. Agreement was measured using the Standard Error of the Measurement (SEM) 36 and the Smallest Detectable Change (SDC) 11,35 . The SEM was calculated as the standard deviation of the differences between the two assessments divided by the square root of two. The SEM as a percentage of the total score of the questionnaire can be interpreted as follows: ≤5%: very good; >5% and ≤10%: good; >10% and ≤20%: doubtful; and >20%: negative 37 . The SDC was calculated using the formula SDC = 1.645 × √2 × SEM, with 90% CI, which reflects the smallest detectable change in an individual's score. Thus, values above the SDC describe a change in the individual's score that exceeds the error of the measurement 35 .

Construct validity
We correlated the domains with the most similarities, e.g. the SF-36 dimensions physical functioning, role-physical, role-emotional, and social functioning with the FACT-B+4 scales functional well-being, physical well-being, emotional well-being, and social/family well-being, respectively, and the WHOQOL-bref domains physical health, psychological, and social relationships with the FACT-B+4 scales physical well-being, emotional well-being, and social/family well-being, using Pearson's correlation test (r). When r<0.30, the correlation was considered weak; when r≥0.30 and <0.60, moderate; and when r≥0.60, good 36 . The generic quality of life questionnaires SF-36 and WHOQOL-bref were expected to have a positive correlation with the FACT-B+4 with r≥0.60, assuming that the constructs of the evaluated domains of the three questionnaires are similar.

Responsiveness
The analysis of the responsiveness was based on the participants who showed clinical changes, defined as a change of at least two points (negative or positive) in the GPE scale. The internal responsiveness was assessed by calculating the effect size (ES: mean of the difference between the initial assessment and the 30-day follow-up, divided by the standard deviation of the initial assessment) with 84% CI. We chose 84% CIs to allow a direct comparison of the ES of different instruments, since non-overlapping 84% CIs are approximately equivalent to a Z test at the 5% significance level 38,39 . A value for ES ≤0.20 represents a change of approximately 1/5 of the standard deviation at the beginning of treatment and is considered small. A value of 0.50 is considered moderate and a value ≥0.80 is considered large 40 . The external responsiveness was measured by two tests: 1) Pearson's Correlation test to determine the correlation between the initial and 30-day assessments of the dimensions of the SF-36 22

Floor and ceiling effects
These measurements were calculated as the percentage of patients who achieved the maximum score (ceiling) or the minimum score (floor). These effects are considered present when more than 15% of respondents reach the ceiling or floor scores, which has implications for the questionnaire's reproducibility and responsiveness 11,35 .

Considering reliability, the SF-36 had six dimensions with moderate reliability, the WHOQOL-bref had substantial reliability in all domains, and the FACT-B+4 had five scales with moderate reliability (Table 4). In most dimensions of the SF-36, agreement was classified between doubtful and negative; the WHOQOL-bref had good agreement in all domains; and the agreement levels of the FACT-B+4 varied from very good to doubtful (Table 4). Regarding the floor or ceiling effects, values above 15% were only found in three dimensions of the SF-36, with floor effect in the role-physical and role-emotional dimensions and ceiling effect in the role-emotional and social functioning dimensions (Table 4). In the assessment session after 30 days, 62 patients had changes <2 points and 32 patients had clinical changes ≥2 points in the GPE scale. The analysis of responsiveness considered the data from these 32 patients.
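The reliability coefficient named in Table 1, ICC 2,1 , can be sketched from a two-way analysis of variance of the test and retest scores (two-way random effects, absolute agreement, single measurement). The code below is a minimal illustration under that standard definition and is not the software used in this study. The function names are hypothetical, and the handling of values exactly at 0.75 is an assumption because the classification in Table 1 leaves that boundary open.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table with one row per participant, one column per session."""
    n = len(scores)      # participants
    k = len(scores[0])   # sessions (e.g. test and retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between participants
    ms_cols = ss_cols / (k - 1)                 # between sessions
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_reliability(icc):
    """Labels from Table 1; 0.75 itself is assigned to 'substantial' here."""
    if icc < 0.40:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "substantial"
    return "excellent"
```

Because this form of the ICC penalizes systematic differences between sessions, a retest that is uniformly one point higher than the test lowers the coefficient even when the ranking of participants is unchanged. This is one reason why ICC values are difficult to compare when the type of ICC is not reported.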
Regarding internal responsiveness, the SF-36 showed moderate ES in all dimensions except physical functioning and general health perceptions, which had small ES, and bodily pain, which had large ES. The WHOQOL-bref showed small ES in all domains except physical health, which had moderate ES. The FACT-B+4 showed moderate ES in all scales except social/family well-being, emotional well-being, and functional well-being, which had small ES. With 84% CI, there was no difference between similar domains, i.e. the CIs overlapped in all comparisons. For example, the role-physical dimension of the SF-36 presented ES=0.29 with 84% CI of 0.04 to 0.54, which overlapped the CI of the physical health domain of the WHOQOL-bref, with ES=0.53 and 84% CI of 0.24 to 0.80, and of the physical well-being scale of the FACT-B+4, with ES=0.33 and 84% CI of 0.02 to 0.63.
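The effect size and the comparison of confidence intervals can be illustrated with a minimal sketch. The effect size follows the definition used in this study (mean change divided by the standard deviation of the initial assessment), and the three intervals are the values reported above. The function names are hypothetical, and the sketch does not reproduce how the 84% CIs themselves were estimated.

```python
from itertools import combinations
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change from baseline divided by the baseline standard deviation."""
    differences = [after - before for before, after in zip(baseline, follow_up)]
    return mean(differences) / stdev(baseline)


def intervals_overlap(ci_a, ci_b):
    """True when two (lower, upper) confidence intervals share any value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# 84% CIs of the effect sizes reported in the text for three similar domains.
REPORTED_CI = {
    "SF-36 role-physical": (0.04, 0.54),
    "WHOQOL-bref physical health": (0.24, 0.80),
    "FACT-B+4 physical well-being": (0.02, 0.63),
}

# Every pair of intervals overlaps, so no instrument differs from another.
all_overlap = all(
    intervals_overlap(REPORTED_CI[a], REPORTED_CI[b])
    for a, b in combinations(REPORTED_CI, 2)
)
```

With these reported intervals, `all_overlap` evaluates to `True`, which matches the statement that no difference was found between similar domains.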
In the external responsiveness assessment using ROC curve analysis, all dimensions of the SF-36 were responsive, except for physical functioning, role-physical, and role-emotional. In the WHOQOL-bref, all domains had values above 0.70. The physical well-being, functional well-being, and total score scales of the FACT-B+4 were responsive. The Pearson correlation analysis showed a significant and moderate correlation in the dimensions bodily pain, general health perceptions, vitality, and mental health of the SF-36. The WHOQOL-bref showed significant good and moderate correlation for the domains psychological and social relationships, respectively. The FACT-B+4 showed a moderate correlation for the functional well-being and total score scales (Table 5).
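The area under the ROC curve (AUC) used in this kind of analysis can be sketched as the probability that a participant classified as changed has a larger change score than a participant classified as unchanged. The sketch assumes that the ROC analysis compares questionnaire change scores between these two groups; the change scores and function names are hypothetical, and only the 0.70 threshold comes from the text.

```python
def auc(changed, unchanged):
    """AUC as the share of (changed, unchanged) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    wins = 0.0
    for c in changed:
        for u in unchanged:
            if c > u:
                wins += 1.0
            elif c == u:
                wins += 0.5
    return wins / (len(changed) * len(unchanged))


def is_responsive(auc_value, threshold=0.70):
    """Responsiveness criterion used in the text: AUC above 0.70."""
    return auc_value > threshold


# Hypothetical change scores (30-day minus initial assessment).
changed_group = [12, 8, 15, 3]
unchanged_group = [2, 5, -1, 9, 0]
example_auc = auc(changed_group, unchanged_group)  # 17 of 20 pairs -> 0.85
```

An AUC of 0.50 means that the change scores do not distinguish the two groups better than chance, which is why values have to exceed 0.70 before a domain is considered responsive.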

Discussion
Most of the domains of the SF-36, WHOQOL-bref, and FACT-B+4 showed acceptable values for the measurement properties. All instruments showed good comprehension, represented by similar means. With regard to the questionnaire which best represented QoL, 53.8% of the participants chose the FACT-B+4, possibly because this instrument includes questions specific to breast cancer and upper limb limitations.

Table 3. Scores of quality of life questionnaires and scales used in the study in the three assessment sessions, in mean and standard deviation.

In our study, the SF-36 showed adequate Cronbach's alpha in all dimensions except social functioning. Similar studies with different samples were found in the literature. In a population of Chinese medical students, Cronbach's alpha ranged from 0.63 to 0.82, with the lowest value in the social functioning dimension. This result may be due to the fact that the items of this dimension are not sensitive to cultural variations and may need to be adapted to the characteristics of the target population 41 . In Chinese patients with chronic diseases, Cronbach's alpha ranged from 0.54 to 0.93, with the lowest values in the dimensions bodily pain (0.54) and social functioning (0.62) 42 . In contrast, in a study with a population of 50 healthy individuals and 80 patients with chronic disease, Cronbach's alpha ranged from 0.72 to 0.89 43 .
Moderate reliability was found in all dimensions of the SF-36 except role-emotional, which had poor reliability, meaning that its scores were not consistent between repeated assessments of the participants of this study. Other studies found in the literature show substantial to excellent reliability. In a population of Chinese patients with chronic disease, ICC values ranged from 0.83 to 0.96 42 . In a sample of 130 Arabic individuals, ICC ranged from 0.95 to 0.98 43 . However, both of these studies may have overestimated the results because they did not report the type of ICC used. That may be the reason why these studies found higher ICC values than those in our study.

Table 4. Values of internal consistency, reproducibility and floor or ceiling effects.

For agreement, the present study found high standard error of measurement (SEM) values (most of the dimensions showed values >10% and ≤20%) and smallest detectable change (SDC) ranging from 20.51 to 80.30, characterizing the SF-36 as having doubtful agreement. We found the presence of floor effect in the dimensions role-physical and role-emotional and the presence of ceiling effect in the dimensions role-emotional and social functioning. These specific dimensions were probably unable to detect change in the patients' health condition, with implications for reproducibility and responsiveness. For construct validity, analyzed by the combination of dimensions from the SF-36 and the FACT-B+4, the results indicated a significant correlation in all dimensions except the social functioning dimension of the SF-36. No studies were found that conducted a similar correlation between these two questionnaires.
The assessment of the internal responsiveness showed that responsiveness ranged from small to large. Considering external responsiveness, the SF-36 was characterized as a responsive instrument. Furthermore, a significant correlation was found between the dimensions that had AUC values above 0.70. The SF-36 showed at least one dimension with inadequate values in all measurement properties tested. This result implies that the SF-36 should not be used to evaluate QoL in patients with breast cancer.
The WHOQOL-bref presented adequate internal consistency in most of the domains, except for the social relationships domain. No studies were found on assessment of the measurement properties of the WHOQOL-bref in patients with breast cancer. In other populations, studies that tested the internal consistency of the WHOQOL-bref found similar values 25,28,[44][45][46] . One study in which the internal consistency of the WHOQOL-bref was compared to that of the WHOQOL-100 found a higher Cronbach's alpha for the WHOQOL-100. Thus, the low value of the abbreviated questionnaire can be explained by the low number of questions in the social relationships domain, given that Cronbach's alpha is dependent on the number of items of a scale 25,34 .
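The dependence of Cronbach's alpha on the number of items can be illustrated with the standardized form of the coefficient, in which alpha is a function of the number of items and of the mean correlation between items. The mean correlation of 0.40 and the item counts other than three are hypothetical values chosen for illustration; only the three-item size of the social relationships domain comes from this study.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha for n_items with a given mean correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


# Same hypothetical mean inter-item correlation, increasing number of items.
MEAN_R = 0.40
alpha_by_length = {k: standardized_alpha(k, MEAN_R) for k in (3, 6, 12)}
# 3 items -> about 0.67, 6 items -> 0.80, 12 items -> about 0.89
```

Under this assumption, a three-item domain falls below the 0.70 criterion even though its items correlate as strongly as those of a longer scale that would be judged adequate.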
Reliability was substantial in all domains of the WHOQOL-bref. These results are similar to those of one study 28 , in which the values varied from substantial to excellent. However, this study 28 does not report the type of ICC used. For the agreement, good SEM values were found and an SDC of 13.43 to 22.01, characterizing the WHOQOL-bref as having good agreement.
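The relation between the SEM, the SDC, and the agreement classification can be checked with a short worked example. The formulas are those adopted in this study (SEM as the standard deviation of the test-retest differences divided by √2, and SDC = 1.645 × √2 × SEM). The function names are hypothetical, and the sketch assumes that the SEM percentage is taken over the 0 to 100 range of the WHOQOL-bref domain scores.

```python
from math import sqrt
from statistics import stdev

Z_90 = 1.645  # multiplier for the 90% confidence level used in this study


def sem_from_test_retest(test, retest):
    """SEM as the standard deviation of the differences divided by sqrt(2)."""
    differences = [second - first for first, second in zip(test, retest)]
    return stdev(differences) / sqrt(2)


def sdc90(sem):
    """Smallest detectable change at the individual level, 90% confidence."""
    return Z_90 * sqrt(2) * sem


def sem_from_sdc90(sdc):
    """Invert the SDC formula to recover the SEM."""
    return sdc / (Z_90 * sqrt(2))


def classify_agreement(sem, scale_range):
    """Agreement labels based on the SEM as a percentage of the score range."""
    percent = 100 * sem / scale_range
    if percent <= 5:
        return "very good"
    if percent <= 10:
        return "good"
    if percent <= 20:
        return "doubtful"
    return "negative"


# SDC limits reported above for the WHOQOL-bref domains (0-100 scale assumed).
sem_low = sem_from_sdc90(13.43)   # about 5.8
sem_high = sem_from_sdc90(22.01)  # about 9.5
```

Both recovered SEM values lie between 5% and 10% of the assumed score range, which is consistent with the classification of the WHOQOL-bref as having good agreement.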
There were no floor or ceiling effects. The construct validity presented a good correlation. No study was found that conducted a similar correlation between the two questionnaires. Internal responsiveness was small in most of the domains. A study with smokers also found small responsiveness for all domains except the psychological domain 44 . The assessment of the external responsiveness by the AUC showed responsiveness in all domains. However, only the psychological and social relationships domains showed significant correlation. Based on this analysis, the WHOQOL-bref can be used to assess QoL in patients with breast cancer, given that the measurement properties were adequate and the instrument was able to detect clinical changes over time.
The FACT-B+4 showed adequate values for internal consistency, with the exception of the emotional well-being scale and the breast cancer subscale. Other studies found lower internal consistency values for the same scales, suggesting that there is no homogeneity in these scales. For example, in the original validation study of the arm subscale of the FACT-B+4, the internal consistency ranged from 0.62 to 0.83 30 ; in a sample of breast cancer patients before surgery with upper limb lymphedema, the internal consistency varied from 0.52 to 0.92 47 .
For reliability, most scales showed moderate reliability. Conflicting results were found in a sample of patients with lymphedema, with reliability ranging from 0.40 to 0.88 47 , and the study did not report the type of ICC used. The agreement values for the scales of the FACT-B+4 were characterized between good and doubtful. For the FACT-B+4 total score, a very good agreement was observed. Floor or ceiling effects were not observed. In contrast, another study on women with breast cancer showed ceiling effects in the physical well-being and social/family well-being scales and the arm subscale 47 . For construct validity, the FACT-B+4 presented better correlation with the WHOQOL-bref, with good correlation between all scales.
The assessment of internal responsiveness showed small to moderate responsiveness. External responsiveness, based on the analysis of the AUC, was only found for the physical well-being, functional well-being, and total score scales. The correlation analysis showed moderate correlation for the functional well-being scale and total score.
The measurement of QoL is important to understand how functional impairment interferes in the daily activities of women undergoing treatment for breast cancer. Considering that the assessment of QoL is multidimensional 48,49 , with different meanings depending on the variety of life contexts, maintenance of functional capacity, general satisfaction, personal fulfillment, and social interaction 48,49 , physical therapists should investigate QoL with the goal of improving the treatment and monitoring the evolution of the clinical condition, which contributes to prevention interventions or treatment directions 6,50 .
This study has some limitations. The inclusion criteria included the largest possible number of women with breast cancer regardless of their phase of treatment. The wide variety in the type of surgery and time since surgery may have been a limitation because a more homogeneous sample in regard to treatment phase or surgery type could have resulted in similar changes in QoL. However, the current sample was based on previous studies 51, 52 . Another limitation was the 30-day interval for the responsiveness assessment. Perhaps if this follow-up time had been longer, greater clinical changes could have occurred and better results could have been found.
Most of the measurement properties tested for the WHOQOL-bref and FACT-B+4 were adequate as was their ability to assess QoL in women with breast cancer. The domains of WHOQOL-bref and FACT-B+4 are interconnected in the measurement of QoL in the studied population. The SF-36 showed inadequacy in agreement and floor and ceiling effects and should not be used to assess QoL in women with breast cancer.