Methods for estimating prevalence ratios in cross-sectional studies

OBJECTIVE: To empirically compare Cox, log-binomial, Poisson and logistic regression as means of estimating prevalence ratios (PRs) in cross-sectional studies.
METHODS: Data from a population-based cross-sectional epidemiological study (n = 2,072) on elderly people in São Paulo (Southeastern Brazil), conducted between May 2003 and April 2005, were used. Diagnoses of dementia, possible cases of common mental disorders and self-rated poor health were chosen as outcomes with low, intermediate and high prevalence, respectively. Confounding variables with two or more categories or continuous values were used. Reference values for point and interval estimates of the PR were obtained by means of the Mantel-Haenszel stratification method. Adjusted PR estimates were calculated using Cox and Poisson regressions with robust variance, and using log-binomial regression. Crude and adjusted odds ratios (ORs) were obtained using logistic regression.
RESULTS: The point and interval estimates obtained using Cox and Poisson regressions were very similar to those obtained using Mantel-Haenszel stratification, independent of the outcome prevalence and the covariates in the model. The log-binomial model presented convergence difficulties when the outcome had high prevalence and there was a continuous covariate in the model. Logistic regression produced point and interval estimates that were higher than those obtained using the other methods, particularly for outcomes with high initial prevalence. If interpreted as PR estimates, the ORs would overestimate the associations for outcomes with low, intermediate and high prevalence by 13%, by almost 100% and fourfold, respectively.
CONCLUSIONS: In analyses of data from cross-sectional studies, the Cox and Poisson models with robust variance are better alternatives than logistic regression. The log-binomial regression model produces unbiased PR estimates, but may present convergence difficulties when the outcome is very prevalent and the confounding variable is continuous.
DESCRIPTORS: Cross-Sectional Studies. Estimation Techniques. Prevalence Ratio. Logistic Models. Comparative Study.


INTRODUCTION
In cross-sectional studies with binary outcomes, the association between exposure and outcome is estimated by means of prevalence ratios (PRs). When adjustments for potential confounders are needed, logistic regression models are commonly used. This type of model yields estimates of odds ratios (ORs), and ORs are frequently reported as if they were PR estimates. However, ORs approximate PRs poorly when the initial risk is high, and in these situations, interpreting ORs as if they were PRs may be inadequate.¹,²,⁹,¹² Some alternative statistical models that may directly estimate PRs and their confidence intervals have been discussed in the literature.¹,⁴,⁶,¹⁰,¹²,¹⁴ Cox, log-binomial and Poisson regression models have been suggested as good alternatives for obtaining PR estimates adjusted for confounding variables. Using data adapted from a cross-sectional study, Barros & Hirakata¹ (2003) showed that these models yield adjusted PR estimates that are very similar to those obtained by means of the Mantel-Haenszel (MH) method.
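The divergence between ORs and PRs at high baseline prevalence follows directly from the definitions of the two measures. The sketch below is an illustration only: it assumes a crude 2x2 setting with no confounders, and the function name, the PR of 1.5 and the baseline prevalences are hypothetical values, not data from this study.

```python
def odds_ratio_from_pr(pr: float, p0: float) -> float:
    """Odds ratio implied by a prevalence ratio `pr` and a prevalence `p0`
    in the unexposed group (crude 2x2 setting, no confounders)."""
    p1 = pr * p0  # prevalence among the exposed
    if not 0.0 < p0 < 1.0 or not 0.0 < p1 < 1.0:
        raise ValueError("both prevalences must lie strictly between 0 and 1")
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


if __name__ == "__main__":
    pr = 1.5  # hypothetical prevalence ratio, for illustration only
    for p0 in (0.05, 0.30, 0.50):  # hypothetical baseline prevalences
        print(f"p0={p0:.2f}  PR={pr:.2f}  OR={odds_ratio_from_pr(pr, p0):.2f}")
```

With the PR held fixed, the implied OR moves further from it as the prevalence in the unexposed group rises, which is the behaviour described in the paragraph above.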
The aim of the present study was to empirically compare the Cox, log-binomial, Poisson and logistic regression models with regard to estimating adjusted PRs, comparing their results with those obtained using the MH method.

METHODS
The data used came from a population-based cross-sectional study that had the aim of estimating the prevalence of dementia and other mental health problems among elderly people (aged 65 years or older) who were living in an economically deprived area of the district of Butantã, in the city of São Paulo (SP), between May 2003 and April 2005.⁸ Standardized procedures were used to assess cognitive functioning and psychiatric symptoms. Information on sociodemographic and socioeconomic characteristics was obtained. A total of 2,072 participants were included in the study.
Three outcomes were chosen: diagnoses of dementia, possible cases of common mental disorders (CMD) and self-rated poor health. Diagnoses of dementia were obtained by means of a procedure developed by the 10/66 Dementia Research Group, for use in population-based studies in developing countries, with a detailed assessment of the onset and course of dementia.⁷ Individuals were classified as possible cases of CMD by means of the Self-Report Questionnaire (SRQ-20), a questionnaire developed by the World Health Organization for studies in developing countries.¹⁰ The cutoff point used was 4/5, in accordance with the validation of the Brazilian version of the SRQ-20.⁹ Self-rated health was assessed using a single question ("On the whole, how would you classify your health over the last 30 days?"), with the following answer options: "very good", "good", "regular", "poor" and "very poor". These were then pooled, in order to classify participants as having self-rated good health ("very good" and "good") or self-rated poor health ("regular", "poor" and "very poor"). The three outcomes were chosen based on their prevalence (low for dementia, intermediate for CMD and high for self-rated poor health). Each outcome was associated with one main exposure and two potential confounding factors. For the outcomes of dementia and CMD, the main exposure was educational level and the confounding variables were age and gender. For self-rated poor health, the main exposure was the presence of depressive episodes, diagnosed in accordance with the ICD-10 criteria for depression, and the confounding variables were income and gender.
In relation to previous studies, we extended the application of these methods to situations with two confounding variables (some with more than two levels of exposure or measured as continuous values) in order to verify the point and interval PR estimates generated by each multivariate model. Outcomes of different frequencies were analyzed, in order to examine how the Cox, log-binomial, Poisson and logistic models behave in estimating PRs as the prevalence of the outcome increases.
Reference values for the adjusted PR estimates and respective 95% confidence intervals (95% CI), for the associations between each outcome and the respective main exposure, were obtained by means of the Mantel-Haenszel stratification, while controlling for the effects of the potential confounders. PR estimates with the respective 95% CI were then calculated using the Cox, log-binomial and Poisson regression models, and crude and adjusted ORs (with 95% CI) were also calculated using logistic regression. Next, for each outcome of interest, one confounding variable was tested as a continuous measurement. The Cox and Poisson regressions were performed by setting the follow-up time as one for all participants and using robust variance estimators. The statistical software used for this study was Stata version 9.0.
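As an illustration of the reference method, the Mantel-Haenszel pooled prevalence ratio can be computed from stratum-level counts. This is a minimal sketch rather than the Stata procedure used in the study: the function name and the stratum layout are assumptions made here, only the point estimate is shown (the confidence interval is omitted), and the counts in the usage example are hypothetical.

```python
from typing import Iterable, Tuple

# One stratum = (exposed cases, exposed total, unexposed cases, unexposed total)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_pr(strata: Iterable[Stratum]) -> float:
    """Mantel-Haenszel pooled prevalence ratio across confounder strata."""
    num = 0.0
    den = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        if total == 0:
            continue  # an empty stratum contributes nothing
        num += a * n0 / total  # exposed cases weighted by unexposed size
        den += c * n1 / total  # unexposed cases weighted by exposed size
    if den == 0:
        raise ValueError("no unexposed cases: the pooled ratio is undefined")
    return num / den


if __name__ == "__main__":
    # Hypothetical counts for two strata of a confounder (illustration only).
    strata = [(20, 100, 10, 100), (30, 50, 15, 50)]
    print(f"MH prevalence ratio: {mantel_haenszel_pr(strata):.2f}")
```

When every stratum has the same stratum-specific ratio, as in the hypothetical counts above, the pooled estimate equals that common ratio.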
The Poisson regression model is generally used in epidemiology to analyze longitudinal studies in which the response is the number of episodes of an event occurring over a given time. For cohort studies in which all individuals have equal follow-up time, the Poisson regression can be used with a time-at-risk value of one for each individual. If the model adequately fits the data, this approximation provides a correct estimate of the adjusted relative risk.⁴ In cross-sectional studies, a value of one can be attributed to each participant's follow-up time, as a strategy to obtain PR point estimates, since there is no real follow-up for the participants in this type of epidemiological study. However, when the Poisson regression is applied to binomial data, the error for the estimated relative risk is overestimated, because the variance of the Poisson distribution increases progressively with the mean, while the variance of the binomial distribution has a maximum value when the prevalence is 0.5. This problem can be corrected by using a robust variance procedure, as proposed by Lin & Wei (1989).³ The Poisson regression with robust variance does not have any convergence difficulty, and it produces results that are very similar to those obtained using the MH procedure, when the covariate of interest is categorical.⁶,¹⁴
The Cox regression model is usually used to analyze time-to-event data. In cross-sectional studies, no time periods are observed, but if a constant risk period is assigned to all the individuals in the study, the hazard ratio estimated using Cox regression equals the PR, in the same way as with the Poisson regression. However, the use of Cox regression without any adjustment for analyzing cross-sectional studies can also lead to errors in estimating confidence intervals, which may then be wider than they should be. The robust variance method may also be used in such situations.³
The log-binomial regression model is a generalized linear model in which the link function is the logarithm of the proportion under study and the distribution of the error is binomial. It directly models the prevalence ratio for dichotomous variables. However, the estimation of the parameters may fail to converge. Normally, this problem is due to Newton's method, which is used to find a minimum or maximum of the likelihood function. This method may be unable to find a maximum likelihood estimate when the solution is at the boundary of the restricted interval for the parameter. Peterson & Deddens⁶ (2003) suggested the COPY method (a macro for the SAS software), which may provide approximate estimates and standard errors when the Proc Genmod command (generally used in SAS for binomial distribution with a logarithmic link function) fails to converge.
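The restricted parameter space mentioned above can be made concrete: under a log link the fitted prevalence is exp(intercept + slope × x), which must not exceed one for any observed covariate value. The sketch below only checks that constraint for given coefficients. It does not fit a model and is not the COPY method; the function names, the slope and the covariate range are hypothetical choices made for illustration.

```python
import math
from typing import Iterable, List


def fitted_prevalences(intercept: float, slope: float, xs: Iterable[float]) -> List[float]:
    """Prevalences implied by a log-link model with one continuous covariate."""
    return [math.exp(intercept + slope * x) for x in xs]


def is_admissible(intercept: float, slope: float, xs: Iterable[float]) -> bool:
    """True if every fitted prevalence is a valid probability (at most 1)."""
    return all(p <= 1.0 for p in fitted_prevalences(intercept, slope, xs))


if __name__ == "__main__":
    xs = range(0, 31)          # hypothetical covariate, e.g. years above a reference age
    slope = math.log(1.03)     # hypothetical 3% relative increase per unit
    for baseline in (0.05, 0.54):  # hypothetical low and high baseline prevalence
        ok = is_admissible(math.log(baseline), slope, xs)
        print(f"baseline={baseline:.2f}  admissible={ok}")
```

With a low baseline prevalence the same slope stays well inside the admissible region, whereas with a high baseline it pushes fitted values above one, so the likelihood maximum tends to sit at or near the boundary where Newton-type iterations can fail.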
Logistic regression has been widely used in epidemiological studies with binary outcomes to obtain unbiased OR estimates adjusted for one or more confounding variables. It is possible to calculate the PR from the OR estimate, with 95% CI, but the calculations are complex and require computing software to calculate variance estimates using matrix modules.⁵

RESULTS
The outcome of dementia (low prevalence: 5.1%) showed statistically significant associations with educational level and age group, but not with gender (Table 1). The risk factor of educational level also showed statistically significant associations with age group (p < 0.01) and gender (p < 0.01). There was some confounding relative to age group in estimating the association between educational level and prevalence of dementia, as shown by the MH stratification (Table 2). Comparing the results from the different models consisting of the main exposure and one confounding variable with four exposure levels, the point estimates and respective 95% CI obtained using the Poisson, Cox and log-binomial models were very close to what was obtained using MH stratification (Table 2), with differences of one or two hundredths (second decimal place).
When an extra potential confounding variable (gender) was added, the Cox, Poisson and log-binomial models produced point estimates two or three hundredths (second decimal place) lower than the MH point estimate, and the 95% CI was narrower than the MH confidence interval. Entering age as a continuous variable produced further adjustment for confounding, and the estimate for the association between educational level and dementia was no longer statistically significant. Logistic regression produced a point estimate approximately 13% higher, with a wider 95% CI than what was obtained with the other regression models, in all situations.
The outcome of CMD (intermediate prevalence: 37.8%) showed statistically significant associations with educational level, gender and age group (Table 1). There was some confounding due to gender and age group in estimating the association between educational level and risk of CMD, as shown by the MH stratification (Table 3). Comparing the results from the different models, both in the situation consisting of the main exposure and one confounding variable (gender) with two exposure levels, and when adding an extra potential confounding variable (age group) with four exposure levels, the point estimates and respective 95% CI obtained using the Poisson, Cox and log-binomial models were identical to those obtained using MH stratification (Table 3). When age was taken as a continuous variable, the Cox, Poisson and log-binomial models produced almost identical point estimates and respective 95% CI. Logistic regression produced point estimates almost 100% higher than those obtained using the other regression models, with wider 95% CI.
The outcome of self-rated poor health (high prevalence: 53.8%) showed statistically significant associations with depressive episodes, gender and income (Table 1). The main exposure variable ("depressive episode") was also associated with income (p = 0.04). There was almost no confounding due to income or gender in estimating the association between depressive episodes and self-rated poor health, as shown by the MH stratification (Table 4). When the results from the different models in the situation consisting of the main exposure and one confounding variable (income) with four exposure levels were compared, or when an extra potential confounding variable (gender) was added to each model, the point estimates and respective 95% CI obtained using the Poisson and Cox models were identical to those obtained using MH stratification. The point estimates obtained using the log-binomial model were closer to one than were those yielded by the other two models. When income was taken as a continuous variable, the results from the Cox and Poisson models were similar. However, it was difficult to reach convergence using log-binomial regression. Logistic regression produced point estimates that, if interpreted as PR estimates, would be more than four times greater than those obtained using the other regression models, and the 95% CI was wider.

DISCUSSION
A previous study¹ had shown that in cross-sectional studies, the Cox and Poisson regression models with robust variance and the log-binomial regression model generate adequate estimates for prevalence ratios and their confidence intervals, regardless of the base prevalence. In a recent study on this question, Peterson & Deddens⁶ (2008) argued, on the basis of real and simulated data, that the Poisson regression gave better PR estimates than the log-binomial regression model for very frequent outcomes. However, these authors suggested that log-binomial regression would be the best method for intermediate prevalence. We explored the performance of these methods in relation to different prevalences of outcomes of interest, more than one confounding variable and continuous covariates. We showed that the three methods generated correct point and interval estimates in all situations, although the log-binomial models presented convergence difficulty in situations of very prevalent outcomes and continuous covariates. For the three outcomes investigated, the Cox and Poisson regression models presented identical PR estimates and 95% CI estimates, and they were very similar to those obtained using our reference (MH stratification). The use of robust methods for variance estimation in the Cox and Poisson models corrected for the overestimation of the variance and produced adequate confidence intervals. The Cox and Poisson models also behaved well in relation to continuous covariates.
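The effect of the robust variance correction can be seen in the simplest case of one binary exposure and no covariates, where both variances have closed forms: the model-based Poisson variance of log(PR) is 1/a + 1/c, and the robust (sandwich) variance reduces to 1/a − 1/n1 + 1/c − 1/n0, with a and c the numbers of cases among the n1 exposed and n0 unexposed. The sketch below relies on that standard result, which is not taken from this study; the function names and the counts are hypothetical.

```python
import math
from typing import Tuple


def log_pr_standard_errors(a: int, n1: int, c: int, n0: int) -> Tuple[float, float]:
    """Return (model_based_se, robust_se) of log(PR) for a 2x2 table fitted
    by Poisson regression with a single binary exposure."""
    if a <= 0 or c <= 0 or a > n1 or c > n0:
        raise ValueError("need 0 < a <= n1 and 0 < c <= n0")
    model_based = math.sqrt(1 / a + 1 / c)               # Poisson variance
    robust = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)  # sandwich variance
    return model_based, robust


def wald_interval(pr: float, se: float, z: float = 1.96) -> Tuple[float, float]:
    """Wald 95% interval for a ratio, computed on the log scale."""
    return pr * math.exp(-z * se), pr * math.exp(z * se)


if __name__ == "__main__":
    a, n1, c, n0 = 60, 100, 40, 100  # hypothetical counts, illustration only
    pr = (a / n1) / (c / n0)
    for label, se in zip(("model-based", "robust"), log_pr_standard_errors(a, n1, c, n0)):
        lo, hi = wald_interval(pr, se)
        print(f"{label:12s} SE={se:.3f}  95% CI=({lo:.2f}, {hi:.2f})")
```

The gap between the two standard errors grows with prevalence, since the subtracted terms 1/n1 and 1/n0 become large relative to 1/a and 1/c when most participants are cases.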
The log-binomial regression models also behaved well in most of the situations tested, yielding point and interval estimates that were close to those obtained using MH stratification. However, when the prevalence of the outcome was high, the log-binomial model produced estimates closer to one than were those obtained using MH stratification or using Cox and Poisson regression. Moreover, when one of the covariates was continuous, the log-binomial model presented convergence difficulties, as previously described.¹,⁶
The OR estimates obtained using the logistic regression models were close to the PR estimates when the outcome prevalence was low (dementia), although even then there was a tendency for the OR to be higher than the PR. In the situation of intermediate prevalence (CMD), the excess risk indicated by the OR was almost twice that indicated by the PR. In other words, if the OR were interpreted as a PR, it would seem that the risk of CMD for individuals with lower educational level was 23% higher than the risk for those with better educational level, instead of 12% higher, as shown by the PR. The ORs obtained when the prevalence was high (self-rated poor health) were four times higher than the PR estimates obtained using MH stratification or using the Cox, Poisson and log-binomial models. This shows the inappropriateness of interpreting OR estimates as if they were PR estimates in these situations.

Table 1. Prevalence of self-rated poor health, dementia and common mental disorders according to the main exposure and confounding variables. São Paulo, Southeastern Brazil, May 2003 to April 2005.

Table 2. Prevalence ratio estimates and 95% confidence intervals (95% CI) for the association between educational level and dementia, controlling for age group, age group and gender, and age and gender, using Mantel-Haenszel stratification, Cox, Poisson, log-binomial and logistic regression models. São Paulo, Southeastern Brazil, May 2003 to April 2005.

Table 3. Prevalence ratio estimates and 95% confidence intervals (95% CI) for the association between educational level and common mental disorders (CMD), controlling for gender, gender and age group, and gender and age, using MH stratification, Cox, Poisson, log-binomial and logistic regression models. São Paulo, Southeastern Brazil, May 2003 to April 2005.

Table 4. Prevalence ratio estimates and 95% confidence intervals (95% CI) for the association between depressive episodes and self-rated poor health, controlling for income level, income level and gender, and income and gender, using MH stratification, Cox, Poisson, log-binomial and logistic regression models. São Paulo, Southeastern Brazil, May 2003 to April 2005.