Introduction

Over the past few decades, several authors ^{1,2,3,4,5,6,7} have tried to determine the most appropriate association measure to be used in cross-sectional studies. The consensus is that the prevalence odds ratio (POR) is only a good approximation of the prevalence ratio (PR) when the event of interest is rare ^{8}. Logistic regression is the most popular statistical model used in estimating POR due to ease of interpretation and computational implementation. However, when the choice of association measure is the PR, and the event of interest is not rare, this model produces poor estimates. In such cases, several authors have proposed alternatives to logistic regression models to estimate the true PR.
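The rare-event condition can be illustrated numerically. The following is a minimal sketch in Python; the prevalences are hypothetical and chosen so that the PR equals 2 in both scenarios, and the function names are introduced here for illustration only.

```python
def prevalence_ratio(p_exposed, p_unexposed):
    """Ratio of the prevalence in the exposed to that in the unexposed."""
    return p_exposed / p_unexposed


def prevalence_odds_ratio(p_exposed, p_unexposed):
    """Ratio of the prevalence odds, p / (1 - p), in the two groups."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical prevalences (exposed, unexposed); the PR is 2 in both cases.
scenarios = {"rare outcome": (0.02, 0.01), "common outcome": (0.40, 0.20)}

for label, (p1, p0) in scenarios.items():
    pr = prevalence_ratio(p1, p0)
    por = prevalence_odds_ratio(p1, p0)
    print(f"{label}: PR = {pr:.3f}, POR = {por:.3f}")
```

With the rare outcome the POR is about 2.02, close to the PR of 2; with the common outcome it is about 2.67.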

Lee & Chia ^{9} were the first authors to use Cox models with Breslow's modification (Breslow-Cox model) to estimate prevalence ratios, but in that study, standard errors and, consequently, confidence intervals, were not calculated correctly. A correction for the standard errors obtained by Cox models had already been proposed ^{10}, but it was not considered. Barros & Hirakata ^{5} used the fact that Breslow-Cox and Poisson models estimate the same effects ^{11} and used Poisson regression models with robust variance to estimate the PR. Zou ^{12} published a simulation study demonstrating the reliability of the Poisson model with robust variance to estimate PR in 2 by 2 tables. The main issue with using a Poisson model to estimate PR is the misuse of a specific counting probability distribution to describe a response variable that is dichotomous (presence or absence of an outcome).

Skov et al. ^{4} used a generalized linear model with a binomial distribution and a log link (log-binomial model) to directly estimate PR ^{13}. Although this model allows for directly estimating the PR and assumes a probability distribution that agrees with the type of the response variable, the lack of convergence in the presence of continuous variables remains a problem. To solve this issue, Deddens et al. ^{6} introduced the COPY method for finding an approximation to the maximum likelihood estimate (MLE) when the log-binomial model fails to converge. Also because of the convergence problem of the log-binomial model, Schouten et al. ^{14} proposed a simple data manipulation that allows for the use of logistic regression to obtain the PR. It consists of modifying the data set by duplicating the records in which the event occurs and recoding the outcome in the duplicated records from event to non-event ^{14,15,16}.
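Why the duplication trick works can be seen on a single 2 by 2 table. The sketch below uses hypothetical counts and helper names introduced here for illustration; it only shows that the odds ratio of the expanded table equals the prevalence ratio of the original table, and does not address the standard errors, which need separate treatment.

```python
def odds_ratio(events_exp, nonevents_exp, events_unexp, nonevents_unexp):
    """Odds ratio of a 2 by 2 table given as event / non-event counts."""
    return (events_exp / nonevents_exp) / (events_unexp / nonevents_unexp)


def expand_counts(events, nonevents):
    """Duplicate every event record and recode the copy as a non-event."""
    return events, nonevents + events


# Hypothetical counts: 40 of 100 exposed and 20 of 100 unexposed have the event.
a, b = 40, 60  # exposed: events, non-events
c, d = 20, 80  # unexposed: events, non-events

pr_original = (a / (a + b)) / (c / (c + d))
or_original = odds_ratio(a, b, c, d)

a2, b2 = expand_counts(a, b)
c2, d2 = expand_counts(c, d)
or_expanded = odds_ratio(a2, b2, c2, d2)

print(f"PR (original data)  = {pr_original:.3f}")
print(f"POR (original data) = {or_original:.3f}")
print(f"POR (expanded data) = {or_expanded:.3f}")
```

In the expanded table the "odds" in each group are events divided by the original group size, which is the prevalence, so the odds ratio reproduces the prevalence ratio.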

Another approach - proposed by Wilcosky & Chambless ^{17}, using the conditional and marginal methods ^{18} - involves a direct adjustment of epidemiological measures through binary regression. An advantage of these methods is that they assume a probability distribution for a variable with a binomial response, matching the nature of the observed response variable in cross-sectional studies. We found one article ^{19} that uses the Wilcosky & Chambless ^{17} method to estimate PR, but it did not mention the software implementation. Recently, R Core Team developed a software package in R (R package version 1.2; The R Foundation for Statistical Computing, Vienna, Austria; http://www.r-project.org) for estimating marginal and conditional PRs and confidence intervals via bootstrap and delta methods, but they have yet to publish a scientific article explaining the details of this package and the differences between the methods it utilizes to estimate PR.

In this article, we use a direct approach to estimate the prevalence ratio from binary regression models based on methods proposed by Wilcosky & Chambless ^{17}, and we compare the results to those obtained through different methods presented in the literature. Three different data sets are used to illustrate our study.

Methods

Based on the approach proposed by Wilcosky & Chambless ^{17}, we use real and simulated data to compare PR estimates obtained by the marginal and conditional methods. Those estimates are also compared with the estimates obtained by the binomial, log-binomial and robust Poisson/Cox models.

Using a logistic model, it is straightforward to estimate the probability of occurrence of a disease (called prevalence in cross-sectional studies) adjusted for two or more variables. Suppose, for example, that one has information about the diabetes status (1: yes/0: no), age (continuous) and obesity status (1: yes/0: no) of a defined population. With this information, one can obtain the adjusted probability of diabetes through the following equation:

$$\log\left(\frac{P}{1-P}\right) = \beta_{0} + \beta_{1}\,\mathrm{AGE} + \beta_{2}\,\mathrm{OBESITY} \qquad (1)$$

where P is the probability that DIABETES = 1 and β_{0}, β_{1}, β_{2} are regression coefficients estimated from the data. Note that exp(β_{2}) estimates the odds ratio for diabetes in obese individuals compared to non-obese individuals, adjusted by age. However, if we are interested in obtaining the estimated PR for diabetes, adjusted by age, in obese and non-obese individuals, we can proceed in two ways, as described below:

1) Marginal method: in each stratum of the variable OBESITY (yes or no), the diabetes prevalence is calculated for each age value included in the dataset using Equation 1. The PR is the ratio of the average prevalences in the two strata. Wilcosky & Chambless ^{17} refer to this estimate as the marginal prevalence ratio (MPR);

2) Conditional method: in each stratum of the variable OBESITY, the diabetes prevalence is calculated using Equation 1, with age set to its average value in the dataset. The PR is the ratio of these two prevalences. Wilcosky & Chambless ^{17} refer to this estimate as the conditional prevalence ratio (CPR).
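The two methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R function: it assumes Equation 1 is the logistic model logit(P) = β_{0} + β_{1}·AGE + β_{2}·OBESITY, it takes the marginal method to average over all ages observed in the dataset, it fits the model with a hand-written Newton-Raphson routine, and it uses simulated data with arbitrary coefficients.

```python
import numpy as np


def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))


def fit_logistic(design, y, n_iter=25):
    """Maximum-likelihood logistic regression by Newton-Raphson."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = expit(design @ beta)
        score = design.T @ (y - p)
        information = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(information, score)
    return beta


def marginal_pr(beta, age):
    """MPR: average the predicted prevalences over all observed ages."""
    p_obese = expit(beta[0] + beta[1] * age + beta[2])
    p_non_obese = expit(beta[0] + beta[1] * age)
    return p_obese.mean() / p_non_obese.mean()


def conditional_pr(beta, age):
    """CPR: predict both prevalences at the average age."""
    mean_age = age.mean()
    p_obese = expit(beta[0] + beta[1] * mean_age + beta[2])
    p_non_obese = expit(beta[0] + beta[1] * mean_age)
    return p_obese / p_non_obese


# Simulated data; the seed, sample size and coefficients are arbitrary.
rng = np.random.default_rng(1)
n = 2000
age = rng.normal(50.0, 10.0, size=n)
obesity = rng.binomial(1, 0.4, size=n)
true_beta = np.array([-4.0, 0.05, 0.8])  # intercept, AGE, OBESITY
design = np.column_stack([np.ones(n), age, obesity])
y = rng.binomial(1, expit(design @ true_beta))

beta_hat = fit_logistic(design, y)
print("POR:", np.exp(beta_hat[2]))
print("MPR:", marginal_pr(beta_hat, age))
print("CPR:", conditional_pr(beta_hat, age))
```

Conditioning on another covariate profile only requires replacing `mean_age` with the value of interest.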

In the linear regression model, both methods estimate the same value. However, in the logistic regression model, we observed significant differences between the estimates of the two methods when p is close to zero or one.
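A short derivation may make the first statement concrete. It assumes a linear probability model with the same covariates, and writes a_i for the n observed ages and \bar{a} for their mean (notation introduced here for illustration):

$$
\mathrm{MPR} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\beta_{0} + \beta_{1} a_{i} + \beta_{2}\right)}{\frac{1}{n}\sum_{i=1}^{n}\left(\beta_{0} + \beta_{1} a_{i}\right)} = \frac{\beta_{0} + \beta_{1}\bar{a} + \beta_{2}}{\beta_{0} + \beta_{1}\bar{a}} = \mathrm{CPR}.
$$

In the logistic model the predicted prevalence is a nonlinear function of age, so the average of the predicted prevalences generally differs from the prevalence predicted at the average age, and the two ratios need not coincide.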

According to Lee & Chia ^{9}, the marginal method provides an internally adjusted measure, which invalidates comparisons with external values of PR. With the conditional method, on the other hand, one can set the covariates to standard values instead of their sample averages, allowing for comparisons with other population studies that used the same standard values. More details about the marginal and conditional methods can be found in Lee ^{20} and Wilcosky & Chambless ^{17}.

Asymptotic confidence intervals for the conditional and marginal prevalence ratios were proposed by Flanders & Rhodes ^{21}. The authors also presented an SAS (SAS Inc., Cary, USA) script for estimating and calculating the intervals of the conditional and marginal prevalence ratio. To the best of our knowledge, this was the only implementation of these measures to date.

The prevalence ratio estimation methods are illustrated in three different databases, all containing a binary outcome Y, a binary exposure X, and at least one control variable Z.

Application 1: the first database was a toy example with 1,000 simulated observations and one continuous control variable. In this example, we simulated 1,000 binary outcomes with a binary exposure, X, and a continuous confounding variable, Z. The exposure was sampled from a Bernoulli distribution with probability 0.5, the confounding variable was sampled from a Normal distribution with mean zero and unit variance. The outcome was sampled from a Bernoulli distribution with probabilities such that: the baseline prevalence was 20%; the conditional prevalence ratio for X at Z = 0 was equal to 2; and the conditional prevalence ratio for X at Z = 1 was equal to 1.919 (regression coefficient β_{2} = 0.20). There were several possible values for the conditional prevalence ratio for X depending on Z.
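The data-generating process described above can be sketched as follows. The sketch assumes a logistic model logit(P) = b0 + b1·X + b2·Z, which is consistent with the stated values; the variable names and the random seed are arbitrary, so the simulated sample will not reproduce the authors' data.

```python
import numpy as np


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-eta))


def logit(p):
    """Log-odds of a probability."""
    return np.log(p / (1.0 - p))


# Coefficients implied by the description (logistic form is an assumption).
b0 = logit(0.20)                # baseline prevalence of 20% at X = 0, Z = 0
b1 = logit(0.40) - logit(0.20)  # conditional PR of 2 at Z = 0
b2 = 0.20                       # coefficient of the confounder Z


def conditional_pr(z):
    """True conditional prevalence ratio for X at a given value of Z."""
    return expit(b0 + b1 + b2 * z) / expit(b0 + b2 * z)


rng = np.random.default_rng(2024)  # arbitrary seed
n = 1000
x = rng.binomial(1, 0.5, size=n)   # exposure
z = rng.normal(0.0, 1.0, size=n)   # continuous confounder
y = rng.binomial(1, expit(b0 + b1 * x + b2 * z))  # binary outcome

print(f"CPR at Z = 0: {conditional_pr(0.0):.3f}")
print(f"CPR at Z = 1: {conditional_pr(1.0):.3f}")
```

Under these assumptions the conditional prevalence ratio is 2 at Z = 0 and about 1.919 at Z = 1, matching the values quoted in the text.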

Application 2: the second database referred to a cohort of 1,273 live births in 1993 in the city of Pelotas, Rio Grande do Sul, Brazil, studied with the aim of linking sociodemographic factors and reproductive health, as reported by the woman responsible for the child, to the nutritional condition of the children at 4-5 years of age ^{5}. The analysis considered underweight in 4-5 year old children (with a prevalence of 4.1%) as the outcome of interest, Y; previous hospitalization as the exposure, X; and birth weight (normal or low birth weight) as a control variable, Z. For this application, because all variables are binary, we were able to calculate the prevalence ratio applying the Mantel-Haenszel method, considered here to be the gold standard.
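For readers unfamiliar with it, the Mantel-Haenszel summary ratio for stratified 2 by 2 tables can be sketched as below. The counts are hypothetical and are not the Pelotas data; the function name and the tuple layout are choices made here for illustration.

```python
def mantel_haenszel_pr(strata):
    """Mantel-Haenszel summary prevalence ratio.

    Each stratum is a tuple (a, b, c, d):
      a = exposed with the outcome,   b = exposed without the outcome,
      c = unexposed with the outcome, d = unexposed without the outcome.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * (c + d) / n
        denominator += c * (a + b) / n
    return numerator / denominator


# Hypothetical counts for two strata of a binary control variable Z.
strata = [
    (10, 90, 20, 380),  # stratum Z = 0: stratum-specific PR = 2.00
    (6, 34, 4, 56),     # stratum Z = 1: stratum-specific PR = 2.25
]

print(f"Mantel-Haenszel PR = {mantel_haenszel_pr(strata):.3f}")
```

The summary value is a weighted combination of the stratum-specific ratios; with a single stratum it reduces to the crude prevalence ratio.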

Application 3: the third database analyzed 703 sexually active, HIV-infected women, treated between 1996 and 2007 in Rio de Janeiro, Brazil, with no history of hysterectomy. The data were collected to assess factors associated with high-grade squamous intraepithelial lesions (HSIL), lesions that can develop into cancer of the cervix ^{22}. Five variables pertaining to the HIV-infected women were included in the analysis: presence of HPV was the exposure variable, X; presence of cervical cytological abnormalities was the outcome, Y; and the control variables, Z, were age, number of pregnancies and the time since the last gynecological examination. Variables X and Y were binary variables and the others were continuous variables. The prevalence of the outcome was 4.1%.

Adjusted prevalence ratios and prevalence odds ratios were estimated by several different methods. Prevalence ratios were estimated by robust Poisson and log-binomial models, and by the conditional and marginal methods proposed by Wilcosky & Chambless ^{17}. POR were also calculated using the usual logistic regression model.
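As an illustration of one of these estimators, the sketch below fits a Poisson model with a log link to a binary outcome and computes a sandwich (robust) covariance matrix. It is a hand-written approximation of the general technique, not the code used in this study; the data are a hypothetical 2 by 2 table and the function name is an assumption. In practice a statistical package would normally be used.

```python
import numpy as np


def robust_poisson(design, y, n_iter=50):
    """Poisson regression (log link) by Newton-Raphson with a sandwich covariance.

    The first column of `design` is assumed to be the intercept.
    """
    beta = np.zeros(design.shape[1])
    beta[0] = np.log(y.mean())  # start from the overall prevalence
    for _ in range(n_iter):
        mu = np.exp(design @ beta)
        score = design.T @ (y - mu)
        information = design.T @ (design * mu[:, None])
        beta = beta + np.linalg.solve(information, score)
    mu = np.exp(design @ beta)
    bread = np.linalg.inv(design.T @ (design * mu[:, None]))
    meat = design.T @ (design * ((y - mu) ** 2)[:, None])
    return beta, bread @ meat @ bread


# Hypothetical data: 40 of 100 exposed and 20 of 100 unexposed have the outcome.
x = np.repeat([1.0, 1.0, 0.0, 0.0], [40, 60, 20, 80])
y = np.repeat([1.0, 0.0, 1.0, 0.0], [40, 60, 20, 80])
design = np.column_stack([np.ones_like(x), x])

beta, covariance = robust_poisson(design, y)
log_pr = beta[1]
se = np.sqrt(covariance[1, 1])
lower, upper = np.exp(log_pr - 1.96 * se), np.exp(log_pr + 1.96 * se)
print(f"PR = {np.exp(log_pr):.3f}, 95%CI = {lower:.3f}, {upper:.3f}")
```

With a single binary exposure the estimated PR equals the crude prevalence ratio, and the robust variance of its logarithm equals the usual large-sample variance for a log risk ratio.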

The different methods to obtain prevalence ratios were coded in R (The R Foundation for Statistical Computing, Vienna, Austria; http://www.r-project.org). An R function to estimate the conditional and marginal prevalence ratios, as proposed by Wilcosky & Chambless ^{17}, is available (Figure 1).

Results

Application 1: toy example

Table 1 presents estimates obtained through different methods for the prevalence ratio of the variable X. The true prevalence ratio depends on Z, which follows a standard normal distribution. Hence, the true conditional prevalence ratio varies from 1.71 to 2.25 for Z varying from -1.96 to 1.96. The crude prevalence ratio (1.477) underestimates this range of the true prevalence ratio, whereas the crude and the adjusted prevalence odds ratio (2.528 and 2.537) overestimate the true range (although their confidence intervals overlap with some of the true range) (Table 1). The adjusted prevalence ratios are all very similar, and all provide reasonable estimates (Table 1). The estimates differ only in the second or third decimal places, with the smallest estimated value in the log-binomial model and the largest in the conditional prevalence ratio.

| Regression model (measure) | Estimate | 95%CI |
|---|---|---|
| Robust Poisson (PR) | 1.950 | 1.573, 2.416 |
| Log-binomial (PR) | 1.942 | 1.575, 2.418 |
| Logistic regression (POR) | 2.537 | 1.905, 3.398 |
| Logistic regression (CPR) | 1.956 | 1.578, 2.425 |
| Logistic regression (MPR) | 1.949 | 1.574, 2.414 |

CPR: conditional prevalence ratio; MPR: marginal prevalence ratio; POR: prevalence odds ratio; PR: prevalence ratio. Note: the true conditional prevalence ratio for X varies from 1.71 to 2.25 depending on the value of Z (-1.96 to 1.96).

Application 2: underweight in 4-5 year-old children in Pelotas

Table 2 presents the adjusted prevalence ratio of the occurrence of underweight in 4-5 year-old children (outcome) by previous hospitalization (exposure) controlled by birth weight (normal or low birth weight).

| Regression model (measure) | Estimate | 95%CI |
|---|---|---|
| Mantel-Haenszel (PR) | 2.483 | 1.456, 4.235 |
| Robust Poisson (PR) | 2.479 | 1.454, 4.226 |
| Log-binomial (PR) | 2.481 | 1.447, 4.226 |
| Logistic regression (POR) | 2.641 | 1.481, 4.671 |
| Logistic regression (CPR) | 2.532 | 1.471, 4.357 |
| Logistic regression (MPR) | 2.460 | 1.451, 4.171 |

CPR: conditional prevalence ratio; MPR: marginal prevalence ratio; POR: prevalence odds ratio; PR: prevalence ratio.

Despite the low prevalence of the outcome (4.1%), a difference of 0.169 was observed between the crude PR (2.902) and the crude POR (3.071) for previous hospitalization. According to the crude PR, there was a greater prevalence of underweight among children who were previously hospitalized when compared with those without previous hospitalization. The log-binomial model, robust Poisson model, marginal method and Mantel-Haenszel method gave similar adjusted prevalence ratios (2.481, 2.479, 2.460, and 2.483, respectively). The largest adjusted estimates were the POR (2.641) and the conditional prevalence ratio (2.532).

Application 3: cervical cytological abnormalities in HIV-infected women

Table 3 shows the influence of high-risk HPV (exposure) on the occurrence of cervical cytological abnormalities in HIV-infected women, controlled for age, number of pregnancies and time since last gynecological examination. Despite the low prevalence, the crude POR (7.909) differed from the crude PR (7.360) by 0.549. According to the crude PR, the prevalence of cytological abnormalities was approximately 640% higher among women with high-risk HPV. The adjusted POR was the highest estimated value (7.990). The adjusted prevalence ratios obtained by the log-binomial model, the robust Poisson model and the marginal method were very similar. The conditional method gave the largest adjusted prevalence ratio (7.529, versus 7.118 to 7.192 for the other adjusted methods).

| Regression model (measure) | Estimate | 95%CI |
|---|---|---|
| Robust Poisson (PR) | 7.123 | 2.489, 20.388 |
| Log-binomial (PR) | 7.192 | 2.849, 24.135 |
| Logistic regression (POR) | 7.990 | 3.029, 27.531 |
| Logistic regression (CPR) | 7.529 | 2.617, 21.665 |
| Logistic regression (MPR) | 7.118 | 2.518, 20.124 |

CPR: conditional prevalence ratio; MPR: marginal prevalence ratio; POR: prevalence odds ratio; PR: prevalence ratio. Note: Z variables = age, number of pregnancies, and time since last gynecological examination.

Discussion

Difficulties in obtaining prevalence ratios in cross-sectional studies have been investigated by several authors in recent years. Several authors use strategies for indirect calculation of the PR using the Breslow-Cox and Poisson models (with and without robust variance), while others interpret the prevalence odds ratio obtained in logistic regression models as a prevalence ratio.

Lee & Chia ^{9} were the first authors to discuss methods proposed for estimating the PR. Until then, most cross-sectional studies in health used the logistic regression model estimate (POR), since it has the advantage of adjusting for the confounding or modifying effects of other variables. When the outcome is prevalent, however, the POR is a poor estimate of the prevalence ratio, overestimating the PR by up to 27 times ^{23}.
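The size of the discrepancy follows from a standard identity, written here with p_{1} and p_{0} denoting the prevalences among exposed and unexposed individuals (symbols introduced for illustration):

$$
\mathrm{POR} = \frac{p_{1}/(1-p_{1})}{p_{0}/(1-p_{0})} = \mathrm{PR} \times \frac{1-p_{0}}{1-p_{1}}.
$$

When both prevalences are small, the factor (1-p_{0})/(1-p_{1}) is close to 1 and the POR approximates the PR. When the PR exceeds 1 and p_{1} approaches 1, the factor grows without bound, which is how very large overestimates can arise.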

Regarding the estimation of the adjusted prevalence ratio, in our examples the log-binomial model, robust Poisson model, and marginal prevalence ratio provided similar estimates. The conditional prevalence ratio (CPR) differed from the other estimates but was still smaller than the adjusted POR. The CPR proposed by Wacholder ^{13} is the prevalence ratio conditional on the mean values of the covariates, yet one could condition on any value for the confounding variables (a higher or lower risk scenario). For instance, the prevalence of cervical cytological abnormalities in the 703 HIV-infected women (Application 3) was estimated for those women who had high-risk HPV (X = 1), based on their respective values for age, number of pregnancies, and time since last gynecological examination (Z variables), and the mean value of the prevalence was computed. Similar calculations were performed for those women who were not diagnosed with high-risk HPV (X = 0). More detailed information on the conditional and marginal methods can be found in Wacholder ^{13} and Wilcosky & Chambless ^{17}.

The main advantage of the log-binomial and robust Poisson models is that they are already implemented in most popular statistical packages. The log-binomial model has the disadvantage of not using a proper link function, leading to numerical instability in the estimation process and resulting in non-convergence issues. The COPY method ^{6} was proposed to achieve convergence with the log-binomial model, but this method is only available in SAS, which is proprietary software. The robust Poisson model assumes that all the events in the database occurred at the same time. In addition, use of the Poisson distribution is not appropriate for modeling a binary outcome ^{6}. However, it is important to highlight that the likelihood of the Poisson model has been used only to obtain an estimating equation and not for the purpose of modeling a binary response variable. The Schouten et al. ^{14} approach can be implemented easily in any statistical package, but by changing the database it introduces extra uncertainty that should be properly treated. The approach used by Wilcosky & Chambless ^{17}, unlike the log-binomial model, does not suffer from convergence problems.

One limitation of our results is that there is no "gold standard" for choosing the best method, especially when there is a continuous explanatory variable. In Application 2, where all explanatory variables were binary, the Mantel-Haenszel method was used as the "gold standard". For this application, the prevalence ratios estimated by the log-binomial model, the robust Poisson model and the marginal method were similar to the one obtained by the Mantel-Haenszel method. We thus conclude that, in this application, these models were equivalent. In this paper we have not explored robust methods based on quasi-likelihood estimation ^{15}.

In summary, we recommend the use of the direct approach proposed by Wilcosky & Chambless ^{17} because it assumes a binomial model, which is suitable for a binary response variable, it has no convergence difficulties and it is now available as a package for the open source statistical software R. The estimates of the marginal prevalence ratio are similar to those of other methods, while the conditional prevalence ratio shows the prevalence ratio for an average person in the database. If one is interested in a particular set of control variables, one need only specify the values of those variables. The authors are developing an R package with the Wilcosky & Chambless approach, which will be available along with this article.