Obtaining adjusted prevalence ratios from logistic regression model in cross-sectional studies

In recent decades, the use of the prevalence ratio (PR) rather than the odds ratio as the measure of association in cross-sectional studies has been widely discussed. The main difficulties in using statistical models to calculate the PR are convergence problems, the availability of adequate tools, and strong assumptions. The goal of this study is to illustrate how to estimate the PR and its confidence interval directly from logistic regression estimates. We present three examples and compare the adjusted PR estimates with those obtained from log-binomial and robust Poisson regression and with the adjusted prevalence odds ratio (POR). The marginal and conditional prevalence ratios estimated from logistic regression showed the following advantages: no numerical instability, simple implementation in statistical software, and an appropriate probability distribution assumed for the outcome.


Introduction
During the last decades, several authors [1][2][3][4][5][6][7] have studied which measure of association is best estimated in cross-sectional studies. The consensus is that the prevalence odds ratio (POR) is a good approximation of the prevalence ratio (PR) 8 only when the event is rare. Logistic regression is the most popular statistical model for estimating the POR because of its ease of interpretation and computational implementation. However, when the chosen measure of association is the PR, this model produces poor estimates when the event is not rare. In this context, several authors have proposed alternative methods to logistic regression for estimating the PR. Lee 9 was one of the first authors to use Cox models with Breslow's modification (Breslow-Cox model) to estimate prevalence ratios, but the resulting standard errors and, consequently, confidence intervals are not correct. Although a correction for the standard errors obtained from Cox models had already been proposed 10 , Lee 9 did not apply it. Barros and Hirakata 5 used the fact that Breslow-Cox and Poisson models estimate the same effects 11 and applied Poisson regression models with robust variance to estimate the PR. Zou (2004) 12 provided a simulation study demonstrating the reliability of the Poisson model with robust variance for estimating the PR in 2 by 2 tables. The main problem with using the Poisson model to estimate the PR is that it applies a probability distribution for counts to a response variable that is dichotomous (presence or absence of an outcome). Skov et al. (1998) 4 used a generalized linear model with binomial distribution and log link (log-binomial model) to estimate the PR directly 13 . Although this model makes it possible to estimate the PR directly and assumes the appropriate probability distribution for the type of response variable, lack of convergence in the presence of continuous variables remains a problem.
To address this problem, Deddens 6 introduced the COPY method, which finds an approximation to the maximum likelihood estimate (MLE) when the log-binomial model fails to converge.
Because of the convergence problem of the log-binomial model, Schouten et al. 14 proposed a simple data manipulation that allows logistic regression to be used to obtain the PR. It consists of modifying the data set by duplicating the rows in which the event occurs and changing the outcome in the duplicated rows from event to non-event [14][15][16] .
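The effect of this data manipulation can be seen on a 2 by 2 table. The sketch below is not taken from the cited papers: the counts and the helper names (`expand_for_pr`, `odds_ratio`, `prevalence_ratio`) are hypothetical. It illustrates why the odds ratio computed on the modified data set equals the prevalence ratio of the original data: after the duplication, the number of non-events in each exposure group equals the size of that group.

```python
from collections import Counter

def expand_for_pr(rows):
    """Append a copy of every event row with the outcome recoded as non-event."""
    expanded = list(rows)
    expanded.extend((x, 0) for x, y in rows if y == 1)
    return expanded

def odds_ratio(rows):
    """Crude odds ratio of the outcome, exposed versus unexposed."""
    c = Counter(rows)
    return (c[(1, 1)] / c[(1, 0)]) / (c[(0, 1)] / c[(0, 0)])

def prevalence_ratio(rows):
    """Crude prevalence ratio of the outcome, exposed versus unexposed."""
    c = Counter(rows)
    p_exposed = c[(1, 1)] / (c[(1, 1)] + c[(1, 0)])
    p_unexposed = c[(0, 1)] / (c[(0, 1)] + c[(0, 0)])
    return p_exposed / p_unexposed

# Hypothetical data as (exposure, outcome) pairs:
# prevalence 40/100 among the exposed and 20/100 among the unexposed.
rows = [(1, 1)] * 40 + [(1, 0)] * 60 + [(0, 1)] * 20 + [(0, 0)] * 80

pr_original = prevalence_ratio(rows)            # ratio of the two prevalences
or_original = odds_ratio(rows)                  # larger than the PR here
or_expanded = odds_ratio(expand_for_pr(rows))   # coincides with pr_original
```

The sketch covers only the point estimate. Because the modified data set contains duplicated rows, the variance of the estimate requires separate treatment.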
Another approach was proposed by Wilcosky and Chambless (1985) 17 , who developed a direct adjustment of epidemiological measures from binary regression using the conditional and marginal methods 10 . An advantage of this method is that it assumes a binomial probability distribution for the response, which matches the nature of the response variable observed in cross-sectional studies. We found one article 18 that uses the Wilcosky and Chambless 17 method to estimate the PR, but it does not mention the software implementation. Recently, Ospina and Amorim (2013) 18 developed a package for R (prLogistic) to estimate marginal and conditional PRs with bootstrap and delta-method confidence intervals, but neither the details of this package nor the differences between the several methods for estimating the PR have been presented in a scientific article.
In this article we implement the direct approach to estimating the prevalence ratio from binary regression models, based on Wilcosky and Chambless (1985) 17 , and compare it with other methods for estimating the prevalence ratio presented in the literature. Three different data sets are used to illustrate our study.

Methods
We use real and simulated data to compare prevalence ratio (PR) estimates obtained with the marginal and conditional methods based on the approach proposed by Wilcosky and Chambless (1985) 17 . These estimates are also compared with those obtained from the binomial, log-binomial and robust Poisson/Cox models.
It is well known that the probability of occurrence of a disease (called prevalence in cross-sectional studies), adjusted for two or more variables, can be estimated with the logistic model. Suppose, for example, that information is available on diabetes status (1: Yes / 0: No), age (continuous) and obesity (1: Yes / 0: No) for a defined population. The probability of diabetes can then be obtained from the equation below.
$$P(\text{DIABETES} = 1 \mid \text{AGE}, \text{OBESITY}) = \frac{1}{1 + \exp\left[-(\beta_0 + \beta_1\,\text{AGE} + \beta_2\,\text{OBESITY})\right]} \qquad (1)$$
where β 0 , β 1 and β 2 are regression coefficients estimated from the data. Note that exp(β 2 ) estimates the odds ratio for diabetes in obese compared with non-obese individuals, adjusted for age. However, if we are interested in the PR for diabetes in obese compared with non-obese individuals, adjusted for age, we can proceed in two ways, as described below:

Marginal Model
In each stratum of the variable OBESITY (Yes or No), the diabetes prevalence is calculated with equation (1) for each age value observed in the dataset. The PR is the ratio between the average prevalences of the two strata. Wilcosky and Chambless (1985) 17 call this estimate the marginal prevalence ratio (MPR).
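A minimal sketch of the marginal calculation follows. The coefficient values and the ages are hypothetical and were not estimated from any data set; in practice they would come from a fitted logistic model. The sketch assumes the standardization reading of the method: the prevalence is predicted for every observed age with OBESITY set to 1 and then to 0, and the two averages are divided. The function names are illustrative only.

```python
import math

def prevalence(age, obese, b0, b1, b2):
    """Logistic prediction of P(diabetes = 1 | age, obesity)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age + b2 * obese)))

def marginal_pr(ages, b0, b1, b2):
    """Ratio of the average predicted prevalences, obese versus non-obese."""
    mean_obese = sum(prevalence(a, 1, b0, b1, b2) for a in ages) / len(ages)
    mean_non_obese = sum(prevalence(a, 0, b0, b1, b2) for a in ages) / len(ages)
    return mean_obese / mean_non_obese

# Hypothetical coefficients (intercept, age, obesity) and observed ages.
b0, b1, b2 = -4.0, 0.05, 0.9
ages = [25, 34, 41, 48, 55, 63, 70]

mpr = marginal_pr(ages, b0, b1, b2)   # marginal prevalence ratio
por = math.exp(b2)                    # adjusted prevalence odds ratio
```

With a positive obesity coefficient, the MPR lies between 1 and the adjusted POR, which is the pattern expected when the outcome is not rare.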

Conditional Model
In each stratum of the variable OBESITY, the diabetes prevalence is calculated with equation (1), with age set to its average value in the dataset. The ratio of the two prevalences is then calculated. Wilcosky and Chambless (1985) 17 named this estimate the conditional prevalence ratio (CPR).
In linear regression, both approaches give the same value. In the logistic model, however, we observed significant differences between the two estimates when p is close to zero or one.
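The conditional calculation, and the way it can diverge from the marginal one, can be sketched as follows. Coefficients and ages are hypothetical, and the function names are illustrative. The second pair of calls sets the age coefficient to zero to show a case in which the two approaches coincide, because the predicted prevalence then no longer depends on age.

```python
import math

def prevalence(age, obese, b0, b1, b2):
    """Logistic prediction of P(diabetes = 1 | age, obesity)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age + b2 * obese)))

def conditional_pr(ages, b0, b1, b2):
    """Ratio of predicted prevalences with age fixed at its sample mean."""
    mean_age = sum(ages) / len(ages)
    return prevalence(mean_age, 1, b0, b1, b2) / prevalence(mean_age, 0, b0, b1, b2)

def marginal_pr(ages, b0, b1, b2):
    """Ratio of the predicted prevalences averaged over all observed ages."""
    obese = sum(prevalence(a, 1, b0, b1, b2) for a in ages) / len(ages)
    non_obese = sum(prevalence(a, 0, b0, b1, b2) for a in ages) / len(ages)
    return obese / non_obese

ages = [25, 34, 41, 48, 55, 63, 70]   # hypothetical observed ages

# Hypothetical model with an age effect: the two estimates differ.
cpr = conditional_pr(ages, -4.0, 0.05, 0.9)
mpr = marginal_pr(ages, -4.0, 0.05, 0.9)

# Same model without an age effect: the two estimates agree.
cpr_flat = conditional_pr(ages, -4.0, 0.0, 0.9)
mpr_flat = marginal_pr(ages, -4.0, 0.0, 0.9)
```

The difference arises because the logistic function is nonlinear: the prevalence at the average age is not the average of the prevalences over ages.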
According to Lee 9
Asymptotic confidence intervals for the conditional and marginal prevalence ratios were proposed by Flanders and Rhodes (1987) 20 . The authors also presented SAS code to estimate the conditional and marginal prevalence ratios and to calculate their confidence intervals. In this paper, we implemented their functions and compared them with other methods by applying them to different data sets in the Results section.
The prevalence ratio estimation methods are illustrated with three different databases, all containing a binary outcome Y, a binary exposure X, and at least one controlling variable Z.
Application 1: The first database is a toy example with 1000 simulated observations and one continuous controlling variable. In this example, we simulated 1000 binary outcomes with a binary exposure, X, and a continuous confounding variable, Z. The exposure was sampled from a Bernoulli distribution with probability 0.5, and the confounding variable was sampled from a Normal distribution with mean zero and unit variance. The outcome was sampled from a Bernoulli distribution with probabilities such that the baseline prevalence equals 20%, the conditional prevalence ratio for X at Z = 0 equals 2, i.e. PR_{X|Z=0} = 2, and the regression coefficient of Z is β 2 = 0.20, so that the conditional prevalence ratio for X at Z = 1 is PR_{X|Z=1} = 1.9186. The conditional prevalence ratio for X therefore takes different values depending on Z.
The different methods to obtain prevalence ratios were coded in R 22 . The code is available in the appendix.
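The design of the toy example can be checked numerically. The sketch below assumes that the outcome follows a logistic model with linear predictor β 0 + β 1 X + β 2 Z, which the text does not state explicitly. Under that assumption, the baseline prevalence of 20% and the prevalence ratio of 2 at Z = 0 determine β 0 and β 1 . The appendix code is written in R; Python is used here for illustration only, and the seed and variable names are arbitrary.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

b0 = logit(0.20)         # baseline prevalence of 20% at X = 0, Z = 0
b1 = logit(0.40) - b0    # prevalence of 40% at X = 1, Z = 0, so the PR is 2
b2 = 0.20                # coefficient of Z given in the text

def conditional_pr(z):
    """Prevalence ratio for X at a fixed value of Z."""
    return expit(b0 + b1 + b2 * z) / expit(b0 + b2 * z)

pr_z0 = conditional_pr(0.0)
pr_z1 = conditional_pr(1.0)

# Simulate 1000 observations under the assumed model (arbitrary seed).
rng = random.Random(2024)
data = []
for _ in range(1000):
    x = 1 if rng.random() < 0.5 else 0   # Bernoulli(0.5) exposure
    z = rng.gauss(0.0, 1.0)              # standard Normal confounder
    y = 1 if rng.random() < expit(b0 + b1 * x + b2 * z) else 0
    data.append((y, x, z))
```

Under these assumptions, `pr_z0` equals 2 and `pr_z1` is approximately 1.9186, which agrees with the value reported for Z = 1.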

Results
Application 1: Toy example
POR = prevalence odds ratio; PR = prevalence ratio.

Discussion
Difficulties in obtaining the prevalence ratio in cross-sectional studies have been investigated by several authors in recent years. Some authors calculate the PR indirectly with Breslow-Cox and Poisson models (with and without robust variance), while others interpret the prevalence odds ratio obtained from logistic regression models as a prevalence ratio. Lee 9 was one of the first authors to discuss the methods proposed for estimating the prevalence ratio. Until then, most cross-sectional studies in health used the logistic regression model, since it has the advantage of adjusting the effects (PORs) for several variables, whether confounders or effect modifiers. However, the POR can be a poor estimate of the prevalence ratio and can be up to 27 times larger when the outcome is prevalent 23 .
Regarding the estimation of adjusted prevalence ratio, in our examples the estimates pro-