Methodology in the epidemiological research of respiratory diseases and environmental pollution

Received on 30/5/2001. Reviewed on 24/8/2001. Approved on 31/10/2001.
Keywords: Research, methods. Environmental pollution. Respiratory diseases, epidemiology. Epidemiological methods.
Abstract
There are complex and diverse methodological problems involved in the clinical and epidemiological study of respiratory diseases and their etiological factors. The association of urban growth, industrialization and environmental deterioration with respiratory diseases makes it necessary to pay more attention to this research area with a multidisciplinary approach. Appropriate study designs and statistical techniques to analyze and improve our understanding of the pathological events and their causes must be implemented to reduce the growing morbidity and mortality through better preventive actions and health programs. The objective of the article is to review the most common methodological problems in this research area and to present the most widely available statistical tools in use.


INTRODUCTION
Epidemiological research on respiratory diseases (RD) and their association with the environment is attracting growing interest from health planners. The increasing morbidity and mortality due to RD in urban and rural areas since the 1950s makes it necessary to implement study designs that approach these problems from diverse perspectives. Rapid developments in computer technology over the last two decades have allowed the programming of tools that facilitate statistical analysis. Since personal computers can process large quantities of data, programs for multivariate and time series statistical analyses, among others, have proliferated, contributing to the understanding of associations among variables and to the generation of models for projections. The health sciences have benefited from these instruments because of the information now available and the possibility of exploring complex relations with them.
The objective of the present article is to analyze and review some methodological problems commonly seen in RD and environment research as well as to present the major statistical procedures used for analysis.

POLLUTION, RESPIRATORY DISEASES AND ENVIRONMENT
Acute and chronic RD associated with environmental pollution represent a relevant concern throughout most developed and developing countries, 26,41 particularly in urban areas due to growing urbanization and industrialization. 8,28,30 Changes in population morbidity and mortality patterns across time have been described as an epidemiological transition, 7,23,34,35 in which the last stage is characterized by an increase in non-communicable chronic diseases while the incidence of infectious diseases decreases. Some researchers are currently trying to characterize a new stage of transition in which lifestyles and better prevention reduce the incidence of non-communicable chronic diseases. 13,29,42 The relationship between urbanization and disease is a topic of analysis that requires a convergence of multiple disciplines and methods to increase knowledge and understanding of complex and multicausal processes. 31,38 Although it is relatively simple to identify the factors, direction, magnitude and projection of a trend (e.g., due to season, weather or environmental conditions), proper methodological designs and statistical procedures are needed to obtain relevant information. Thus environmental epidemiology and other related specialties are becoming essential. 11,12,26,43,45 Some of the common challenges in the study of this relationship are presented here.

Selection of the sample and external validity
As in most population studies, an updated list with all eligible study subjects is generally unavailable. Therefore, basic sampling techniques, such as simple and stratified random sampling, must be replaced by complex sampling techniques, such as cluster sampling, which are useful for large populations and large areas. This technique requires a list of clusters covering sections of the population (e.g., geographic zones) from which a random sample is selected (the design becomes more complex if, for example, subclusters are selected within each cluster, as often happens in national surveys). 22,27,48 Among the advantages of this technique, the most important are the shorter time required to collect data, lower cost and fewer instrumentation errors. The main disadvantage is the increased complexity of the statistical analysis, since the sample size must be adjusted for the error introduced by the sampling technique. Also, expansion factors must be calculated for each cluster to estimate how many cases in the universe are represented by each particular subject. It is important to anticipate a design with enough power and confidence to detect the main outcome. Only then will useful and accurate data be obtained. By keeping good control over the study variables, the sample size can be estimated from the known variability and the desired change in the outcome variable and from the number of control factors, taking into account power and significance.
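The sample size adjustment and the expansion factors described above can be illustrated with a minimal sketch. It assumes equal-sized clusters, a two-stage design and the standard design effect formula 1 + (m - 1) × ICC; the function names and all numeric inputs are hypothetical and are not taken from any of the cited studies.

```python
import math

def srs_sample_size(p, margin, z=1.96):
    """Sample size to estimate a proportion p within +/- margin
    under simple random sampling (z = 1.96 for 95% confidence)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC,
    where ICC is the intracluster correlation (assumed known)."""
    return 1 + (cluster_size - 1) * icc

def cluster_sample_size(p, margin, cluster_size, icc, z=1.96):
    """Simple random sampling size inflated by the design effect."""
    return math.ceil(srs_sample_size(p, margin, z) * design_effect(cluster_size, icc))

def expansion_factor(clusters_total, clusters_sampled,
                     persons_in_cluster, persons_sampled):
    """Inverse of the two-stage selection probability: the number of
    people in the universe represented by one sampled subject."""
    p_select = (clusters_sampled / clusters_total) * (persons_sampled / persons_in_cluster)
    return 1 / p_select

# Hypothetical inputs: 20% prevalence, 5-point margin, 20 subjects per cluster.
print(cluster_sample_size(0.2, 0.05, cluster_size=20, icc=0.05))
print(round(expansion_factor(100, 10, 200, 20)))
```

In this illustration each sampled subject stands for roughly 100 people, and the required sample is almost twice the size that simple random sampling would need.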

Complex etiology
The complex etiology of RD due to environmental pollution involves diverse and often multidimensional factors (temperature, gases and particles, among others) that interact and make the study results difficult to interpret. 6,11 The most frequently investigated pollutants are sulfur dioxide (SO2), black smoke, nitrogen dioxide (NO2), airborne particles, ozone and acid rain. 36,37,39,44,47 Some of the most commonly investigated confounding factors at the environmental level are socioeconomic status, seasonality, sustained meteorological patterns, day of the week, vacation time, maximum temperature and humidity, which affect individual indicators such as tobacco consumption, nutritional status and health. 4,5,9,49 It is complicated to design a prospective study in which all possible exposure factors are controlled in the groups.
For example, to compare exposure to high vs. low levels of ozone, controlling for low and high levels of NO2 and considering three levels of temperature and three levels of socioeconomic status, there would be a 2 × 2 × 3 × 3 matrix, or 36 comparison groups. Each additional factor would increase the number of groups, making the interpretation of results more complex. In addition, the categorization of ozone, NO2, temperature and socioeconomic status could lower data quality.
Such studies are complex to implement experimentally, not only for logistical reasons but also for ethical ones (e.g., having a group of subjects exposed to a presumably unhealthy effect). However, with proper monitoring measures it is possible to study exposure to a specific controlled condition in humans. 14 An alternative to this type of study is to control for covariates through multivariate methods of analysis. This approach makes it possible to include variables in their continuous form, which increases statistical power, though sample size calculations must take these control variables into account.
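The growth in the number of comparison groups can be checked with a short sketch that enumerates every combination of factor levels. The level labels and the extra "season" factor are hypothetical and serve only to show how one more factor multiplies the number of cells.

```python
from itertools import product

def count_groups(factors):
    """Number of comparison groups in a full factorial layout:
    one group per combination of factor levels."""
    return len(list(product(*factors.values())))

base_factors = {
    "ozone": ["low", "high"],
    "no2": ["low", "high"],
    "temperature": ["low", "medium", "high"],
    "socioeconomic_status": ["low", "medium", "high"],
}

# Hypothetical additional two-level factor, added for illustration.
extended_factors = dict(base_factors, season=["cold", "warm"])

print(count_groups(base_factors))      # 2 x 2 x 3 x 3 = 36
print(count_groups(extended_factors))  # 36 x 2 = 72
```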

Complex interactions
Respiratory diseases are intimately related to environmental conditions such as temperature, ozone levels, CO2, etc. These conditions vary seasonally and have a differential effect on individuals through time. These factors must generally be considered in population studies of RD because they could be a source of confounding when not controlled. Disregarding socioeconomic status (a proxy for health behaviors) as a control variable could lead to biased results when, for example, estimating risk by region, since there could be differences due to heterogeneous socioeconomic levels in the locations under study. This type of research is basically characterized by the interaction of different factors and by the fact that a mixture of pollutants, rather than a single agent, could produce an observed effect. 6,26 Also, it is commonly observed that, unless a study is specifically designed for it, its power to detect the association between a single pollutant and RD is weak. 11 It is generally helpful to keep data in raw, non-aggregated form so as to obtain maximum precision from the information gathered and be able to explore the data. Categorization should be performed later, in the analysis phase, when some decisions are required (e.g., dividing groups into tertiles or creating an indicator variable if necessary). 2,3,27

"Changing" conceptual framework
As with other health problems, the conceptual framework for studying the association of RD and environmental pollution is complex and context-specific. Various factors could affect the outcome directly or through mediators (e.g., environmental temperature per se could modify certain RD outcomes, but it may also affect ozone levels and free suspended air particles, creating an intermediate effect). Factors that are relevant in one context might not be important in another (e.g., CO2 levels in rural communities with a high prevalence of RD might not have the same impact as in urban communities). Therefore, the conceptual framework on which hypotheses and data collection are based must be preceded by an exploratory study, a careful literature review and/or an in-depth analysis of the particular study scenario by experienced professionals.

Multivariate methods
As in other research areas, the methods for studying RD due to environmental pollution are limited by creativity, resources, time, study design and data quality. There are a number of procedures commonly used to approach these difficulties. A combination of time series and multiple regression analysis is often performed. In most cases, the objective is to evaluate whether there is an association between RD and other identified factors, such as potentially harmful exposures, and to estimate its magnitude.
Multivariate methods analyze the relation of at least three variables. Among these methods, one of the most popular is multiple regression analysis, in which the relation of one or more independent variables to a dependent one is explored. This method is commonly applied in environmental pollution and RD research. It helps to estimate how much the dependent variable (e.g., acute respiratory infections in children <5 years old) varies across the levels of the independent or predictor variables (e.g., ozone levels, environmental temperature, free suspended air particles and nutritional status). This method of analysis is useful in experimental, quasi-experimental and observational designs, the latter being the most likely to be biased because individuals are not randomly assigned to the different exposures or treatments. Nonetheless, cross-sectional designs are frequently used to analyze associations between environmental agents and RD because of their simplicity and cost, but their results must be considered cautiously and compared with similar research for consistency. Multiple regression models are a powerful alternative; however, they are based on assumptions that must be carefully reviewed when the analysis is performed to verify the adequacy of the model. In addition, the number of cases and the interactions between independent variables should be considered. 25,33
The assumptions to be taken into account are the following: 1) normality and variance equality: for each value of the independent variable x, there is a normal distribution of the dependent variable y, with the same variance at every value of x. For example, the number of asthma hospital admissions will not always be the same every time ozone is high, but if hospital admissions are analyzed by ozone level, a normal distribution will be observed; 2) independence: cases have to be statistically independent of one another. In other words, an observation cannot be influenced by or have an effect on another one; 3) linearity: the mean values of y for each x fall on a straight line (or plane, when more than one independent variable is considered), which is the population regression line.
Linearity, normality and variance equality issues can often be addressed by transforming the variables with exponential, logarithmic or reciprocal functions. After the analysis, it is convenient to report the values in their original form to facilitate the interpretation of results.
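As a minimal sketch of the transform-then-report idea, the example below fits a straight line to the logarithm of a synthetic outcome and then converts predictions back to the original scale. The data are generated from an exact exponential curve so the result can be checked; the variable names and values are hypothetical, and a single predictor is used for brevity, whereas multiple regression would include several.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic, noise-free data: admissions grow exponentially with ozone level.
ozone = [20.0, 40.0, 60.0, 80.0, 100.0]
admissions = [5.0 * math.exp(0.02 * level) for level in ozone]

# The relation is curved on the original scale but linear after a log transform.
log_admissions = [math.log(a) for a in admissions]
intercept, slope = fit_line(ozone, log_admissions)

def predict_admissions(level):
    """Back-transform the fitted log value to the original scale."""
    return math.exp(intercept + slope * level)

print(round(slope, 4), round(predict_admissions(60.0), 2))
```

On the log scale the slope is additive; back on the original scale, exp(slope) is the multiplicative change in the outcome per unit of the predictor, which is usually the easier quantity to report.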
Another useful multivariate procedure is partial regression, in which the relation between two or more variables is analyzed while a factor (or variable) is held constant. An example would be to study the levels of free suspended air particles associated with the development of asthma episodes in children while holding ozone at a given level. The result could then be compared with a regression curve without the constant to clarify the existing relation. 25
When the dependent variable has only two possible values, the assumptions are violated since a normal distribution of the error cannot be expected. Likewise, the predicted values are not constrained to fall between 0 and 1 (the two possible values) and, therefore, the results cannot be interpreted. In these cases the logistic regression model can be used. In multiple regression the least squares method is used: the regression coefficients selected are those that give the smallest sum of squared distances between the observed and predicted values of the dependent variable. In the logistic regression model, by contrast, the parameters are estimated by the maximum likelihood method, which selects the coefficients that make the observed results most likely. 25,32 Once the model is applied, it is important to see how well the data fit. A good model is one under which the observed results have a high likelihood given the estimated parameters. 15,16 An example is a logistic regression study performed by Scarlett et al 46 to determine the association of some air pollutants with respiratory symptoms in a sample of adults. The dependent variable was the presence of previously defined respiratory symptoms (such as cough and phlegm). This information was compared with the levels of black smoke and SO2 by geographic region, adjusting for socioeconomic status and tobacco consumption. The results can be presented as crude and adjusted odds ratios or relative risks, indicating, for instance, that a given level of black smoke significantly increases the occurrence of respiratory symptoms x times (or by x percent) after adjusting for other related variables. This model is also useful in studying risk in prospective designs. 12
Other useful multivariate methods include path analysis and structural equation modeling, in which the relations among study variables are analyzed. Though based on regression analysis, these methods can provide more information since they assess the associations among independent variables and their magnitude. They are useful in elaborating and graphically displaying the network of relations among variables, identifying independent, intermediate and dependent variables. 1,2,17,25
Likewise, there are tools such as factor analysis and principal components analysis that allow, through a series of algorithms, the estimation of dimensions (or factors), each represented by a group of variables, that can predict certain events. These dimensions can be temperature, free suspended air particles and levels of gases, and each one of them is a combination of diverse variables which together affect health. It is frequently found that a dimension has greater predictive value than its single variables analyzed separately. In this sense, factor analysis generates dimensions, sometimes not evident, that can be used to simplify a model representing a group of collected variables when the extracted dimension coefficients are used. 2,40 A frequent application of this technique is to create a surrogate factor for socioeconomic status. All collected variables designed to measure it (such as house building materials, educational level, income, public services, etc.) are introduced into the model and the extracted coefficients of the main factor are used to represent the socioeconomic status dimension as a single variable, thus simplifying the final model.
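The surrogate socioeconomic factor described above can be sketched with the first principal component of a few standardized indicators. This is a simplified stand-in that uses principal components rather than a full factor analysis, with the leading component found by power iteration; the three indicators and their values are hypothetical.

```python
import math

def standardize(column):
    """Center a variable and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_component(columns, iterations=200):
    """Loadings and scores of the first principal component of the
    correlation matrix, found by power iteration."""
    z = [standardize(c) for c in columns]
    k, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    loadings = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * loadings[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        loadings = [x / norm for x in w]
    scores = [sum(loadings[j] * z[j][i] for j in range(k)) for i in range(n)]
    return loadings, scores

# Hypothetical indicators for six households.
education = [4, 6, 9, 12, 14, 16]          # years of schooling
income = [100, 150, 220, 300, 380, 500]    # monthly income, arbitrary units
housing = [1, 2, 2, 3, 4, 5]               # housing quality score

loadings, scores = first_component([education, income, housing])
print([round(v, 3) for v in loadings])
print([round(s, 2) for s in scores])
```

Each household's score can then enter a regression model as a single socioeconomic variable in place of the three original indicators.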
Log-linear models analyze complex relations among interrelated categorical or nominal variables. In these models all variables are treated as independent variables, and the dependent variable is the number of cases in each cell of a matrix in which each combination of variable categories represents one cell, 10,32 and continuous variables can be incorporated into the model as covariates. Take for example a study on the susceptibility to an increasing incidence of asthma among diverse socioeconomic status groups in two regions with different levels of air pollutants. This method is also called Poisson regression and is often used in conjunction with time series in environmental pollution and RD research. This procedure was included in the APHEA (Air Pollution and Health: a European Approach) protocol implemented in 10 European countries (15 cities) using time series data on a population of more than 25 million to study short-term adverse health effects of environmental pollution. 4,9,18,19,20,21 It was used to analyze models of discrete dependent variables (e.g., number of deaths due to RD). Instead of the normal distribution, the Poisson distribution is applied, which is adequate when rare events are modeled, such as subjects developing lung cancer over a period of time. The purpose of this analysis, as in multiple regression, is to model E(Y) accurately as a function of a group of predictor variables X1, X2, ..., Xn; however, it fits better in these scenarios. 25
There are also statistical procedures for data analysis when the variable of interest is the time until an event occurs, called survival analysis. One of the most relevant requirements for these sorts of analyses is the quality of data. The event of interest could be death, disease development, disease relapse or recovery, among others. When more than one final event is possible, the problem is one of competing risks. Survival analysis techniques could be used to study the time until a new asthma episode occurs in a population, or the time elapsed without respiratory symptoms in a sample exposed to diverse temperatures and pollution levels. Survival analysis assesses, through Cox regression, the relation between explanatory variables and the time until the event occurs. 24
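Fitting a Cox model requires a dedicated statistical package, so the sketch below illustrates time-to-event data with a simpler and different technique, the Kaplan-Meier estimator, which is not the regression method named above. It shows how censored subjects (those whose event was not observed) still contribute to the number at risk. The follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  -- follow-up time of each subject
    events -- 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated probability of remaining event-free).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events observed at time t
        leaving = 0    # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical weeks until a new asthma episode (0 = censored).
weeks = [2, 3, 3, 5, 8, 8]
episode = [1, 1, 0, 1, 0, 1]

for t, s in kaplan_meier(weeks, episode):
    print(t, round(s, 3))
```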

Time series
The other commonly employed analysis in the field of respiratory diseases and environment research is time series analysis. The study of time series was developed in economics, with applications in finance, marketing and administration research. Nowadays, however, it is commonly applied in various disciplines, among them public health, especially as a monitoring tool. This technique is particularly useful in situations where there are repetitive cycles, trends, seasonal variations and irregular fluctuations, precisely the kind of problems commonly found in research on respiratory diseases and the environment. Time series analysis uses information collected over a period of time, such as ozone levels per day or per season, average temperature per week, etc. These data could be quantitative or qualitative. Generally a mathematical model is fitted to the observed series to predict the future behavior of the series within its confidence limits. There are two general types of time series analysis. One is univariate, or auto-projective (such as the Auto-Regressive Integrated Moving Average, ARIMA), in which the prediction is derived exclusively from the previous behavior of the series itself. This technique makes it possible to predict, based on previously recorded series, the mortality due to pneumonia and influenza in a region. The other is known as cause and effect, or multivariate, in which external variables are used to establish mathematical relations between one or more time series representing the related factors. This technique could be useful to evaluate monthly bronchitis, emphysema and asthma deaths in relation to, for instance, monthly cigarette consumption, temperatures and pollution. These techniques must be known in depth before researchers can make relevant decisions. However, the Epidemiology and Monitoring Division of the Centers for Disease Control and Prevention in the United States has recently released public-domain software that allows health professionals to easily perform and interpret time series analysis for disease monitoring. 50 Diverse epidemiological studies on environmental health have employed these techniques to analyze the risks of respiratory diseases and pollution. 43 These procedures, along with Poisson regression, are a fundamental part of the APHEA protocol mentioned above.
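The univariate, auto-projective approach can be illustrated with its simplest special case, a first-order autoregressive model in which each value is predicted only from the previous one. This is a minimal sketch, not a full ARIMA procedure (there is no differencing, moving average term or seasonal component), and the series is synthetic and noise-free so that the fitted coefficients can be checked.

```python
def fit_ar1(series):
    """Least squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    previous = series[:-1]
    current = series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    sxx = sum((p - mean_prev) ** 2 for p in previous)
    sxy = sum((p - mean_prev) * (c - mean_curr)
              for p, c in zip(previous, current))
    phi = sxy / sxx
    return mean_curr - phi * mean_prev, phi

def forecast(series, c, phi, steps):
    """Project the series forward using only its own past values."""
    projected = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        projected.append(last)
    return projected

# Hypothetical weekly admission rates generated by x[t] = 2 + 0.5 * x[t-1].
weekly_rate = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]

c, phi = fit_ar1(weekly_rate)
print(round(c, 3), round(phi, 3))
print([round(v, 4) for v in forecast(weekly_rate, c, phi, 3)])
```

With real surveillance data the fit would not be exact, and forecasts should be reported with their confidence limits, as noted above.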

Other statistical procedures
Though reviewing all useful statistical procedures would be quite extensive, it is worth mentioning at least three other relevant areas. The first is Geographic Information Systems, which create digital maps and make it possible to identify complex relations that are not evident with traditional methods. However, adding variables that represent distances or physical relations could influence the outcome.
Second, there are a number of techniques for longitudinal data analysis. Most of the procedures already described have an equivalent for repeated measures that takes into account in the models the correlation among the cases. This is a central issue given that the same case is evaluated at several points in time, therefore violating the assumption of independence and requiring a different approach. 10
Methodological issues and environmental pollution Barquera S et al.
Finally, it is worth mentioning that there are a number of robust procedures for estimating trends by fitting a straight line that is not drastically affected by extreme values, as opposed to the least squares method used in multiple linear regression. Robust regression methods are based on the analysis of outliers and provide resistance against their influence. One of them is weighted least squares regression, in which a lower weight is given to suspicious values. There are also other procedures, such as resistant slope estimation through other techniques. 50 These procedures can provide alternatives when problems with the distribution of variables, often seen in population data, can be anticipated.
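The effect of giving a lower weight to a suspicious value can be seen in a small sketch of weighted least squares for a single predictor. The data and the weights are hypothetical and the weight is set by hand for illustration; in practice robust procedures derive the weights from the residuals.

```python
def weighted_line_fit(x, y, weights=None):
    """Weighted least squares for one predictor; returns (intercept, slope).
    With weights=None every point counts equally (ordinary least squares)."""
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data on the line y = 2x + 1, except for an extreme last value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 40.0]

ols_intercept, ols_slope = weighted_line_fit(x, y)
wls_intercept, wls_slope = weighted_line_fit(x, y, [1, 1, 1, 1, 1, 0.05])

print(round(ols_slope, 2))  # pulled upward by the extreme value
print(round(wls_slope, 2))  # much closer to the underlying slope of 2
```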

CONCLUSIONS
Research on environmental pollution and RD is complex and requires a proper study design (see Table for a list of common challenges in this area). The development and availability of software programs and statistical procedures have provided a significant breakthrough. The growing prevalence of communicable and noncommunicable RD makes the application of adequate designs and evaluation techniques a priority in order to understand the hypothesized relations and, ultimately, to make better decisions to prevent respiratory diseases.

Table - Common challenges in the epidemiological research of environmental pollution and respiratory diseases.