Impact of Non-Weighting in the Analysis of Data Obtained from Complex Samples

ABSTRACT Objective: To compare estimates obtained with and without data weighting. Material and Methods: Secondary data from the Oral Health Survey of the State of São Paulo (SBSP2015) were used to calculate means, standard errors of the mean and confidence intervals (CI) for the DMFT index and its components (decayed, missing and filled) in the 35-44 age group. Multiple logistic regression models were estimated with and without the weights from the sampling plan (p<0.05). Results: The estimates of the DMFT index and of the decayed component varied little whether or not the design was considered (1.1% and 2.0%, respectively). However, the missing and filled components showed greater differences between the means. The means varied upward or downward by up to 6.7% between weighted and unweighted analyses. The standard error was underestimated in the unweighted analysis, and the confidence intervals differed. Differences were detected between the regression models obtained from the weighted and unweighted analyses. Conclusion: Although weighted and unweighted analyses differed by less than 10% in the estimates of the mean, the confidence intervals and the statistical inferences were different. Thus, weighting should be applied in the analysis of population-based data collected by sampling with complex designs.


Introduction
Population-based epidemiological studies require a design that allows the characteristics of the studied population to be analyzed, using a sampling plan that is representative of the population and in which each element has a known, non-zero probability of being selected. Thus, one way to guarantee that the data represent the population is randomization through probability sampling [1,2].
Probability sampling allows the researcher to observe how the phenomenon studied is represented in the population and to calculate confidence intervals and statistical significance. For this, the researcher must define at the planning stage which sampling method will be used: Simple Random Sampling (SRS), Systematic Sampling (SS), Stratified Random Sampling (AE) or Cluster Sampling (AC). The last one, which guides our research, is used in large-scale population surveys, where each sampling unit is a group of elements [3].
A major challenge for population-based surveys is the analysis of data collected by complex sampling (a combination of several probabilistic sampling methods for selecting a sample). Using a combined method of sampling allows the researcher to visit compact areas instead of dispersed areas individually, reducing time and expenses [4].
In cluster sampling, the population is divided into groups that function as Primary Sampling Units (PSU), which are linked to the clusters. Each cluster is composed of several units, called Secondary Sampling Units (SSU). Within the PSU, a sample can be selected in one or more stages. Thus, this type of sampling is recommended for populations of states, cities, neighborhoods, and census sectors [5].
It is worth noting that, when only the selected clusters are investigated, imprecision increases because of possible differences among sample units within and between clusters, which makes it necessary to increase the sample size. The data must therefore be treated accordingly in the analysis, because they cannot be considered independent, as they would be in simple random sampling [6,7]. The main challenges in analyzing data from complex samples are obtaining correct point estimates and estimating the standard deviation and variance. This matters because the aim is usually to estimate an overall mean rather than a mean for each group or area. Studies make this analytic choice, but little attention has been paid to its theoretical justification [8,9].
When sample units are selected with unequal probabilities, it is advisable to weight the analysis. The State of São Paulo is an example: it has one city with more than 12 million inhabitants and another with fewer than 3 thousand. One way to make the occurrence of a health condition comparable among locations is to assign weights to these samples [10,11]. However, the literature contains many studies that use regression to assess associations between variables without weighting their data for the complex sample.
The literature provides clear evidence of the importance of weighting data from complex samples. However, many researchers do not understand, or are not concerned with, its impact on the estimates (mean, standard deviation, and confidence interval) and on the significance of logistic regressions.
Therefore, the objective of the present study was to illustrate the influence of weighting on population estimates and on the statistical significance of the variables analyzed in logistic regressions, using the database of the epidemiological survey on oral health of the State of São Paulo for the adult population aged 35 to 44 years.

Study Design
This study is a statistical trial that compares estimates obtained from weighted and unweighted samples. To this end, secondary data from the State Oral Health Survey (SB São Paulo Project, SBSP) for the adult population between 35 and 44 years old were used [11].
The SBSP2015 survey used the WHO methodology to estimate the prevalence of oral diseases (caries experience, use and need of prosthesis, periodontal condition, toothache, socioeconomic characteristics, and impact on oral health) for each of the six macroregions (domains) and for the State of São Paulo, considering the age groups under analysis. The State of São Paulo has 645 municipalities. In the first stage of the draw, 178 municipalities plus the state capital were selected as Primary Sampling Units (PSU), representing the six macroregions (São Paulo Capital, the Metropolitan Region of São Paulo, and the Regional Health Departments (RHD) 2 to 17, as shown in Figure 1).

Figure 1. Regional Health Departments of the State of São Paulo and the macro-regions.
In the second stage, 390 census sectors were drawn as Secondary Sampling Units (SSU) (2 sectors for 178 municipalities and 36 sectors for the city of São Paulo). Given the characteristics of this research, only urban sectors were used. The sectors were drawn with probability proportional to the number of inhabitants in each of the municipalities. The depletion technique, with a minimum sample size for each SSU, was used: all households in the SSU were visited, and the individuals in the studied age group were examined.
The census sector is a spatial division of the territory defined by the IBGE, with approximately 300 households. With the sector map and the IBGE population projection for the age group in each drawn sector in hand, a team of "enrollers/scouts" visited all households in the census sectors, identifying all eligible residents (those in the index ages of the survey) and informing them of the survey. As each census sector had a distinct population density by age group, all occupied households were visited, and eligible residents in the age groups were registered on the enrollment form.
Subsequently, the teams of examiners and recorders traveled throughout the sector, inviting the registered residents to participate in the research (signing the informed consent form, then being examined and interviewed) until the sector was depleted and the expected minimum number was reached. If the minimum was not reached, the sector was covered again. All information on the residents examined or interviewed, as well as on those who were absent and those who refused to participate, was used to weight the results. From it, the density rate by census sector and the non-response rate were calculated. This information is fundamental for the correction analysis, because it generates the weights within the sector, the municipality and the macroregion.
For the calculation of the sample of adults (age range 35-44 years) and adjustment of the sample to the size of the population, the parameters of the SB Brazil 2010 survey for the southeast region were used [11].
A minimum sample size was set for each of the six macroregions (domains): n = 687 for dental caries. However, as other conditions were also investigated (periodontal disease, use and need of prosthesis), the minimum sample size per macroregion was 951 people. Divided by 30, taken as the minimum number of PSUs per macroregion, this gave a minimum of 32 people for each examiner per PSU. At the end of the survey, 6,051 adults had been examined.
For the respective primary and secondary sampling units, information on the oral health conditions and on socioeconomic characteristics was included in the original database. Each of the six macroregions of the State of São Paulo was treated as a domain, within which the municipalities were drawn. Each municipality (PSU) was thus drawn within its domain (macroregion) with probability proportional to its population size (PPS). In the second stage, census sectors were drawn within each municipality, so the census sector is the secondary sampling unit (SSU).
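As a rough illustration of PPS selection, the sketch below computes first-stage inclusion probabilities proportional to population size and the corresponding base weights (their inverse). The function name, the population figures and the number of draws are hypothetical, and treating very large municipalities as certainty units is an assumption of this sketch, not a description of the SBSP2015 procedure.

```python
def pps_inclusion_probabilities(sizes, n_draws):
    """Inclusion probability of each unit when n_draws units are drawn with PPS.

    Units whose proportional share would reach 1 are treated as certainty
    units (probability 1), and the remaining draws are shared among the rest.
    """
    sizes = [float(s) for s in sizes]
    certain = set()
    while True:
        remaining = n_draws - len(certain)
        total = sum(s for i, s in enumerate(sizes) if i not in certain)
        newly_certain = {
            i for i, s in enumerate(sizes)
            if i not in certain and remaining * s / total >= 1.0
        }
        if not newly_certain:
            break
        certain |= newly_certain
    return [
        1.0 if i in certain else remaining * s / total
        for i, s in enumerate(sizes)
    ]


# Hypothetical populations: one very large municipality and three smaller ones.
populations = [12_000_000, 3_000, 500_000, 200_000]
probabilities = pps_inclusion_probabilities(populations, n_draws=2)
base_weights = [1.0 / p for p in probabilities]  # design weight = 1 / probability
```

In this toy example the largest municipality is selected with certainty and receives a base weight of 1, while the smallest receives the largest weight, which is why ignoring the weights gives small municipalities too little influence on the state-level estimate.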
It was not possible to carry out examinations in all selected municipalities (PSU) in their respective domains, or in some census sectors (SSU), so the number of examinees was below the total calculated for the sample, which had been anticipated. This causes an imbalance in a sampling process that uses PPS as a reference. Thus, the non-response rate was calculated by comparing the municipalities that participated in the research with the municipalities drawn.
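One common way to correct for this kind of imbalance is to inflate the weights of the participating units by the inverse of the weighted response rate within each domain. The sketch below illustrates that idea only; the function name and the numbers are hypothetical, and the source does not state the exact formula used in SBSP2015.

```python
def nonresponse_adjusted_weights(base_weights, responded):
    """Scale respondents' weights so they also represent non-respondents.

    base_weights: design weights of all drawn units in one domain.
    responded:    booleans, True where the unit took part in the survey.
    """
    total_drawn = sum(base_weights)
    total_responding = sum(w for w, r in zip(base_weights, responded) if r)
    factor = total_drawn / total_responding  # inverse of the weighted response rate
    return [w * factor if r else 0.0 for w, r in zip(base_weights, responded)]


# Hypothetical domain: four municipalities drawn, one did not take part.
adjusted = nonresponse_adjusted_weights(
    base_weights=[10.0, 10.0, 20.0, 40.0],
    responded=[True, True, False, True],
)
```

After the adjustment, the weights of the three participating municipalities add up to the same total as the four that were drawn, so the domain keeps its intended share of the population.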
With the weights obtained for each PSU, the next step was to integrate the two databases, that is, the weight database and the database of examined individuals. Both were kept in the same Excel workbook, where the information was merged through a shared reference variable. The variable chosen was the name of the municipality (PSU). Using this variable, the spelling of the municipality names was corrected with the help of Excel's SEERRO function (IFERROR in English-language versions), which was used to combine the information from the two databases through their common reference variable.
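The same merge can be sketched outside Excel. The example below is only an illustration of joining the two databases on the municipality name after normalizing spelling variants; the record layout, field names and weights are hypothetical, and names that still fail to match are reported rather than silently dropped.

```python
import unicodedata


def normalize_name(name):
    """Strip accents, case and extra spaces so spelling variants match."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.upper().split())


def attach_weights(individuals, weights_by_municipality):
    """Add the PSU weight to each individual record; collect unmatched names."""
    lookup = {normalize_name(k): v for k, v in weights_by_municipality.items()}
    merged, unmatched = [], []
    for record in individuals:
        key = normalize_name(record["municipality"])
        if key in lookup:
            merged.append({**record, "weight": lookup[key]})
        else:
            unmatched.append(record["municipality"])
    return merged, unmatched


# Hypothetical records and weights.
weights = {"São Paulo": 1.0, "Ribeirão Preto": 3.5}
people = [
    {"id": 1, "municipality": "SAO PAULO", "dmft": 14},
    {"id": 2, "municipality": " ribeirao  preto ", "dmft": 17},
    {"id": 3, "municipality": "Campinas", "dmft": 9},
]
merged, unmatched = attach_weights(people, weights)
```

Listing the unmatched names plays the role of the error check in the spreadsheet: any municipality that appears there needs its spelling reviewed before the analysis.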

Data Analysis
Estimates of means, standard errors of the mean and confidence intervals were calculated for the DMFT index (decayed, missing, and filled teeth) in two ways: considering the complex design (with weighting) and treating the data as a simple random sample (without weighting).
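The two sets of estimates can be sketched as follows. This is a simplified illustration, not the SAS procedure used in the study: it assumes a single stratum, treats PSUs as drawn with replacement, and uses a Taylor-linearized between-PSU variance. All function names and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def srs_mean_and_se(y):
    """Unweighted mean and its standard error, as under simple random sampling."""
    n = len(y)
    mean = sum(y) / n
    variance = sum((v - mean) ** 2 for v in y) / (n - 1)
    return mean, sqrt(variance / n)


def weighted_mean_and_se(y, w, psu):
    """Weighted mean with a linearized, between-PSU standard error."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Linearized score of each observation, summed within its PSU.
    psu_totals = defaultdict(float)
    for yi, wi, c in zip(y, w, psu):
        psu_totals[c] += wi * (yi - mean) / total_w
    k = len(psu_totals)
    # The PSU totals sum to zero, so no further centering is needed.
    variance = k / (k - 1) * sum(t ** 2 for t in psu_totals.values())
    return mean, sqrt(variance)


def confidence_interval(mean, se, z=1.96):
    """Normal-approximation 95% interval."""
    return mean - z * se, mean + z * se


# Hypothetical DMFT values from two PSUs, with equal weights for simplicity.
dmft = [1, 2, 3, 4]
weights = [1.0, 1.0, 1.0, 1.0]
psu = ["a", "a", "b", "b"]
mean_srs, se_srs = srs_mean_and_se(dmft)
mean_w, se_w = weighted_mean_and_se(dmft, weights, psu)
```

Even with equal weights, the design-based standard error in this toy example is larger than the simple random sampling one, because values are more alike within each PSU than between PSUs.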
Then, the DMFT index was dichotomized at the median, and its associations with the independent variables sex, ethnicity, income, years of schooling, and visits to the dentist were analyzed. For that, simple and multiple logistic regression models were estimated with and without considering the complex design. For the analysis considering the complex design, the weights were calculated from the sampling plan and used to adjust the estimates according to the distribution within the regions.
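For a single binary predictor, the odds ratio from a simple logistic regression equals the cross-product ratio of the 2x2 table, and the weighted version uses sums of weights instead of counts. The sketch below uses that shortcut to show how weights can change a crude association. The data, the weights and the choice of coding values above the median as 1 are assumptions for illustration; design-based standard errors are not computed here.

```python
from statistics import median


def dichotomize_at_median(values):
    """Code each value as 1 if it is above the median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def crude_odds_ratio(outcome, exposure, weights=None):
    """Cross-product odds ratio from a 2x2 table of (optionally weighted) counts."""
    if weights is None:
        weights = [1.0] * len(outcome)
    cells = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for y, x, w in zip(outcome, exposure, weights):
        cells[(x, y)] += w
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


# Hypothetical data: DMFT values, a binary exposure and sampling weights.
dmft = [2, 5, 8, 12, 15, 20, 3, 18]
exposure = [0, 0, 1, 1, 1, 1, 0, 0]
weights = [1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 1.0, 4.0]

high_dmft = dichotomize_at_median(dmft)
or_unweighted = crude_odds_ratio(high_dmft, exposure)
or_weighted = crude_odds_ratio(high_dmft, exposure, weights)
```

In this constructed example the unweighted odds ratio is above 1 and the weighted one is below 1, an exaggerated version of how weighting can change which variables appear associated with the outcome.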
Variables with p<0.20 in the individual analyses were tested in the multiple regression models, and those with p≤0.05 remained in the models. The quality of fit was assessed by the Akaike information criterion (AIC) and -2 Log L. The independent variables were chosen to reproduce other regression models that seek to estimate the effect of socio-demographic variables and the use of dental services on caries experience. All analyses were performed using SAS.

Results
Table 1 shows the means, standard errors of the mean and confidence intervals (CI) for the DMFT index and its components, estimated with and without weighting. The estimates of the DMFT index and of the decayed component varied little whether or not the design was considered (1.1% and 2.0%, respectively). However, the missing and filled components showed greater differences between the means: an increase of 6.1% was observed for the missing component and a reduction of 6.7% for the filled component. The standard error was also underestimated when weighting was not considered. Consequently, the confidence intervals of the weighted estimates are wider, which can influence statistical significance.
Table 2 shows the results of the simple and multiple logistic regression analyses, with and without the weights. The weighted and unweighted analyses identified different variables as significant. The ethnicity variable was statistically significant in the weighted analysis (crude and adjusted) but not in the unweighted analysis. Conversely, the variable "time since the last consultation" was statistically significant in the unweighted analysis (crude and adjusted) but not in the weighted analysis.

Discussion
This study used the secondary database of SBSP2015 to exemplify the estimation of parameters with and without the attribution of weights (weighting). The weighted estimates of the mean, standard error of the mean and confidence interval differed from those obtained when weights were not assigned, which can change the significance of the variables. In this sense, failing to weight the analysis could lead to an underestimation of the standard error [12]. In addition, the efficiency of complex sampling can be compromised if unequal weights are not assigned to units with unequal probabilities of selection [13].
Compared with a simple random sample, a cluster sample has the advantage of a lower cost per element sampled, because less is spent on preparing registers and locating individuals. On the other hand, it implies greater complexity and difficulty in the statistical analysis, as well as an increase in the variance that is difficult to predict under a cluster design, with a consequent decrease in the precision of the study [14,15]. As noted, weighting the analysis produced means different from those obtained without weighting. Although weighting increased the standard error, possibly because the non-response rate was incorporated into the calculation of the estimates, this procedure tends to reproduce the estimates from complex samples more accurately. If a complex sample is analyzed as if it were a simple random sample (SRS), the magnitude of the variability will be underestimated, which results in a loss of efficiency in the analysis [16].
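The size of this underestimation is often summarized by the design effect, the ratio between the variance under the complex design and the variance under simple random sampling. The sketch below uses Kish's approximation for equal-sized clusters, which is not taken from the source; the cluster size of 32 echoes the minimum per PSU mentioned in the methods, and the intracluster correlation is purely hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish approximation for equal-sized clusters: deff = 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, deff):
    """Size of a simple random sample with the same precision."""
    return n / deff


# Hypothetical values: 32 people per cluster, intracluster correlation of 0.05.
deff = design_effect(cluster_size=32, icc=0.05)
n_eff = effective_sample_size(6051, deff)
se_inflation = deff ** 0.5  # factor by which the SRS standard error is too small
```

Under these assumed values, even a modest similarity among people in the same cluster more than doubles the variance, so a standard error computed as if the data were a simple random sample would be noticeably too small.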
Although the importance of weighting data from complex samples is well established, these procedures are not followed in many studies. The weighted estimates are naturally different; the question is whether weighting also affects the measures of association and impact estimated by logistic regression. As observed in this study, weighting resulted in multiple models that differed from those obtained without weighting.
The present study found that the unweighted analyses produced a smaller standard error of the mean than the weighted analyses, so the confidence interval was underestimated, which is expected given the characteristics of the sample. The unweighted analysis, however, yields estimates that are incompatible with the sample design of the study. Therefore, the incorporation of weights is strongly recommended. Some authors cite the increase in the standard error as a disadvantage of weighting; however, they also argue that estimates obtained from unweighted samples are not valid for the population [17,18].
When data from a population-based survey are analyzed without weighting, equiprobability is assumed among the individuals examined or interviewed. The use of weights corrects for the probability of selection, since the organization in clusters leads to unequal selection fractions between and within clusters, and weighting makes the final result less prone to distortion [19,20]. However, compared with unweighted analyses, the variance and the confidence intervals increase, as individuals within clusters tend to be more similar to each other than to the population as a whole [21].
Greenland suggests that changes of 10% or more could be considered a criterion for identifying bias. Although we did not find differences of more than 10% in the means, they are clearly present in the SE, in some cases reaching 100% (DMFT), which consequently widens the CI [22].
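The comparison against the 10% criterion can be sketched as a relative change between the weighted and unweighted estimates. The source does not state which estimate was used as the denominator, so taking the unweighted value as the reference is an assumption here, and the numbers below are hypothetical rather than the values in Table 1.

```python
def relative_change(weighted, unweighted):
    """Percentage change of the weighted estimate relative to the unweighted one."""
    return (weighted - unweighted) / unweighted * 100.0


def exceeds_threshold(weighted, unweighted, threshold=10.0):
    """True when the absolute relative change reaches the threshold (in %)."""
    return abs(relative_change(weighted, unweighted)) >= threshold


# Hypothetical estimates, not the values reported in Table 1.
mean_change = relative_change(weighted=16.2, unweighted=16.0)  # small shift in the mean
se_change = relative_change(weighted=0.20, unweighted=0.10)    # standard error doubles
```

With these illustrative numbers the mean stays well under the 10% threshold while the standard error exceeds it, which mirrors the pattern described above.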
Although the need to weight complex samples seems obvious and is a settled point in statistical design, it is not always observed in scientific publications, even in high-impact journals. Our aim was to show how much the means, SE, CI and statistical inference can be modified by weighting. It seems clear that they can, but we found very limited literature on the subject, and many researchers still have doubts about this need, which requires some knowledge of statistics and attention to complex sample designs. A limitation of this study is that it uses a single sample. However, this sample contains more than 6,000 individuals, and smaller samples would make it harder to verify the magnitude of the differences in the mean, SE and CI.
This finding brings us back to the discussion of weighting, as it controls possible measurement biases, whereas unweighted samples yield underestimated standard errors and confidence intervals, which affects the significance values, as observed in this trial. Weighting the data and using software that supports weights in the analyses makes it possible to correct the values, since the sample actually examined differs from the sample drawn, and the observed values differ from the true ones. Thus, it is recommended that non-response be corrected through the weights [23].
There are criticisms of the results of the nationwide SBBrasil 2003 survey, precisely because weights were not used in the data analysis. The draw was organized by clusters, but weights were not assigned in the analysis of the results, which distorted the estimates. It is emphasized that the greater the variability of the studied variable, the greater the bias of the sample estimate, so the lack of weights makes it difficult to estimate the population results [24].

Conclusion
Although the means varied by less than 10% between weighted and unweighted analyses, the standard errors, confidence intervals and statistical inferences were different, which can be considered a potential bias. Thus, it is recommended that weighting be applied in the analysis of population-based data collected by sampling with complex designs.