Sample design and participation in the Educatel Study

Educatel Brazil 2015/2016 was a cross-sectional study conducted by telephone interview with the aim of producing information on health and absenteeism in Brazilian schoolteachers. The nationally representative sampling plan was based on the simple stratified sampling method, with stratification defined to meet the analytical domains established for the study (five major geographic regions, two census areas, four age brackets, sex, three types of school administration, five types of teacher employment, and six grade levels) and selection by simple random sampling of teachers in each stratum. Teacher selection was based on the 2014 School Census conducted by the Brazilian National Institute for Educational Studies and Research “Anísio Teixeira”. Of the 2,229,269 teachers recorded in the Census, 13,243 were selected. A total of 119,378 telephone calls were made, identifying 7,642 eligible teachers (57.7% of the total initially sampled). A total of 6,510 interviews were finally completed, for a response rate of 85.2%. At the end of data collection, sample weights were assigned to each of the teachers interviewed. These weighting factors are connected not only to the Educatel sample design, but also to the adjustment terms for treatment of non-response during the data collection process.

Keywords: Random and Systematic Sampling; Surveys and Questionnaires; School Teachers; Education, Primary and Secondary

Correspondence: M. T. Vieira, Departamento de Estatística, Instituto de Ciências Exatas, Universidade Federal de Juiz de Fora. Rua José Lourenço Kelmer s/n, Juiz de Fora, MG 36036-900, Brasil. marcel.vieira@ice.ufjf.br

1 Instituto de Ciências Exatas, Universidade Federal de Juiz de Fora, Juiz de Fora, Brasil.
2 Escola de Enfermagem, Universidade Federal de Minas Gerais, Belo Horizonte, Brasil.
3 Faculdade de Medicina, Universidade Federal de Minas Gerais, Belo Horizonte, Brasil.

doi: 10.1590/0102-311X00167217. Cad. Saúde Pública 2019; 35 Sup 1:e00167217. Methodological Issues (Questões Metodológicas).

This article is published in Open Access under the Creative Commons Attribution license, which allows use, distribution, and reproduction in any medium, without restrictions, as long as the original work is correctly cited.


Introduction
Planning efficient probabilistic samples for survey populations has been discussed exhaustively in the scientific literature 1,2,3. Such planning aims to reduce survey costs and increase efficiency in data collection and in publication of the results. Meanwhile, participation rates in epidemiological studies have decreased over the years, with undesirable effects on the surveys' internal validity 4. Complex sampling designs, which combine different probabilistic methods for sample selection, are thus more advanced than simple random sampling (SRS). Complex sampling has been used increasingly in the health field, especially for large samples. As an example, the Brazilian National Health Survey (PNS in Portuguese) is a nationwide household survey aimed at studying the Brazilian population's health status and lifestyles, using three-stage cluster sampling (census tracts selected in the 1st stage, households within the tracts in the 2nd stage, and an adult among the household's residents in the 3rd stage) 5. The Brazilian National Household Sample Survey (PNAD), which also periodically collects data on the Brazilian population's health, uses a similar cluster sampling plan, sometimes with two selection stages and in other cases three stages 6. Meanwhile, the Risk and Protective Factors Surveillance System for Chronic Non-Communicable Diseases through Telephone Interview (VIGITEL), conducted by telephone interview in the country's 26 state capitals and the Federal District (aimed at monitoring the size and variation in the frequency of the main risk and protective factors for noncommunicable diseases), adopts a different strategy, with selection of landline telephone lines (using telephone lists furnished by the main telephone companies) followed by the selection of an adult among the residents in each contacted household 7.
The adoption of a complex sampling plan generally allows obtaining estimates with preestablished precision in population surveys, with the additional advantage of low cost and ease of collection when compared to studies using less sophisticated sampling designs 2. Therefore, many large-scale epidemiological studies in Brazil currently use such designs. By reducing the number of participants, they add the advantage of interrupting fewer people in their daily routines to participate in the survey. Fieldwork is known to become difficult when people invited to participate face time constraints or other limitations. Reducing the sample size thus offers advantages in the survey's logistics.
Educatel Brazil 2015/2016 was a nationally representative cross-sectional study conducted via telephone interviews with a probabilistic sample of the country's schoolteachers and data collection in the fourth quarter of 2015 and the first quarter of 2016. The objective was to offer the academic community and public administrators a source of data on this population 8, an unprecedented source for addressing health issues, prevalence of diseases and accidents, absenteeism, and related factors. Absenteeism was defined in the Educatel Study as at least one day of work missed in the previous 12 months.
This article presents the procedures adopted in the sample selection for Educatel, as well as the parameters for calculating the sample size. It also discusses the following: (a) the registry of teachers used in the sample selection; and (b) the survey's objectives defining the domains (or subgroups) for which estimates are wanted with controlled precision. Having completed the data collection, the study proceeded to an analysis of the sample's characteristics in order to identify the participation rate (of eligible teachers) by major geographic region, location of the school (rural versus urban), age bracket, sex, type of school administration, and type of teacher employment. Sample weights, adjustments for non-response, and estimation procedures are also addressed.
The Educatel Survey was approved by the Ethics Research Committee of the School of Medicine at Federal University of Minas Gerais (case review CAAE: 48129115.0.0000.5149).

Target population and sampling plan
The target population of Educatel consists of classroom teachers working in basic education in Brazil in 2015. The sampling procedures aimed to produce a nationwide probabilistic sample of Brazilian schoolteachers.
Considering the logistic complexity and high cost of conducting face-to-face interviews to obtain such a large nationally representative sample, the option was to use a computer-assisted telephone interview (CATI) 8.
The sampling plan's structure was based on simple stratified sampling. A predefined stratification strategy was used, aimed at covering the analytical domains (population subgroups) preestablished for the study, as described below. The sample planning considered the results of studies that focused on groups of teachers and the methodological experience from VIGITEL 7.
The success of surveys with probabilistic sampling depends on the prior identification of the survey population and analytical domains, in order to guarantee sufficient sample sizes to obtain estimates for each pertinent domain. Thus, in order to spread the sample and capture the survey population's heterogeneity, the population's stratification was defined according to a plan that combined the following categories of variables: (a) major geographic regions (North, Northeast, Central-West, Southeast, and South); (b) location of the school (urban/rural); (c) age brackets (≤ 34 years, 35-44 years, 45-54 years, and ≥ 55 years); (d) sex (male/female); (e) type of school administration (state, municipal, private, other); (f) type of employment (public admission/tenured/stable, temporary contract, private school system, covered by formal labor legislation, other); and (g) grade level (preschool, primary, middle, youth and adult, vocational, other). Importantly, stratification variables (c), (d), (f), and (g) reflect the teachers' characteristics, and stratification variables (a), (b), and (e) reflect the characteristics of the schools where the teachers work.
The number of possible strata based on the combination of all the categories of variables listed above is 9,600. However, only 7,650 of these possible strata actually existed in the survey population, that is, they included at least one teacher. The following strategy was thus adopted for the sample allocation in the existing strata: (a) in the 1,549 strata in which the population size was 1, the sample size was also set at 1, and (b) in the other existing strata the sample was allocated proportionally to the stratum's population size, always rounding off to the closest whole number, where the rounding meant that in another 5,504 strata the sample size was also set at 1. Importantly, in principle the estimation of the estimators' variances requires sample sizes greater than or equal to 2 in each stratum. Thus, the existence of strata with a sample size of one requires the adoption of estimation methods capable of dealing with this characteristic. Such methods are available in software programs like R (survey package), Stata (svy commands), and SAS, for example.
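The stratum counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and variable names are hypothetical, while the category counts and stratum totals are those listed in the text.

```python
from math import prod

# Number of categories per stratification variable (a) to (g), as listed in the text.
categories = {
    "region": 5,
    "location": 2,
    "age_bracket": 4,
    "sex": 2,
    "administration": 4,
    "employment": 5,
    "grade_level": 6,
}

# Every combination of categories defines one possible stratum.
possible_strata = prod(categories.values())

existing_strata = 7650   # strata containing at least one teacher
pop_size_one = 1549      # strata whose population size is 1
rounded_to_one = 5504    # further strata allocated a sample size of 1 after rounding

# Strata in which only one teacher was sampled.
single_unit_strata = pop_size_one + rounded_to_one
share_single = single_unit_strata / existing_strata

print(possible_strata)                # 9600
print(single_unit_strata)             # 7053
print(round(100 * share_single, 1))   # 92.2
```

Roughly nine in ten existing strata therefore contribute a single sampled teacher, which is why variance estimation needs a method that tolerates single-unit strata.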
It is also necessary to distinguish between strata (used for sample selection and spreading) and the analytical domains (used for subsequent estimation) preestablished for the study, aimed at guaranteeing acceptable precision levels. For the Educatel Study, the analytical domains were defined as categories (marginals) of the stratification variables (and not all the combinations of all the categories). For each of these analytical domains, after losses to the final sample, the sample allocation in the existing strata as described above resulted in prevalence estimates for absenteeism with the following maximum margins of error: (a) 3% for the categories of major geographic regions; (b) 2.5% for the census area categories (urban/rural); (c) 3.5% for the age bracket categories; (d) 2% for the sex categories; (e) 4% for the types of school administration; (f) 4% for the teacher employment categories; and (g) 4.5% for the school grade level categories.
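To give a sense of scale for these margins, the sketch below computes the approximate number of completed interviews that a single domain category would need under plain SRS to reach a given margin of error at 95% confidence. This is not the procedure the authors describe: it assumes the planning prevalence of 0.38 used in the study's sample size calculation, ignores the finite population correction, and uses hypothetical function and variable names.

```python
import math

def domain_sample_size(margin, P=0.38, z=1.96):
    """Approximate SRS sample size for estimating a proportion P within +/- margin."""
    return math.ceil(z ** 2 * P * (1 - P) / margin ** 2)

# Maximum margins of error reported for each family of analytical domains.
margins = {
    "region": 0.03,
    "census_area": 0.025,
    "age_bracket": 0.035,
    "sex": 0.02,
    "administration": 0.04,
    "employment": 0.04,
    "grade_level": 0.045,
}

sizes = {domain: domain_sample_size(b) for domain, b in margins.items()}
for domain, n in sizes.items():
    print(f"{domain}: about {n} interviews per category")
```

Under these assumptions, a 3% margin corresponds to roughly 1,000 interviews in a category and a 2% margin to roughly 2,300.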
The basic registry for the sample selection of teachers for the Educatel Study was the database from the 2014 School Census conducted by the Brazilian National Institute for Educational Studies and Research "Anísio Teixeira" (INEP), which reports the schools' telephone numbers. The registry provided information on the survey population (2,229,269 teachers), including an identification variable and others that allowed constructing the above-mentioned stratification variables: sex, age, location, grade level, administrative system, and census tract. Variables for characterizing the schools were also obtained, eliminating the need to measure them again in Educatel, such as: size of the school according to number of teachers, access to running water, availability of filtered or safe drinking water, electricity, sewage disposal, garbage disposal, and school equipment and installations 8.
Data from the 2014 School Census provided the most current database on the study population available at the time of the Educatel sample selection, allowing a good estimation of the universe of teachers working in 2015 (when the data were collected). The survey population for the Educatel Study thus consisted of teachers recorded in the 2014 School Census registry and that were working in classrooms in the same school in 2015. The survey population is thus a subset of the predefined target population. This distinction is necessary, since all the statistical inferences from the database collected by the Educatel Study thus refer to the survey population defined according to the registry. The survey population was also limited to teachers that answered at least one of the first 15 attempted telephone contacts (made on various days and at various times, including Saturdays and evening hours) and to those working in schools with functional telephones.
Even so, given the time elapsed between the 2014 School Census and its actual use in Educatel (just over a year), some discrepancies were found between the list of schools, the selected teachers, and the actual faculty identified by telephone call. Deaths, retirements, and labor market turnover were all more frequent than expected. The microdata from each Census are only made available in the semester following the reference year for the data collection, which prevented using the 2015 School Census (only published when the Educatel data collection was already under way). In short, the list that was produced of eligible participants for the study included teachers that no longer belonged to the schools and thus no longer belonged to the study's target population. Hereinafter, this situation will be called "registry problem (1)". In addition, teachers that entered the teaching market after 2014, that is, after that year's School Census was performed, were obviously not in the registry that was used. The survey team thus faced a problem with the registry's coverage. Hereinafter, this situation will be called "registry problem (2)". In order to measure the size of problems (1) and (2) and thereby establish strategies to compensate for possible biases, data from the 2013 School Census were consulted for clarifications when appropriate.
Comparing the teacher registries from the 2013 School Census with the 2014 Census, an estimated 16.62% of teachers in the registry could be classified as registry problem (1), while an estimated 24.28% were new teachers that would not have been included in the registry, that is, classified as registry problem (2).
We opted to use stratified sampling because it allowed classification of the survey population in strata according to its known characteristics. The selection of teachers belonging to different strata was done independently (between strata), considering SRS without replacement, thus resulting in the adoption of the simple stratified sampling method. This strategy allows greater homogeneity of the subgroups defined by stratification than would be found if the entire population were considered. The adoption of sampling designs that involve stratification can result in increased precision in the estimates when compared to SRS with the same sample size, in addition to allowing estimations for both the whole population and the subgroups. In addition, the sampling error is reduced, since the more homogeneous the subgroups are in relation to their components' characteristics, the greater the efficiency of the sampling procedure. In short, the sample selection method in Educatel was designed and conducted in this way.
An alternative sample design was also considered that would have involved the selection of defined clusters such as schools. However, the decision was made to use direct selection of teachers (despite the challenges of using a partially outdated teacher registry) to avoid the cluster effects that might have resulted in substantially increasing the sample size. Design effects in cluster designs can frequently reach 4 or more, which could result in increasing the sample size by 4 times or more.
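The cost of clustering can be made concrete with a one-line calculation. The sketch below is a rough illustration, with a hypothetical function name: it scales a reference sample size by a design effect, using the 13,243 teachers selected in Educatel and the design effect of 4 mentioned above.

```python
def clustered_sample_size(n_reference, deff):
    """Sample size needed under a design with design effect `deff`
    to match the precision of a reference sample of size n_reference."""
    return round(n_reference * deff)

# Educatel selected 13,243 teachers; a design effect of 4 would quadruple that.
print(clustered_sample_size(13_243, 4))  # 52972
```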

Calculation of the sample size
The central target parameter for Educatel, namely prevalence of absenteeism due to illness, oriented the fundamental sampling definitions, given the need for consistency between the sampling plan, the use of estimators, and the nature of what was being measured. Based on the survey problem, the target population, and the state-of-the-art knowledge on illness in teachers, the following definitions were elaborated for calculating the sample size: (i) 95% confidence level; (ii) 38% prevalence (P) of at least one work absence 9; (iii) predicted 0.99% maximum error (B) (margin of error) for estimated prevalence of absenteeism for the entire Brazilian teacher population, as defined in (ii); (iv) maximum 20% non-interview rate (tx1) due to refusal (or other forms of non-response); (v) maximum 20% lack of application of the questionnaire (tx2) due to registry problem (1); and (vi) correction for finite populations. The previously reported margin of error was defined by the survey coordinators, based on such aspects as budget, data collection logistics, and timetable.
Initial calculation of the sample size was based on SRS, followed by incorporation of the design effect ("EPA" in the equation below). The Kish design effect or deff is a measure of the effect of the sampling design on the estimators' variance 10. Design effect is estimated by the ratio between the variance obtained with the sampling plan actually used and the variance obtained with SRS. Due to the lack of surveys of Brazilian schoolteachers using similar sample designs, the study considered a theoretical design effect of 1 as a procedure backed by the specialized literature 11. These premises served as the basis for the assumption that there would be no loss of sampling efficiency by adopting stratified sampling when compared to SRS. The following expression was thus used to calculate the sample size for the Educatel Study:

$$n = n_0 \times \mathrm{EPA} \times \mathrm{tx1} \times \mathrm{tx2}, \quad \text{with} \quad n_0 = \frac{N \, z_{\alpha/2}^{2} \, P \, Q}{N B^{2} + z_{\alpha/2}^{2} \, P \, Q}.$$
Where: EPA = the design effect, set equal to 1; tx1 = 1.20; tx2 = 1.20; N = the survey population size, equal to 2,229,269 teachers; P = 0.38; Q = (1 - 0.38) = 0.62; B = 0.0099 (that is, 0.99%); and z_{α/2} = 1.96, for 95% confidence. The sample size calculated with this procedure was 13,243 teachers. This sample size already incorporated adjustments aimed to compensate for possible losses, via tx1 and tx2. It was necessary to provide for a much larger number in the initial selection, considering the preliminary results of a pilot study aimed at examining the consistency of the list prepared from the microdata from the 2014 School Census conducted by INEP. Therefore, 13,243 teachers from 11,042 schools were selected in order to increase the margins for obtaining the minimum number of interviews needed to ensure the survey's success.
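The reported sample size can be reproduced from the parameters listed above. The sketch below is a minimal check, not the authors' code: it assumes the standard SRS formula for a proportion with a finite population correction, and the function and variable names are hypothetical.

```python
def srs_sample_size(N, P, B, z=1.96):
    """SRS sample size for a proportion, with finite population correction."""
    n_inf = z ** 2 * P * (1 - P) / B ** 2   # size for an infinite population
    return n_inf / (1 + n_inf / N)          # finite population correction

N = 2_229_269      # teachers in the 2014 School Census registry
P = 0.38           # assumed prevalence of absenteeism
B = 0.0099         # maximum margin of error (0.99%)
EPA = 1.0          # design effect
tx1 = 1.20         # inflation for non-response
tx2 = 1.20         # inflation for registry problem (1)

n_srs = srs_sample_size(N, P, B)
n_final = n_srs * EPA * tx1 * tx2

print(round(n_final))  # 13243
```

Note that B must be entered as a proportion (0.0099), on the same scale as P; entering 0.99 would yield a sample size of about one.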
In keeping with the above-mentioned definition of the survey population, teachers were considered non-eligible for the survey if, at the time of the telephone contact, they were no longer working in the school identified in the 2014 School Census (during the sample selection), they failed to respond to 15 attempted telephone contacts (on different days and at different times, including Saturdays and evening hours), or they worked in schools without a telephone or in which the telephone number in the registry was non-functional.

Participation in the Educatel Study
A total of 119,378 telephone calls were made, which allowed identifying 7,642 eligible teachers (57.7% of the initially selected total). In the end, 6,510 interviews were completed, for a response rate of 85.2%. Based on the actual sample size of 6,510 teachers interviewed, the resulting margin of error was estimated at 1.18%. It was necessary to make an average of 19 calls per completed interview, and the mean interview time was 12 minutes. The survey's performance, measured by the response rate in each of the categories of variables used in the sample stratification, can be assessed from the data shown in Table 1.
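The participation figures follow directly from the counts reported above. The sketch below recomputes them; the margin of error line assumes the SRS formula with P = 0.38 and no finite population correction, which is an assumption that happens to reproduce the reported 1.18%. Variable names are hypothetical.

```python
import math

selected = 13_243    # teachers initially sampled
eligible = 7_642     # teachers identified as eligible
completed = 6_510    # interviews completed

eligibility_rate = eligible / selected
response_rate = completed / eligible

# Achieved margin of error at 95% confidence, assuming SRS and P = 0.38.
P = 0.38
margin = 1.96 * math.sqrt(P * (1 - P) / completed)

print(round(100 * eligibility_rate, 1))  # 57.7
print(round(100 * response_rate, 1))     # 85.2
print(round(100 * margin, 2))            # 1.18
```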

Sample weights and sample expansion
The characteristics of the actual sample were carefully assessed after concluding the data collection. The sample weights as defined are connected not only to the Educatel sample design but also to the adjustment terms for treating non-response cases during the data collection process, as follows:

$$\pi_{hi} = \frac{n_h}{N_h}$$

is the probability of selecting teacher i belonging to stratum h, where n_h and N_h are the sample size and the population size of stratum h; and

$$w_{hi} = \frac{1}{\pi_{hi}} = \frac{N_h}{n_h}$$

is the basic sample weight of teacher i belonging to stratum h. Simply stated, the basic sample weight only reflects the main aspects of the sample design, i.e., the inverse of the probabilities of the teachers' selection. However, non-response is common in large-scale surveys and was also seen in Educatel. To treat such cases, the survey assumed a missing at random (MAR) mechanism 12, with the non-response pattern depending on the following stratification variables: (a) major geographic regions, (b) location (urban versus rural), and (d) sex, as follows:

$$\hat{\phi}_c = \frac{n'_c}{n_c}$$

is the response propensity score in groups c formed by the combination of the categories of stratification variables (a), (b), and (d), where n'_c is the actual sample size in group c and n_c is the selected sample size in that group; and

$$w^{*}_{hi} = \frac{w_{hi}}{\hat{\phi}_c}$$

is the sampling weight adjusted for non-response for teacher i in stratum h belonging to group c.
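The two-step weighting can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the response propensity is taken as the unweighted share of respondents among sampled teachers in each adjustment group, and the data, group labels, and function names are all hypothetical.

```python
from collections import defaultdict

def base_weight(N_h, n_h):
    """Inverse selection probability under SRS without replacement in stratum h."""
    return N_h / n_h

def nonresponse_adjusted_weights(sample):
    """Divide each respondent's base weight by the response rate of its group.

    `sample` is a list of dicts with keys 'group', 'base_weight', 'responded'.
    Returns the adjusted weights of respondents, in sample order.
    """
    selected = defaultdict(int)
    responded = defaultdict(int)
    for teacher in sample:
        selected[teacher["group"]] += 1
        responded[teacher["group"]] += int(teacher["responded"])
    propensity = {c: responded[c] / selected[c] for c in selected}
    return [t["base_weight"] / propensity[t["group"]]
            for t in sample if t["responded"]]

# Toy data: group A has 4 sampled teachers (3 respond), group B has 2 (1 responds).
sample = (
    [{"group": "A", "base_weight": base_weight(400, 4), "responded": r}
     for r in (True, True, True, False)]
    + [{"group": "B", "base_weight": base_weight(100, 2), "responded": r}
       for r in (True, False)]
)

weights = nonresponse_adjusted_weights(sample)
print(weights)
```

In this toy example the adjusted weights of the respondents add up to the total of the base weights of all sampled teachers (500), so respondents stand in for the non-respondents of their own group.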
These sample weights should be used in the estimation of any target descriptive measures calculated from data in the Educatel sample. Their use accounts for the various aspects of the selection scheme, including stratification, and allows adjustment for non-response effects. Failure to incorporate the sample weights in the analysis can result in biased estimates. Estimates that incorporate the sample weights can be produced using the R software (survey package), SUDAAN, SAS (SURVEY procedures), Stata (survey module), or SPSS (Complex Samples package), for example.
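The effect of ignoring the weights can be seen in a small example. The sketch below computes a weighted prevalence as the ratio of the weighted sum of a binary outcome to the sum of the weights; the outcome values and weights are invented for illustration, and the function name is hypothetical.

```python
def weighted_prevalence(outcomes, weights):
    """Weighted proportion: sum(w * y) / sum(w), with y coded 0 or 1."""
    total_weight = sum(weights)
    return sum(w * y for y, w in zip(outcomes, weights)) / total_weight

# Hypothetical data: 1 = at least one absence in the previous 12 months.
outcomes = [1, 0, 0, 1]
weights = [100, 300, 300, 100]

unweighted = sum(outcomes) / len(outcomes)
weighted = weighted_prevalence(outcomes, weights)

print(unweighted)  # 0.5
print(weighted)    # 0.25
```

Here the two teachers reporting absences represent fewer people in the population than the other two, so the unweighted proportion overstates the prevalence.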
Meanwhile, estimation of a survey's precision is an important tool for evaluating the results' quality. Measures of precision include the coefficient of variation and confidence intervals, which are based on the estimated standard errors. Thus, analysis of the Educatel database with calculation of the estimated standard errors and proportions should adopt the procedures already implemented in the above-mentioned statistical packages. Such procedures produce a reasonable approximation of the true precision estimates, which could be obtained if the adopted selection scheme were considered in full 11.
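Once a package has returned an estimate and its design-based standard error, the two precision measures mentioned above are simple to derive. The sketch below uses hypothetical numbers (a prevalence of 0.38 with a standard error of 0.006) and hypothetical function names; it does not estimate the standard error itself, which should come from survey software that accounts for the design.

```python
def coefficient_of_variation(estimate, standard_error):
    """Relative standard error of an estimate."""
    return standard_error / estimate

def confidence_interval(estimate, standard_error, z=1.96):
    """Normal-approximation confidence interval (95% by default)."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width

estimate, standard_error = 0.38, 0.006   # hypothetical values

cv = coefficient_of_variation(estimate, standard_error)
lower, upper = confidence_interval(estimate, standard_error)

print(f"CV = {100 * cv:.1f}%")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```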

Final remarks
The main difficulty for members of the academic community and administrators in understanding and using survey data lies in their limited exposure to the design's characteristics. This article seeks to attenuate this difficulty by presenting a comprehensive explanation of the sample design and participation by teachers in the Educatel Study, the broadest survey ever conducted on health and absenteeism among Brazilian schoolteachers.
Since SRS was not feasible in a large-scale survey with the characteristics of Educatel, it was necessary to find a sampling method that allowed representing the various intended population domains and their estimates with preestablished precision levels. The probabilistic sample design adopted here was simple stratified sampling. This strategy allowed greater homogeneity in the subgroups defined in the stratification, favoring the estimates' precision, reduction of sampling error, and estimation both for the population as a whole and its subgroups. Finally, sample weights adjusted for correction of teachers' non-response were calculated and discussed.
The adoption of an appropriate statistical method for the analysis of complex sample data, already available in various software packages, guarantees the production of valid and precise statistical inferences on absenteeism in Brazilian schoolteachers and for the specific preestablished analytical domains in the Educatel sample data.

Table 1
Number of sampled schoolteachers, number of eligible schoolteachers, interviews performed, and success and refusal rates, according to geographic region, census area, age bracket, sex, school administration, teacher employment status, and grade level. Educatel Brazil, 2015/2016.