Factorial analysis of bootstrap variances of population genetic parameter estimates

We present an alternative way to assess the relative contribution to the total variance of the sources of variation due to populations (P), individuals within populations (I), the P*I interaction, and the error, for the bootstrap variances of the following parameter estimates: the total (F) and intrapopulation (f) fixation indices, and the divergence among populations (θ). Knowledge of this relative contribution is important for establishing sampling strategies in natural populations. To this end, the bootstrap method was used to resample populations and individuals simultaneously, considering different combinations of P and I. The procedure was repeated five times for each combination in each data set, so that five bootstrap variance estimates were obtained for each combination of P and I and for each parameter estimate. These variance estimates were submitted to an analysis of variance with a factorial structure, in which the sources of variation were P, I and P*I. The coefficient of determination (R²) was calculated for each source of variation. Sources of variation with greater R² are responsible for larger errors of the estimates. The method was efficient in answering the questions initially proposed, and the results indicated that there are no ideal sample sizes for a species, but rather for a specific data set, because each data set has its own particularities. However, for investigations of the genetic structure of natural populations using population parameters, the number of populations to be sampled is a critical factor. Thus, more effort should be made to increase the number of sampled populations than the number of individuals within populations. A sampling strategy is given as a guide for investigations of this kind when there is no previous knowledge about the genetic structure and the mating system of the populations.


Introduction
In nature, genetic variability is found among different hierarchical levels such as: populations within species, subpopulations within populations, individuals within subpopulations, genes within individuals, and so on.
There are many ways to analyze how this variability is distributed. One of these methodologies was proposed by Cockerham (1969, 1973) and Weir and Cockerham (1984), and is based on the analysis of variance of gene frequencies under a random model. Population parameters such as the total fixation index (F), the intrapopulation fixation index (f), and the divergence among populations or coancestry within populations (θ) are estimated when the hierarchical levels of populations, individuals within populations, and genes within individuals are considered. Other pertinent parameters can also be estimated from this analysis when other hierarchical levels are included (Weir, 1996).
When natural populations are studied, replications are not available, as populations and individuals are sampled under the conditions of the species' habitat. Therefore, errors of estimates cannot be obtained as in regular experimentation. However, with the computational development of recent years, resampling methods such as the jackknife and the bootstrap have frequently been applied to estimate such errors.
In general, when studies estimate the parameters F, f and θ, resampling is carried out considering only one level of variation, such as individuals (Van Dongen, 1995). There are also studies that include two levels of variation at the same time, such as the simultaneous resampling of individuals and populations (Petit and Pons, 1998; Carlini-Garcia et al., 2001, 2003), to obtain the bootstrap estimates of the parameters and their variances as a function of the sources of variation of individuals within populations and of populations, considered separately and jointly. This approach is attractive because, if the sources of variation of individuals within populations and of populations are shown to be additive when resampling is done separately, the contribution of each source to the total variance can easily be estimated. This can be very useful in establishing sampling strategies. Petit and Pons (1998), applying bootstrap resampling to estimate diversity parameters, verified that the variance resulting from the simultaneous resampling of individuals and populations contains the population variance component once and the intrapopulation variance component twice. Carlini-Garcia (2003) observed that the mean squares obtained from resampling individuals and populations separately, when added up, correspond to the mean square obtained from the simultaneous resampling of individuals and populations. This was verified empirically for estimates of the parameters F, f and θ, with real as well as simulated data. However, no explicit expressions of the bootstrap variance components for the various resampling levels considered were obtained.
The aims of this research were: i) to present an alternative way of calculating the relative contribution of the bootstrap variance estimates of the parameters F, f and θ, obtained for the sources of variation due to the number of individuals (I) within populations, the number of populations (P), the P*I interaction, and the error, in relation to the estimate of the corresponding total variance; ii) to help rationalize the sampling process in nature when the research is focused on the genetic structure of natural populations, as the relative contribution of the different sources of variation has a direct implication for this process.

Material
Five real data sets, obtained by Ciampi (1999), Reis et al. (2000), Seoane et al. (2000), Auler et al. (2002), and Telles et al. (2003), were considered. These authors studied the population genetic structure and/or the predominant reproductive system of tropical tree species. All the data sets had a hierarchical structure, so that variation was split into sources of variation due to populations, individuals within populations, and genes within individuals. Reis et al. (2000) studied eight Euterpe edulis (known as "palmiteiro") populations, with an average of 24.8 individuals sampled per population, and used seven isoenzyme loci. Seoane et al. (2000) studied four Esenbeckia leiocarpa (known as "guarantã") populations, with 22 individuals per population, and used five polymorphic loci. Four isoenzyme loci were considered by Telles et al. (2003) for studying six Annona crassiflora (known as "araticunzeiro") populations, with 30 individuals sampled per population. Ciampi (1999) used eight microsatellite loci to evaluate four Copaifera langsdorffii (known as "copaíba") populations, with 24 individuals sampled per population. Auler et al. (2002) used twelve polymorphic isoenzyme loci to evaluate nine Araucaria angustifolia (known as "pinheiro-do-Paraná") populations, with an average of 36.1 individuals sampled per population.

Methods
The analysis of variance of gene frequencies (Cockerham, 1969, 1973; Weir and Cockerham, 1984; Weir, 1996) was used to investigate population structure. For each data set, estimates of the total variance ($\hat{\sigma}_T^2$), the population variance ($\hat{\sigma}_P^2$), the variance of individuals within populations ($\hat{\sigma}_I^2$), and the variance of genes within individuals ($\hat{\sigma}_G^2$) were obtained, with $\hat{\sigma}_T^2 = \hat{\sigma}_P^2 + \hat{\sigma}_I^2 + \hat{\sigma}_G^2$. These variance estimates were used to estimate the total fixation index F (the correlation between alleles within individuals considering the entire set of populations, or the total coefficient of inbreeding: $\hat{F} = (\hat{\sigma}_P^2 + \hat{\sigma}_I^2)/\hat{\sigma}_T^2$); the intrapopulation fixation index f (the correlation between alleles within individuals and within populations, or the intrapopulation coefficient of inbreeding: $\hat{f} = \hat{\sigma}_I^2/(\hat{\sigma}_I^2 + \hat{\sigma}_G^2)$); and θ (the correlation between alleles of different individuals of the same population, or the coancestry between individuals of the same population; it is also a measure of the divergence among populations: $\hat{\theta} = \hat{\sigma}_P^2/\hat{\sigma}_T^2$). For each data set, a random model according to Weir (1996) was considered, assuming the existence of a reference population from which the studied populations originated by genetic drift, in the absence of selection. The loci were considered neutral, and the mathematical model used in the analysis of variance of gene frequencies had the following hierarchical levels: populations, individuals within populations, and genes within individuals.
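As an illustration of how the three parameters follow from the variance components, the sketch below assumes the usual Cockerham (1969, 1973) ratios, F = (σ²P + σ²I)/σ²T, f = σ²I/(σ²I + σ²G) and θ = σ²P/σ²T, with σ²T the sum of the three components. The function name and the numerical values are hypothetical and are not taken from any of the data sets.

```python
def fixation_indices(var_p, var_i, var_g):
    """Return (F, f, theta) from the variance components for populations,
    individuals within populations, and genes within individuals.

    Assumes the usual Cockerham ratios; the component values themselves
    would come from the analysis of variance of gene frequencies.
    """
    var_t = var_p + var_i + var_g      # total variance
    within = var_i + var_g             # variance within populations
    if var_t == 0 or within == 0:
        # A zero denominator leaves the ratio undefined.
        raise ZeroDivisionError("denominator is zero; estimate undefined")
    total_f = (var_p + var_i) / var_t  # total fixation index F
    intra_f = var_i / within           # intrapopulation fixation index f
    theta = var_p / var_t              # divergence among populations
    return total_f, intra_f, theta


# Hypothetical components, chosen only to show the arithmetic.
F, f, theta = fixation_indices(0.02, 0.01, 0.17)
print(F, f, theta)
```

Under these definitions the three quantities satisfy (1 - F) = (1 - f)(1 - θ), which is a convenient check on any implementation.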
The bootstrap procedure is a resampling technique used to obtain the estimate of a parameter, its standard error, the distribution of the estimates of the parameter, and its percentile confidence interval. Resampling is carried out with replacement (Efron and Tibshirani, 1993). For a given parameter, many bootstrap resamples (10,000, for example) are taken, each one furnishing one estimate of that parameter. The mean and the variance of these estimates are the bootstrap parameter estimate and variance, respectively. From the distribution of these estimates it is possible to obtain the percentile confidence interval. In this research, this procedure was used to determine the relative importance of the sources of variation due to the number of populations (P), the number of individuals (I), the interaction between the number of populations and the number of individuals (P*I), and the error, all in relation to the total variation. Resampling was applied varying the number of populations and individuals, and the population parameters were estimated for each combination of P and I. One thousand resamples were taken within each possible combination, which allowed the variance of each parameter estimate to be obtained. In each data set, these combinations ranged from two populations and two individuals per population to the maximum possible number of populations and the maximum possible number of individuals. In each case, resampling was carried out in two steps: first for populations and then for individuals within populations. This procedure was repeated five times for each combination, leading to five estimates of the parameter and of its bootstrap variance for each data set. For this purpose, the EGBV computer program (Coelho, 2000) was used.
For clarification, the data set obtained by Ciampi (1999) can be taken as an example. A total of four populations were investigated, with 24 individuals per population. With the number of populations ranging from 2 to 4 and the number of individuals per population from 2 to 24, this gave 3 x 23 = 69 combinations of these two factors. The estimated variances of the parameter estimates of each combination were submitted to an analysis of variance, considering its factorial structure.
As the partitioning of the sum of squares is not orthogonal, the mean square estimates are not independent. Nevertheless, according to Sokal and Rohlf (1995), it is possible to obtain the coefficient of determination (R²) for each source of variation simply by dividing the sum of squares (SS) of each factor, and also of the interaction between factors, by the total sum of squares (SS_Total). Sources of variation with higher R² values have more influence on the variance of the estimate of a given parameter. Therefore if, for instance, the source of variation due to the number of individuals has a larger R² value than that due to the number of populations, individuals should be given priority in sampling schemes for the species.
It is important to emphasize that, for a data set to be analyzed as proposed here, at least three populations and three individuals per population are necessary. With only one population it is impossible to estimate F and θ, so resampling starts at two populations; a data set with just two populations would therefore provide a single level of this factor, and the number of degrees of freedom for the factor in the analysis of variance under the factorial design would be zero. The same reasoning applies to the resampling of individuals.
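The two-step resampling described above can be sketched as follows. This is a minimal illustration, not the EGBV program: the data layout (one list of per-individual scores per population), the function names, and the placeholder statistic are all assumptions made for the example. In a real analysis the statistic would be an estimator of F, f or θ.

```python
import random
import statistics


def two_stage_resample(populations, n_pop, n_ind, rng):
    """Draw n_pop populations with replacement, then n_ind individuals
    with replacement within each drawn population."""
    chosen = [rng.choice(populations) for _ in range(n_pop)]
    return [[rng.choice(pop) for _ in range(n_ind)] for pop in chosen]


def bootstrap_variance(populations, n_pop, n_ind, statistic,
                       n_boot=1000, seed=0):
    """Return the bootstrap mean and variance of `statistic` for one
    combination of numbers of populations and individuals."""
    rng = random.Random(seed)
    estimates = [
        statistic(two_stage_resample(populations, n_pop, n_ind, rng))
        for _ in range(n_boot)
    ]
    return statistics.mean(estimates), statistics.variance(estimates)


def variance_of_population_means(sample):
    """Placeholder statistic (hypothetical): spread of population means.
    It stands in for a genuine estimator of F, f or theta."""
    return statistics.pvariance([statistics.mean(pop) for pop in sample])


# Toy data: three populations, allele counts (0, 1 or 2) per individual.
toy = [[0, 1, 2, 1], [2, 2, 1, 2], [0, 0, 1, 0]]
mean_est, var_est = bootstrap_variance(toy, n_pop=3, n_ind=4,
                                       statistic=variance_of_population_means)
print(mean_est, var_est)
```

Running `bootstrap_variance` five times with different seeds for each combination of `n_pop` and `n_ind` would give the replicated variance estimates that enter the factorial analysis.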
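The count of 69 combinations can be reproduced by listing every pair of sample sizes, from two populations and two individuals up to the maxima in the data set. The helper name below is hypothetical.

```python
from itertools import product


def sample_size_grid(max_pop, max_ind, min_pop=2, min_ind=2):
    """List every (number of populations, individuals per population)
    pair from the minima up to the maxima available in a data set."""
    return list(product(range(min_pop, max_pop + 1),
                        range(min_ind, max_ind + 1)))


# Ciampi (1999): four populations, 24 individuals per population.
grid = sample_size_grid(max_pop=4, max_ind=24)
print(len(grid))  # 3 levels of P times 23 levels of I = 69
```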
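The ratio SS/SS_Total for each source can be sketched as below. For simplicity, the example assumes a balanced layout (every P and I combination present with the same number of replicates), which makes the partition orthogonal. That is a simplifying assumption and differs from the non-orthogonal case described in the text. The function name and the toy values are hypothetical.

```python
from statistics import mean


def r_squared_by_source(cells):
    """cells maps (p, i) -> list of replicate bootstrap variances.
    Returns SS_source / SS_total for P, I, P*I and the error.
    Assumes a balanced two-way layout with replication."""
    p_levels = sorted({p for p, _ in cells})
    i_levels = sorted({i for _, i in cells})
    reps = len(next(iter(cells.values())))
    grand = mean(v for vals in cells.values() for v in vals)
    p_mean = {p: mean(v for i in i_levels for v in cells[(p, i)])
              for p in p_levels}
    i_mean = {i: mean(v for p in p_levels for v in cells[(p, i)])
              for i in i_levels}
    cell_mean = {key: mean(vals) for key, vals in cells.items()}
    ss = {
        "P": reps * len(i_levels)
             * sum((p_mean[p] - grand) ** 2 for p in p_levels),
        "I": reps * len(p_levels)
             * sum((i_mean[i] - grand) ** 2 for i in i_levels),
        "P*I": reps * sum(
            (cell_mean[(p, i)] - p_mean[p] - i_mean[i] + grand) ** 2
            for p in p_levels for i in i_levels),
        "Error": sum((v - cell_mean[key]) ** 2
                     for key, vals in cells.items() for v in vals),
    }
    total = sum(ss.values())  # equals SS_total in the balanced case
    if total == 0:
        raise ValueError("no variation in the data")
    return {source: value / total for source, value in ss.items()}


# Toy data in which the values depend mostly on the number of populations.
cells = {
    (p, i): [1.0 / p + 0.05 / i + 0.001 * rep for rep in range(5)]
    for p in (2, 3, 4)
    for i in (2, 3, 4, 5)
}
r2 = r_squared_by_source(cells)
print(r2)
```

In this toy example the source P has by far the largest R², which under the reasoning above would argue for giving priority to the number of populations.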
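The degrees-of-freedom argument can be made concrete with a small sketch. It assumes a complete factorial layout in which every combination is kept, so the figures may differ from a published table when some combinations were discarded. The function name is hypothetical.

```python
def factorial_degrees_of_freedom(n_p_levels, n_i_levels, n_reps):
    """Degrees of freedom for a complete two-factor layout with
    n_reps replicate variance estimates per combination."""
    n_cells = n_p_levels * n_i_levels
    return {
        "P": n_p_levels - 1,
        "I": n_i_levels - 1,
        "P*I": (n_p_levels - 1) * (n_i_levels - 1),
        "Error": n_cells * (n_reps - 1),
        "Total": n_cells * n_reps - 1,
    }


# Ciampi (1999): P = 2..4 (3 levels), I = 2..24 (23 levels), 5 repeats.
print(factorial_degrees_of_freedom(3, 23, 5))

# A data set with only two populations has a single level of P (P = 2),
# so the factor P has zero degrees of freedom.
print(factorial_degrees_of_freedom(1, 23, 5)["P"])
```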

Results and Discussion
Since the parameter estimates are ratios of variance components, some bootstrap outcomes had to be discarded because the denominator was equal to zero. For a given combination of numbers of populations and individuals, the variance of a parameter estimate was computed only when the fraction of discards was equal to or less than 10%. For the data sets provided by Auler et al. (2002) and Seoane et al. (2000), the minimum number of individuals considered was, therefore, five and six, respectively.
Although all species included in this study are predominantly panmictic, Carlini-Garcia et al. (2001) showed, based on these data sets, that they have distinct population structures. The populations studied by Reis et al. (2000) are non-inbred (f = 0) and showed small divergence (θ = 0); consequently, their total fixation index also did not differ statistically from zero (F = 0). Similar results were found with the data set of Seoane et al. (2000). Results obtained with the data set of Ciampi (1999) were also similar, except that the divergence among populations was significant (θ > 0). In the data set of Telles et al. (2003), the populations were considered non-inbred (f = 0) but, as the divergence among populations was significant (θ > 0) and relatively high, the total fixation index was different from zero (F > 0). On the other hand, the populations sampled by Auler et al. (2002) were inbred (f > 0) and divergent (θ > 0). As a consequence, the total fixation index was significantly greater than zero.
The results showed that for total inbreeding (F) the largest portion of the variance of the estimates was due to the source of variation of populations (Table 1). This was observed for all data sets, as can be seen from the corresponding R² values. Therefore, there is strong evidence that, for the estimation of F and for a given total number of individuals, priority should be given to a greater number of populations rather than to a greater number of individuals in a smaller number of populations.
In the case of $\hat{f}$, the sources of variation of populations and of individuals appeared to be equally important, and apparently there was no standard behavior determined by the population structure of the different data sets (Table 1). For the data sets of Reis et al. (2000) and of Seoane et al. (2000), for instance, all three parameters were considered not different from zero, as mentioned previously. However, for the Reis et al. (2000) data set, the results suggested that it would be more important to increase the number of sampled individuals, whereas for the Seoane et al. (2000) data set increasing the number of populations would be more important. For the data sets of Telles et al. (2003) and of Ciampi (1999), increasing the number of individuals to estimate f would be recommended, despite the relatively high interpopulation divergence observed for both data sets. In the Auler et al. (2002) data set, however, the number of populations is more important than the number of individuals for the estimation of all three parameters studied.
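The 10% rule can be sketched as a small filter applied to the bootstrap outcomes of one combination. Marking an undefined outcome (zero denominator) as `None` is an assumption of this example, as are the function name and the toy values.

```python
import statistics


def variance_if_reliable(estimates, max_discard_fraction=0.10):
    """Return the variance of the valid estimates, or None when more
    than max_discard_fraction of the outcomes had to be discarded.
    Undefined outcomes are assumed to be marked as None."""
    valid = [e for e in estimates if e is not None]
    discarded = len(estimates) - len(valid)
    if not estimates or discarded / len(estimates) > max_discard_fraction:
        return None  # too many discards: skip this combination
    return statistics.variance(valid)


# One discard in ten outcomes (10%) is accepted; two in ten are not.
accepted = variance_if_reliable(
    [0.10, 0.12, None, 0.09, 0.11, 0.10, 0.13, 0.08, 0.10, 0.11])
rejected = variance_if_reliable(
    [0.10, None, None, 0.09, 0.11, 0.10, 0.13, 0.08, 0.10, 0.11])
print(accepted, rejected)
```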
For the estimation of θ, similarly to what was found for f, there was no standard pattern relating the relative importance of the sources of variation (P, I and P*I) to the structure of the populations considered.

Table 1 - Analysis of variance of bootstrap variance estimates of (a) $\hat{F}$, (b) $\hat{f}$ and (c) $\hat{\theta}$, obtained from joint resampling of individuals (I) and populations (P), and the coefficient of determination (R²) of the sources of variation. Data sets provided by several authors.

In general, the P*I interactions showed the smallest coefficients of determination when compared to the main-effect sources of variation (Table 1). The R² values for this interaction ranked third in most cases, with some exceptions, as in the estimation of f in the species studied by Ciampi (1999) and of θ in the Reis et al. (2000) data set.
The results indicate that no specific combination of the numbers of populations and of individuals could be identified as adequate for the sampling process. As expected, the coefficients of determination obtained for the error source of variation were always very small when compared to those obtained for the other sources of variation (Table 1).
Considering the formulas that define the parameters F, f and θ, one could be led to conclude that, in a sampling scheme for the estimation of f, attention should be given only to the number of individuals, with less attention to the number of populations. The results showed that this is not necessarily true. Possibly because of heterogeneity of the $\sigma_I^2$ component across populations, sampling an adequate number of populations also affects the error of $\hat{f}$. Conversely, for $\hat{\theta}$, it is also not true that a reliable estimate of this measure of diversity requires only the sampling of an adequate number of populations. Considering that, for $\hat{F}$, both factors, P and I, were important, it can be concluded that, for obtaining reliable estimates of all three parameters, a proper balance between the number of populations and the number of individuals is a good strategy, giving priority to the number of populations. In the present study, the number of populations was smaller than ten in all cases. The numbers sampled, which ranged from four to nine, do not seem adequate for the estimation of these population parameters, especially if it is assumed that under natural conditions this number is potentially very large.
To complete this kind of investigation, the source of variation due to loci should be included, since it is known that, even for neutral loci, estimates can vary from locus to locus in finite populations (Coelho and Vencovsky, 2003).
The usual procedure for investigating the relative importance of different factors in the bootstrap variance is to resample a given factor while fixing the others (Carlini-Garcia et al., 2001). The total variance, due to all factors, can be estimated using a stepwise resampling scheme. This type of procedure is incomplete if the variances of a parameter estimate are estimated under a fixed sample size. The procedure used here considers varying sample sizes for the various factors and allows the importance of interaction sources of variation such as P*I to be studied. Also, increasing the number of levels of a given factor and studying the magnitude of the respective variance values can be useful for determining the number of populations and of individuals within populations needed to attain a desired level of precision in the estimation of a given parameter.
The results of this research indicate that the analysis applied was efficient in answering the questions on the relative contribution of the sources of variation due to populations (P), individuals within populations (I), the P*I interaction, and the error to the total variance for the parameters considered. It was possible to verify that ideal sample sizes do not exist for species in general, since each case has its own particularities. However, in investigations of the structure of natural populations that estimate population parameters, the sampling strategy should take into account that the number of populations to be sampled is a critical factor. More effort should be made to increase this number than to increase the number of individuals within populations. In general terms, determining adequate sampling procedures would require previous knowledge of the diversity among populations (θ) and of the intrapopulation fixation index (f). The same condition applies when a certain effective population size of the sample is desired (Vencovsky and Crossa, 2003). If a detailed investigation is to be carried out to study the genetic structure and the mating system of a given species, sampling can be done in two steps: 1) sampling at least 10 to 15 populations and 25 to 30 individuals per population; and 2) based on the results obtained, improving the initial sampling so as to attain adequate errors of the parameter estimates.