Factorial analysis of bootstrap variances of population genetic parameter estimates

Abstract

We present an alternative way to assess the relative contributions, to the total variance, of the sources of variation due to populations (P), individuals within populations (I), the P*I interaction, and the error, for estimates of the following parameters: total (F) and intrapopulation (f) fixation indices, and divergence among populations (q). Knowledge of these relative contributions is important for establishing sampling strategies for natural populations. To this end, the bootstrap method was used to resample populations and individuals simultaneously, considering different combinations of P and I. This procedure was repeated five times for each combination in each analyzed data set. Thus, for each data set, five estimates of the bootstrap variance were obtained for each combination of P and I and each parameter estimate. These variance estimates were submitted to an analysis of variance with a factorial structure. The sources of variation considered in this analysis were P, I and P*I. The coefficient of determination (R²) was calculated for each source of variation. Sources of variation with greater R² are responsible for larger errors of the estimates. The method was efficient in answering the questions initially proposed, and the results indicated that there are no ideal sample sizes for a species, but rather for a specific data set, because each data set has its own particularities. However, for investigations of the genetic structure of natural populations using population parameters, the number of populations sampled is a critical factor. Thus, more effort should be devoted to increasing the number of sampled populations than to increasing the number of individuals within populations. A sampling strategy is given as a guide for investigations of this kind when there is no previous knowledge about the genetic structure and the mating system of the populations.

population structure; resampling; natural populations


PLANT GENETICS

RESEARCH ARTICLE

Luciana Aparecida Carlini-Garcia (I); Roland Vencovsky (II); Alexandre Siqueira Guedes Coelho (III)

(I) Instituto Agronômico de Campinas, Centro de Grãos e Fibras, Campinas, SP, Brazil

(II) Universidade de São Paulo, Escola Superior de Agricultura "Luiz de Queiroz", Departamento de Genética, Piracicaba, SP, Brazil

(III) Universidade Federal de Goiás, Instituto de Ciências Biológicas, Departamento de Biologia Geral, Goiânia, GO, Brazil

Send correspondence to Luciana Aparecida Carlini-Garcia, Instituto Agronômico de Campinas, Centro de Grãos e Fibras, Av. Barão de Itapura 1481, Caixa Postal 28, 13020-902 Campinas, SP, Brazil. E-mail: lac_garcia@iac.sp.gov.br

Introduction

In nature, genetic variability is found at different hierarchical levels, such as populations within species, subpopulations within populations, individuals within subpopulations, genes within individuals, and so on.

There are many ways to analyze how this variability is distributed. One of these methodologies was proposed by Cockerham (1969, 1973) and Weir and Cockerham (1984), and is based on the analysis of variance of the gene frequencies under a random model. Population parameters such as the total fixation index (F), the intrapopulation fixation index (f), and the divergence among populations or coancestry within populations (q) are estimated when the hierarchical levels of populations, individuals within populations, and genes within individuals are considered. Other pertinent parameters can also be estimated from this analysis, when other hierarchical levels are included (Weir, 1996).

When natural populations are studied, replications are not available, as populations and individuals are sampled under the conditions of the species’ habitat. Therefore, errors of estimates cannot be obtained as in regular experimentation. However, with the growth of computing power in recent years, resampling methods such as the jackknife and the bootstrap have frequently been applied to estimate such errors.

In this context, the bootstrap method stands out because, apart from allowing the estimation of parameters and of their respective variances, it also yields the empirical distribution of the estimates and allows several types of related confidence intervals to be constructed (Efron and Tibshirani, 1993; Shao and Tu, 1995; Davison and Hinkley, 1997; Manly, 1997). Examples of the application of this method in genetics are given by Felsenstein (1985), Dopazo (1994), Halldén et al. (1994), Tivang et al. (1994), Van Dongen (1995), Van Dongen and Backeljau (1995), Visscher et al. (1996), Petit and Pons (1998), Fanizza et al. (1999), Remington et al. (1999), and Carlini-Garcia et al. (2001, 2003).

In general, studies that estimate the parameters F, f and q resample only one level of variation, such as individuals (Van Dongen, 1995). Other studies include two levels of variation at the same time, resampling individuals and populations simultaneously (Petit and Pons, 1998; Carlini-Garcia et al., 2001, 2003), to obtain bootstrap estimates of the parameters and of their variances as a function of the sources of variation of individuals within populations and of populations, considered separately and jointly. This approach is attractive because, if the sources of variation of individuals within populations and of populations are shown to be additive when resampled separately, the contribution of each source to the total variance can be easily estimated. This can be very useful for establishing sampling strategies. Petit and Pons (1998), applying bootstrap resampling to estimate diversity parameters, verified that the variance resulting from the simultaneous resampling of individuals and populations contains the population variance component once and the intrapopulation variance component twice. Carlini-Garcia et al. (2003) observed that the mean squares obtained by resampling individuals and populations separately add up to the mean square obtained by resampling individuals and populations simultaneously. This was verified empirically for estimates of the parameters F, f and q, with real as well as simulated data. However, no explicit expressions of the bootstrap variance components for the various resampling levels considered were obtained.

The aims of this research were: i) to present an alternative way of calculating the relative contribution, to the estimate of the corresponding total variance, of the bootstrap variance estimates of parameters F, f and q obtained for the sources of variation due to the number of individuals within populations (I), the number of populations (P), the P*I interaction, and the error; and ii) to help rationalize the sampling process in nature when research is focused on the genetic structure of natural populations, as the relative contribution of the different sources of variation has a direct implication for this process.

Material and Methods

Material

Five real data sets, obtained by Ciampi (1999), Reis et al. (2000), Seoane et al. (2000), Auler et al. (2002), and Telles et al. (2003), were considered. These authors studied the population genetic structure and/or the predominant reproductive system of tropical tree species. All the data sets had a hierarchical structure, so that variation was split into sources of variation due to populations, individuals within populations and genes within individuals. Reis et al. (2000) studied eight Euterpe edulis (known as "palmiteiro") populations, with an average of 24.8 individuals sampled per population, and used seven isoenzyme loci. Seoane et al. (2000) studied four Esenbeckia leiocarpa (known as "guarantã") populations, with 22 individuals per population, and used five polymorphic loci. Four isoenzyme loci were considered by Telles et al. (2003) for studying six Annona crassiflora (known as "araticunzeiro") populations, with 30 individuals sampled per population. Ciampi (1999) used eight microsatellite loci to evaluate four Copaifera langsdorffii (known as "copaíba") populations, with 24 individuals sampled per population. Auler et al. (2002) used twelve polymorphic isoenzyme loci to evaluate nine Araucaria angustifolia (known as "pinheiro-do-Paraná") populations, with an average of 36.1 individuals sampled per population.

Methods

The analysis of variance of gene frequencies (Cockerham, 1969, 1973; Weir and Cockerham, 1984; Weir, 1996) was used to investigate population structure. For each data set, estimates of the total variance ($\hat{\sigma}^2_T$), the population variance ($\hat{\sigma}^2_P$), the variance of individuals within populations ($\hat{\sigma}^2_I$), and the variance of genes within individuals ($\hat{\sigma}^2_G$) were obtained, with $\hat{\sigma}^2_T = \hat{\sigma}^2_P + \hat{\sigma}^2_I + \hat{\sigma}^2_G$. These variance estimates were used to estimate: the total fixation index F (the correlation between alleles within individuals considering the entire set of populations, or the total coefficient of inbreeding), $\hat{F} = (\hat{\sigma}^2_P + \hat{\sigma}^2_I)/\hat{\sigma}^2_T$; the intrapopulation fixation index f (the correlation between alleles within individuals and within populations, or the intrapopulation coefficient of inbreeding), $\hat{f} = \hat{\sigma}^2_I/(\hat{\sigma}^2_I + \hat{\sigma}^2_G)$; and q (the correlation between alleles of different individuals of the same population, or the coancestry between individuals of the same population, which is also a measure of the divergence among populations), $\hat{q} = \hat{\sigma}^2_P/\hat{\sigma}^2_T$. For each data set, a random model according to Weir (1996) was considered, assuming the existence of a reference population from which the studied populations originated by genetic drift, in the absence of selection. The loci were considered neutral, and the mathematical model used in the analysis of variance of gene frequencies had the following hierarchical levels: populations, individuals within populations and genes within individuals.
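As an illustration of how the three indices follow from the variance components, the sketch below computes them as ratios, assuming the usual Cockerham definitions (F as the share of the total variance due to populations plus individuals, f as the share of the within-population variance due to individuals, and q as the share of the total variance due to populations). The function name, argument names and the numerical inputs are hypothetical and are not taken from any of the five data sets.

```python
from typing import NamedTuple


class FixationIndices(NamedTuple):
    F: float  # total fixation index
    f: float  # intrapopulation fixation index
    q: float  # divergence among populations (coancestry)


def fixation_indices(var_p: float, var_i: float, var_g: float) -> FixationIndices:
    """Ratios of variance components for populations (var_p), individuals
    within populations (var_i) and genes within individuals (var_g)."""
    var_t = var_p + var_i + var_g   # total variance
    within = var_i + var_g          # variance within populations
    if var_t == 0 or within == 0:
        # A zero denominator leaves the estimate undefined.
        raise ZeroDivisionError("zero denominator: estimate undefined")
    return FixationIndices(
        F=(var_p + var_i) / var_t,
        f=var_i / within,
        q=var_p / var_t,
    )


# Hypothetical variance components, chosen only to exercise the function.
est = fixation_indices(var_p=0.02, var_i=0.03, var_g=0.15)
print(est)
# Under these definitions the indices satisfy (1 - F) = (1 - f) * (1 - q).
assert abs((1 - est.F) - (1 - est.f) * (1 - est.q)) < 1e-12
```

The explicit check on the denominators mirrors the fact that a ratio estimate is undefined whenever its denominator is zero.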

The bootstrap procedure is a resampling technique used to obtain the estimate of a parameter, its standard error, the distribution of the estimates of the parameter and also its percentile confidence interval. Resampling is carried out with replacement (Efron and Tibshirani, 1993). For a given parameter, several bootstrap resamples (10000, for example) are taken, each one furnishing one estimate of that parameter. The mean and the variance of these estimates are the bootstrap parameter estimate and variance, respectively. From the distribution of these estimates it is possible to obtain the percentile confidence interval. In this research, this procedure was used to determine the relative importance of the sources of variation due to the number of populations (P), the number of individuals (I), the interaction between the number of populations and the number of individuals (P*I), and the error, all in relation to the total variation. Resampling was applied varying the number of populations and individuals, and the population parameters were estimated for each combination of P and I. One thousand resamples were drawn for each possible combination, from which the variance of each parameter estimate was obtained. In each data set, these combinations ranged from two populations and two individuals per population to the maximum possible number of populations and the maximum possible number of individuals. In each case, resampling was carried out in two steps: first for populations and subsequently for individuals within populations. This procedure was repeated five times for each combination, leading to five estimates of the parameter and of its bootstrap variance for each data set. For this purpose, the EGBV computer program (Coelho, 2000) was used.
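The two-step resampling described above can be sketched as follows. This is a minimal illustration and not the EGBV program: the data are toy numeric values rather than genotypes, and the statistic is a stand-in (the variance of population means) rather than an estimator of F, f or q. All function names are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence

Sample = List[List[float]]  # one inner list of individual values per population


def two_stage_resample(data: Sequence[Sequence[float]], n_pops: int,
                       n_inds: int, rng: random.Random) -> Sample:
    """Draw n_pops populations with replacement, then n_inds individuals
    with replacement inside each drawn population."""
    drawn = [rng.choice(data) for _ in range(n_pops)]
    return [[rng.choice(pop) for _ in range(n_inds)] for pop in drawn]


def bootstrap_variance(data: Sequence[Sequence[float]], n_pops: int, n_inds: int,
                       statistic: Callable[[Sample], float],
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Variance of the statistic over n_resamples two-stage resamples."""
    rng = random.Random(seed)
    estimates = [statistic(two_stage_resample(data, n_pops, n_inds, rng))
                 for _ in range(n_resamples)]
    return statistics.variance(estimates)


def among_population_variance(sample: Sample) -> float:
    """Stand-in statistic (NOT F, f or q): variance of the population means."""
    return statistics.pvariance([statistics.fmean(pop) for pop in sample])


# Toy data: 4 populations x 24 individuals, simulated for illustration only.
toy_rng = random.Random(42)
toy = [[toy_rng.gauss(mu, 1.0) for _ in range(24)] for mu in (0.0, 0.5, 1.0, 1.5)]

# Five repetitions (different seeds) for two combinations of P and I.
for n_pops, n_inds in [(2, 2), (4, 24)]:
    reps = [bootstrap_variance(toy, n_pops, n_inds, among_population_variance,
                               n_resamples=200, seed=s) for s in range(5)]
    print(n_pops, n_inds, [round(v, 4) for v in reps])
```

Each combination of P and I yields five variance values, one per repetition, which is the shape of input expected by a factorial analysis of variance with five replicates per cell.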

For clarification, the data set obtained by Ciampi (1999) can be taken as an example. A total of four populations were investigated, with 24 individuals per population. With the number of populations ranging from two to four (three levels) and the number of individuals from two to 24 (23 levels), this gave 3 × 23 = 69 combinations of these two factors. The estimated variances of the parameter estimates for each combination were submitted to an analysis of variance, considering its factorial structure.
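The count of 69 combinations can be checked by enumerating the grid of resampling sizes, assuming that both factors start at two and run up to the numbers available in the data set. The function name is hypothetical.

```python
from typing import List, Tuple


def resampling_grid(max_pops: int, max_inds: int,
                    min_pops: int = 2, min_inds: int = 2) -> List[Tuple[int, int]]:
    """All (P, I) combinations from the minimum sizes up to the maxima."""
    return [(p, i)
            for p in range(min_pops, max_pops + 1)
            for i in range(min_inds, max_inds + 1)]


# Four populations and 24 individuals per population, as in Ciampi (1999).
grid = resampling_grid(max_pops=4, max_inds=24)
print(len(grid))  # 3 population levels x 23 individual levels = 69
```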

As the partitioning of the sum of squares is not orthogonal, the mean square estimates are not independent. Nevertheless, according to Sokal and Rohlf (1995), it is possible to obtain the coefficient of determination (R²) for each source of variation simply by dividing the sum of squares (SS) of each factor, and also of the interaction between factors, by the total sum of squares (SSTotal). Sources of variation with higher R² values have more influence on the error of the estimate of a given parameter. Therefore if, for instance, the source of variation of the number of individuals has a larger R² value than that of the number of populations, individuals should be given priority in sampling schemes for the species.
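The ratio of each sum of squares to the total can be sketched for the simplest case. The code below assumes a balanced two-factor layout (every combination of P and I present, with the same number of replicates), in which the partition is orthogonal; this is a simplifying assumption of the sketch and differs from the non-orthogonal partition mentioned above. The function name and the input values are hypothetical.

```python
from statistics import fmean
from typing import Dict, List


def factorial_r2(cells: List[List[List[float]]]) -> Dict[str, float]:
    """R2 = SS / SS_total for P, I, P*I and error.

    cells[a][b] holds the replicate values for level a of P and level b of I.
    Assumes a balanced layout (same number of replicates in every cell).
    """
    n_a, n_b, n_r = len(cells), len(cells[0]), len(cells[0][0])
    values = [v for row in cells for cell in row for v in cell]
    grand = fmean(values)
    mean_a = [fmean([v for cell in row for v in cell]) for row in cells]
    mean_b = [fmean([v for row in cells for v in row[b]]) for b in range(n_b)]
    mean_ab = [fmean(cell) for row in cells for cell in row]

    ss_total = sum((v - grand) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("all values are equal: R2 is undefined")
    ss_p = n_b * n_r * sum((m - grand) ** 2 for m in mean_a)
    ss_i = n_a * n_r * sum((m - grand) ** 2 for m in mean_b)
    ss_cells = n_r * sum((m - grand) ** 2 for m in mean_ab)
    ss_pi = ss_cells - ss_p - ss_i   # interaction
    ss_error = ss_total - ss_cells   # replicates within cells
    return {"P": ss_p / ss_total, "I": ss_i / ss_total,
            "P*I": ss_pi / ss_total, "Error": ss_error / ss_total}


# Hypothetical bootstrap variances: 2 levels of P x 2 levels of I x 2 replicates.
example = [[[0.040, 0.042], [0.030, 0.031]],
           [[0.020, 0.021], [0.015, 0.016]]]
r2 = factorial_r2(example)
print({k: round(v, 3) for k, v in r2.items()})
assert abs(sum(r2.values()) - 1.0) < 1e-9
```

In the balanced case the four ratios add up to one, so each can be read directly as a share of the total variation among the bootstrap variances.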

It is important to emphasize that a data set analyzed as proposed here must contain at least three populations and three individuals per population. Resampling starts at two populations, since with only one population it is impossible to estimate F and q. With only two populations in the data set, the factor P would therefore have a single level, and its number of degrees of freedom in the analysis of variance under the factorial design would be zero. The same reasoning applies to the resampling of individuals.

Results and Discussion

Since the parameter estimates are ratios of variance components, some of the outcomes had to be discarded when the denominator was equal to zero. For a given combination of numbers of populations and individuals, the variance of a parameter estimate was computed only for those cases in which the fraction of discards was equal to or less than 10%. For the data sets provided by Auler et al. (2002) and Seoane et al. (2000), the minimum number of individuals considered was, therefore, five and six, respectively.
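The discard rule can be expressed as a small filter applied before the variance is computed. In this sketch, discarded resamples are marked with None, and the 10% threshold is the one stated above; the function name and the representation of discards are assumptions made for illustration.

```python
import statistics
from typing import List, Optional


def variance_if_enough_valid(estimates: List[Optional[float]],
                             max_discard_fraction: float = 0.10) -> Optional[float]:
    """Variance of the valid estimates, or None when too many were discarded.

    None entries mark resamples discarded because of a zero denominator.
    """
    valid = [e for e in estimates if e is not None]
    discarded = len(estimates) - len(valid)
    if not estimates or len(valid) < 2:
        return None
    if discarded / len(estimates) > max_discard_fraction:
        return None
    return statistics.variance(valid)


# One discard out of ten (10%) is accepted; two out of ten (20%) is not.
print(variance_if_enough_valid([1.0, 2.0, 3.0] * 3 + [None]))
print(variance_if_enough_valid([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, None, None]))
```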

Although all species included in this study are predominantly panmictic, Carlini-Garcia et al. (2001) showed, based on these data sets, that they have distinct population structures. The populations studied by Reis et al. (2000) are non-inbred (f = 0) and showed small divergence (q = 0); consequently, their total fixation index also did not differ statistically from zero (F = 0). Similar results were found with the data set of Seoane et al. (2000). Results obtained with the data set of Ciampi (1999) were also similar, except that the divergence among populations was significant (q > 0). In the data set of Telles et al. (2003), the populations were considered non-inbred (f = 0) but, as the divergence among populations was significant (q > 0) and relatively high, the total fixation index was different from zero (F > 0). On the other hand, the populations sampled by Auler et al. (2002) were inbred (f > 0) and divergent (q > 0). As a consequence, the total fixation index was significantly greater than zero.

The results showed that for total inbreeding (F) the largest portion of the variance of the estimates was due to the source of variation of populations (Table 1). This was observed for all data sets, as can be seen from the corresponding R² values. Therefore, there is strong evidence that, for the estimation of F and for a given total number of individuals, priority should be given to a greater number of populations rather than to a greater number of individuals in a smaller number of populations.

In the case of f, the sources of variation of populations and of individuals appeared to be equally important, and there was apparently no consistent pattern determined by the population structure of the different data sets (Table 1). For the data sets of Reis et al. (2000) and of Seoane et al. (2000), for instance, all three parameters were considered not different from zero, as mentioned previously. However, for the Reis et al. (2000) data set, the results suggested that it would be more important to increase the number of sampled individuals, whereas for the Seoane et al. (2000) data set increasing the number of populations would be more important. For the data sets of Telles et al. (2003) and of Ciampi (1999), increasing the number of individuals to estimate f would be recommended, despite the relatively high interpopulation divergence observed for both data sets. In the Auler et al. (2002) data set, however, the number of populations is more important than the number of individuals for the estimation of all three parameters.

For the estimation of q, similarly to what was found for f, there was no consistent pattern relating the relative importance of the sources of variation (P, I and P*I) to the structure of the populations considered.

In general, the P*I interaction showed the smallest coefficients of determination among the sources of variation other than the error (Table 1). The R² values for this interaction ranked third in most cases, with some exceptions, as in the estimation of f in the species studied by Ciampi (1999) and of q in the Reis et al. (2000) data set. The results indicate that no specific combination of the number of populations and of individuals could be identified as adequate for the sampling process. As expected, the coefficients of determination obtained for the error were always very small when compared to those obtained for the other sources of variation (Table 1).

Considering the formulas that define the parameters F, f and q, one could be led to conclude that, in a sampling scheme for the estimation of f, attention should be given only to the number of individuals, with less attention to the number of populations. The results showed that this is not necessarily true. Possibly because of heterogeneity of the within-population component across populations, sampling an adequate number of populations also affects the error of the estimate of f. Conversely, for q, it is also not true that a reliable estimate of this measure of diversity requires only the sampling of an adequate number of populations. Since both factors, P and I, proved important, it can be concluded that, for obtaining reliable estimates of all three parameters, a proper balance between the number of populations and of individuals is a good strategy, giving priority to the number of populations. In the present study, the number of populations was smaller than ten in all cases. The numbers sampled, which ranged from four to nine, do not seem adequate for the estimation of these population parameters, especially if it is assumed that under natural conditions this number is potentially very large.

To complete this kind of investigation, the source of variation due to loci should be included, since it is known that, even for neutral loci, estimates can vary from locus to locus in finite populations (Coelho and Vencovsky, 2003).

The usual procedure for investigating the relative importance of different factors in the bootstrap variance is to carry out resampling of a given factor, fixing the others (Carlini-Garcia et al., 2001). The total variance, due to all factors, can be estimated using a stepwise resampling scheme. This type of procedure will be incomplete if the variances of a parameter estimate are estimated under a fixed sample size. The procedure used here considers varying sample sizes of the various factors and allows studying the importance of interaction sources of variation such as P*I. Also, increasing the number of a given factor and studying the magnitude of the respective variance values can be useful for determining the adequate number of populations and individuals in populations, to attain a desired level of precision in the estimation of a given parameter.

The results of this research indicate that the analysis applied was efficient in answering the questions on the relative contribution of the sources of variation due to populations (P), individuals within populations (I), the P*I interaction, and the error to the total variance for the parameters considered. It was possible to verify that ideal sample sizes do not exist for species in general, since each case has its own particularities. However, in investigations of the structure of natural populations with estimation of population parameters, the sampling strategy should take into account that the number of populations to be sampled is a critical factor. More effort should be devoted to increasing this number than to increasing the number of individuals within populations. In general terms, determining adequate sampling procedures would require previous knowledge of the diversity among populations (q) and of the intrapopulation fixation index (f). The same condition applies when a certain effective population size of the sample is desired (Vencovsky and Crossa, 2003). If a detailed investigation is to be carried out to study the genetic structure and the mating system of a given species, sampling can be done in two steps, namely: 1) sampling at least 10 to 15 populations and 25 to 30 individuals per population; and 2) based on the results obtained, improving the previous sampling so as to attain adequate errors of the parameter estimates.

Acknowledgments

The authors wish to thank CNPq for the financial support given.

Received: December 14, 2004; Accepted: October 10, 2005.

Associate Editor: José Francisco Ferraz de Toledo

  • Auler NMF, Reis MS, Guerra MP and Nodari RO (2002) The genetics and conservation of Araucaria angustifolia: I. Genetic structure and diversity of natural populations by means of non-adaptive variation in the state of Santa Catarina, Brazil. Genet Mol Biol 25:329-338.
  • Carlini-Garcia LA, Vencovsky R and Coelho ASG (2001) Método bootstrap aplicado em diferentes níveis de reamostragem na estimação de parâmetros genéticos populacionais. Sci Agric 58:785-793.
  • Carlini-Garcia LA, Vencovsky R and Coelho ASG (2003) Variance additivity of population genetic parameter estimates obtained through bootstrapping. Sci Agric 60:97-103.
  • Ciampi AY (1999) Desenvolvimento e utilização de marcadores microssatélites, AFLP e seqüenciamento de cpDNA, no estudo da estrutura genética e parentesco em populações de copaíba (Copaifera langsdorffii) em matas de galeria no cerrado. Tese de Doutorado, Universidade Estadual Paulista "Júlio de Mesquita Filho", Botucatu, 204 pp.
  • Cockerham CC (1969) Variance of gene frequencies. Evolution 23:72-84.
  • Cockerham CC (1973) Analysis of gene frequencies. Genetics 74:679-700.
  • Coelho ASG (2000) Programa EGBV: Análise de Estrutura Genética de Populações pelo Método da Análise de Variância, com a Utilização de Bootstraps com Amostras de Tamanho Variável (software). Universidade Federal de Goiás, Instituto de Ciências Biológicas, Departamento de Biologia Geral, Goiânia, Goiás.
  • Coelho ASG and Vencovsky R (2003) Intrapopulation fixation index dynamics in finite populations with variable outcrossing rates. Sci Agric 60:305-313.
  • Davison AC and Hinkley DV (1997) Bootstrap Methods and Their Application. Cambridge University Press, New York, 582 pp.
  • Dopazo J (1994) Estimating errors and confidence intervals for branch lengths in phylogenetic trees by a bootstrap approach. J Mol Evol 38:300-304.
  • Efron B and Tibshirani RJ (1993) An Introduction to the Bootstrap. Chapman & Hall, New York, 436 pp.
  • Fanizza G, Colonna G, Resta P and Ferrara G (1999) The effect of the number of RAPD markers on the evaluation of genotypic distances in Vitis vinifera. Euphytica 107:45-50.
  • Felsenstein J (1985) Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783-791.
  • Halldén C, Nilsson NO, Rading IM and Säll T (1994) Evaluation of RFLP and RAPD markers in a comparison of Brassica napus breeding lines. Theor Appl Genet 88:123-128.
  • Manly BFJ (1997) Randomization, Bootstrap and Monte Carlo Methods in Biology. 2nd edition. Chapman & Hall, New York, 399 pp.
  • Petit RJ and Pons O (1998) Bootstrap variance of diversity and differentiation estimators in a subdivided population. Heredity 80:56-61.
  • Reis MS, Vencovsky R, Kageyama PY, Guimarães E, Fantini AC, Nodari RO and Mantovani A (2000) Variação genética em populações naturais de palmiteiro (Euterpe edulis Martius, Arecaceae) na floresta ombrófila densa. Sellowia 49-52:131-149.
  • Remington DL, Whetten RW, Liu BH and O'Malley DM (1999) Construction of an AFLP genetic map with nearly complete genome coverage in Pinus taeda. Theor Appl Genet 98:1279-1292.
  • Seoane CES, Kageyama PY and Sebbenn AM (2000) Efeitos da fragmentação florestal na estrutura genética de populações de Esenbeckia leiocarpa Engl. (guarantã). Sci For 57:123-139.
  • Shao J and Tu D (1995) The Jackknife and Bootstrap. Springer Series in Statistics, Springer-Verlag, New York, 516 pp.
  • Sokal RR and Rohlf FJ (1995) Biometry. WH Freeman and Company, New York, 880 pp.
  • Telles MPC, Valva FD, Bandeira LF and Coelho ASG (2003) Caracterização genética de populações naturais de araticunzeiro (Annona crassiflora Mart. - Annonaceae) no Estado de Goiás. Rev Bras Bot 26:123-129.
  • Tivang JG, Nienhuis J and Smith OS (1994) Estimation of sampling variance of molecular marker data using the bootstrap procedure. Theor Appl Genet 89:259-264.
  • Van Dongen S (1995) How should we bootstrap allozyme data? Heredity 74:445-447.
  • Van Dongen S and Backeljau T (1995) One- and two-sample tests for single-locus inbreeding coefficients using the bootstrap. Heredity 74:129-135.
  • Vencovsky R and Crossa J (2003) Measurements of representativeness used in genetic resources conservation and plant breeding. Crop Sci 43:1912-1921.
  • Visscher PM, Thompson R and Haley CS (1996) Confidence intervals in QTL mapping by bootstrapping. Genetics 143:1013-1020.
  • Weir BS (1996) Genetic Data Analysis II. 2nd edition. Sinauer Associates Inc. Publishers, Sunderland, 445 pp.
  • Weir BS and Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.
  • Publication Dates

    • Publication in this collection
      12 June 2006
    • Date of issue
      2006

    History

    • Accepted
      10 Oct 2005
    • Received
      14 Dec 2004
    Sociedade Brasileira de Genética Rua Cap. Adelmio Norberto da Silva, 736, 14025-670 Ribeirão Preto SP Brazil, Tel.: (55 16) 3911-4130 / Fax.: (55 16) 3621-3552 - Ribeirão Preto - SP - Brazil
    E-mail: editor@gmb.org.br