Variance additivity of genetic populational parameter estimates obtained through bootstrapping

Abstracts

Studying the genetic structure of natural populations is important for the conservation and use of the genetic variability available in nature. This research addresses the analysis of genetic population structure using real and simulated molecular data. To obtain variance estimates of the relevant parameters, the bootstrap resampling procedure was applied over different sampling units, namely: individuals within populations (I), populations (P), and individuals and populations simultaneously (I, P). The parameters considered were the total fixation index (F or F_IT), the fixation index within populations (f or F_IS) and the divergence among populations or intrapopulation coancestry (θ or F_ST). The aim of this research was to verify whether the variance estimates of $\hat{F}$, $\hat{f}$ and $\hat{\theta}$ obtained by resampling individuals and populations simultaneously (I, P) correspond to the sum of the respective variance estimates obtained by resampling individuals and populations separately (I+P). This equivalence was verified in all cases, showing that the total variance estimate of $\hat{F}$, $\hat{f}$ and $\hat{\theta}$ can be obtained by summing the variances estimated for each source of variation separately. This facilitates the use of the bootstrap method on data with hierarchical structure and makes it possible to obtain the relative contribution of each source of variation to the total variation of the estimated parameters.

Key words: population structure, resampling, molecular markers, natural populations, simulation


Luciana Aparecida Carlini-Garcia (I); Roland Vencovsky (I); Alexandre Siqueira Guedes Coelho (II)

(I) Depto. de Genética, USP/ESALQ, C.P. 83, CEP: 13418-970, Piracicaba, SP

(II) Depto. de Biologia Geral, UFG/ICB, C.P. 131, CEP: 74001-970, Goiânia, GO

Address for correspondence: Roland Vencovsky, rvencovs@esalq.usp.br

INTRODUCTION

In studies of genetic population structure based on genetic markers, resampling methods are commonly used to estimate population genetic parameters and their standard deviations. Some authors have resampled over a single source of variation, such as Van Dongen (1995) and Vencovsky et al. (1997), who resampled only over individuals. Others have resampled over several sources of variation, such as Petit & Pons (1998) and Carlini-Garcia et al. (2001).

Petit & Pons (1998) applied the bootstrap method over individuals, over populations, and over individuals and populations concomitantly, to estimate population parameters and their variances for each of these sources of variation. Their objective was to determine over which source of variation resampling should be applied to obtain estimates of the parameters studied. To do so, they compared the variances obtained for each source of variation with variance estimates calculated from the explicit expressions derived by Pons & Petit (1995). The authors concluded that, to estimate the parameters studied and their variances, resampling should preferably be carried out over populations.

Carlini-Garcia et al. (2001) applied bootstrap resampling over populations, over individuals within populations, over populations and individuals simultaneously, and also over loci. They estimated several population genetic parameters and their standard deviations, and obtained the respective confidence intervals as well as the empirical distributions of the estimates. Among other aspects, they demonstrated the importance of resampling with each source of variation taken into account.

The aim of this research was to verify whether, with hierarchically structured data, the total bootstrap variance of an estimate can be obtained by summing the variances obtained by resampling over each source of variation separately. This equivalence would be advantageous because it would make it possible to obtain the relative contributions of the population and individual sources of variation to the total variation, and because joint resampling of individuals and populations would no longer be necessary. The hierarchical structure considered involved populations and individuals within populations. The parameters evaluated were the total fixation index (F or F_IT), the fixation index within populations (f or F_IS) and the degree of divergence among populations or coancestry within populations (θ or F_ST) (Wright, 1951; 1965; Cockerham, 1969; 1973; Weir & Cockerham, 1984; Weir, 1996). Real and simulated data were considered.

MATERIAL AND METHODS

The data used in this research were obtained by Telles & Coelho (1998), Ciampi (1999), Auler (2000), Reis et al. (2000), Seoane et al. (2000) and Sebbenn et al. (2001). These authors studied the population genetic structure and/or reproductive system of tropical tree species by means of isoenzyme markers or, in the case of Ciampi (1999), microsatellite markers.

Twenty-five simulated data sets were also used, each composed of 30 populations with 100 individuals each. Five loci with three alleles per locus were considered, and the initial allelic frequencies were 1/3 for each allele at all loci. In these simulations, populations in inbreeding equilibrium were considered, with inbreeding rates (s) in the interval 0 ≤ s ≤ 0.08 and the number of generations (g) varying from 100 to 500 (Table 1). Different numbers of generations were used to generate data sets with different degrees of divergence among populations.

The population genetic structure of each data set was studied by means of analysis of variance of gene frequencies (Cockerham, 1969; 1973; Weir & Cockerham, 1984; Weir, 1996). Thus, in each case, the total variance estimate of the allelic frequencies ($\hat{\sigma}^2_T$), as well as its components among populations ($\hat{\sigma}^2_P$), among individuals within populations ($\hat{\sigma}^2_I$) and among genes within individuals ($\hat{\sigma}^2_G$), were obtained. From these, estimates of the total fixation index ($\hat{F}$), the fixation index within populations ($\hat{f}$) and the degree of divergence among populations or coancestry within populations ($\hat{\theta}$), and their respective variance estimates, were calculated.

To obtain these estimates, a random model was considered, meaning that for each data set it was assumed that a reference population gave rise, by genetic drift, to the evaluated populations. Therefore, no selection was assumed at any of the loci considered, so that loci were taken as neutral. The hierarchical structure considered for the analysis of variance included the following sources of variation: populations (P), individuals within populations (I) and genes within individuals (G) (Weir, 1996).

The method of moments was employed to estimate the variance components mentioned above, as well as to estimate the other population parameters, according to Weir (1996):
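The expressions themselves are not reproduced in this text. As a hedged sketch, the ratio estimators of this analysis-of-variance framework are commonly written as below. The subscripts T, P, I and G are assumed here to match the sources of variation named above and may differ from the notation of the original equations.

```latex
\begin{align}
\hat{\sigma}^2_T &= \hat{\sigma}^2_P + \hat{\sigma}^2_I + \hat{\sigma}^2_G \\
\hat{F} &= \frac{\hat{\sigma}^2_P + \hat{\sigma}^2_I}{\hat{\sigma}^2_T} \\
\hat{f} &= \frac{\hat{\sigma}^2_I}{\hat{\sigma}^2_I + \hat{\sigma}^2_G} \\
\hat{\theta} &= \frac{\hat{\sigma}^2_P}{\hat{\sigma}^2_T}
\end{align}
```

In this form the three quantities satisfy $(1 - \hat{F}) = (1 - \hat{f})(1 - \hat{\theta})$, and each is undefined when its denominator is zero.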

The bootstrap resampling method (Efron & Tibshirani, 1993; Manly, 1997) was applied to obtain bootstrap estimates of F, f and θ and their respective variance estimates, considering as sources of variation individuals, populations, and individuals and populations simultaneously, in a way similar to that used by Petit & Pons (1998), with loci held fixed. At each resampling level, 100,000 bootstrap samples were obtained for the real data and 10,000 for the simulated data. The analysis of variance was carried out on each bootstrap sample, providing estimates of F, f and θ. The average of these estimates, per parameter, is the bootstrap estimate of the parameter, while their variance is the variance estimate of the bootstrap estimate of the parameter.
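The three resampling levels can be illustrated with a minimal Python sketch. It is not the EG software and does not implement the analysis-of-variance estimators: `theta_stat` is a deliberately simple stand-in statistic, the data are hypothetical, and the joint (I, P) scheme is assumed to draw populations first and then individuals within each drawn population. All function names are invented for the illustration.

```python
import random
import statistics

def theta_stat(pops):
    """Toy divergence statistic (NOT the Weir & Cockerham estimator):
    variance of population allele frequencies over pbar * (1 - pbar)."""
    freqs = [sum(p) / (2 * len(p)) for p in pops]  # genotypes coded 0, 1, 2
    pbar = statistics.mean(freqs)
    denom = pbar * (1 - pbar)
    if denom == 0:
        return None  # ratio undefined; the caller discards this sample
    return statistics.variance(freqs) / denom

def resample(pops, rng, over_pops, over_inds):
    """Draw one bootstrap sample over populations, individuals, or both."""
    if over_pops:
        pops = rng.choices(pops, k=len(pops))
    if over_inds:
        pops = [rng.choices(p, k=len(p)) for p in pops]
    return pops

def bootstrap_variance(pops, rng, over_pops, over_inds, n_boot=1000):
    """Variance of the statistic across the retained bootstrap samples."""
    values = []
    for _ in range(n_boot):
        value = theta_stat(resample(pops, rng, over_pops, over_inds))
        if value is not None:
            values.append(value)
    return statistics.variance(values)

rng = random.Random(1)
# Hypothetical data: 10 populations of 50 diploid individuals, one biallelic locus.
pops = []
for _ in range(10):
    p = rng.betavariate(4, 4)
    pops.append([(rng.random() < p) + (rng.random() < p) for _ in range(50)])

var_i = bootstrap_variance(pops, rng, over_pops=False, over_inds=True)
var_p = bootstrap_variance(pops, rng, over_pops=True, over_inds=False)
var_ip = bootstrap_variance(pops, rng, over_pops=True, over_inds=True)
print(f"I: {var_i:.3e}  P: {var_p:.3e}  I+P: {var_i + var_p:.3e}  (I, P): {var_ip:.3e}")
```

The printed values let the sum I+P be compared with the joint (I, P) variance for this toy statistic. How close they are here says nothing about the estimators used in the study.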

For each of the parameters F, f and θ, it was verified whether the variances are additive when the bootstrap approach is used. This property was investigated by checking whether the sum of the variance estimates of the parameter estimates, obtained from the independent resampling of individuals and of populations, corresponds to the variance estimate obtained from the concomitant resampling of individuals and populations. In addition to the practical convenience of not having to carry out the simultaneous resampling of populations and individuals, the outlined procedure allows the relative contribution of the sources of variation to be investigated and indicates where the major deficiencies of field sampling lie.

To verify this additivity, a simple linear regression model was adopted, i.e. Y = a + bX + e, where Y is the sum of the variances obtained from resampling individuals and populations separately, and X the respective variance estimated from the simultaneous resampling of these two factors. This was carried out for each parameter (F, f and θ), with the simulated and real data sets. Student's t tests were applied to verify whether the estimates of the regression coefficient (b) and the intercept (a) differed from zero. Confidence intervals for b were constructed to verify whether the corresponding parameter differed from 1 (Sokal & Rohlf, 1995). If a = 0, b ≠ 0 and b does not differ from 1, the regression equation reduces to E(Y) = X in terms of mathematical expectation, which confirms the additivity of the variance estimates derived from resampling individuals and populations separately. The degree of deviation from regression was assessed through the coefficient of determination R², obtained for each regression analysis (Sokal & Rohlf, 1995).
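The regression check described above can be sketched in plain Python as follows. The data values are made up for illustration and are not results from the study; the function names are hypothetical, and the Student's t critical value is passed in by the caller rather than computed.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x, with standard errors and R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # residual mean square, n - 2 degrees of freedom
    return {
        "a": a,
        "b": b,
        "se_a": math.sqrt(mse * (1 / n + mean_x ** 2 / sxx)),
        "se_b": math.sqrt(mse / sxx),
        "r2": 1 - sse / syy,
    }

def additivity_supported(fit, t_crit):
    """True when a does not differ from 0, b differs from 0,
    and the confidence interval for b includes 1."""
    a_is_zero = abs(fit["a"]) <= t_crit * fit["se_a"]
    b_nonzero = abs(fit["b"]) > t_crit * fit["se_b"]
    lower = fit["b"] - t_crit * fit["se_b"]
    upper = fit["b"] + t_crit * fit["se_b"]
    return a_is_zero and b_nonzero and lower <= 1 <= upper

# Made-up illustrative values, not results from the study.
x_joint = [1.0, 2.0, 3.0, 4.0, 5.0]  # variance from joint (I, P) resampling
y_sum = [1.1, 1.9, 3.2, 3.9, 5.1]    # sum of the separate I and P variances
fit = fit_line(x_joint, y_sum)
# 3.182 is the two-sided 95% Student's t critical value for n - 2 = 3 d.f.
print(fit, additivity_supported(fit, t_crit=3.182))
```

Passing the critical value explicitly keeps the sketch free of external libraries; with a statistics package available, the same quantities could be taken from its regression routine.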

The Shapiro-Wilk test was applied to verify whether the residuals of the regression analysis follow a normal distribution (Sokal & Rohlf, 1995). When this assumption was not met, an appropriate data transformation was sought to attain normality. This was necessary to guarantee the validity of the hypothesis tests, as well as of the confidence intervals calculated for the intercept and the regression coefficient, which are based on normality.

As these two variables involved in the regression are random, an appropriate regression analysis for the random model could have been used (geometric mean regression; Sokal & Rohlf, 1995). Nevertheless, these latter authors mention the existence of controversies regarding the use of this methodology. Thus, the usual regression analysis was used here, as described in the previous paragraphs, in agreement with Neter et al. (1990).

All resampling and all calculations of the bootstrap estimates of F, f and θ and of their bootstrap variance estimates were carried out using a version of the EG software (Coelho, 2000) specially developed for this purpose.

RESULTS AND DISCUSSION

For the real data sets, the observed values (total variance estimates due to individuals and populations together: I, P) and the expected values of these variances (sum of the variance estimates due to individuals and to populations, I+P) were very close (Table 2). In the case of the data of Seoane et al. (2000), the discarding of bootstrap samples did not contribute appreciably to the difference between the estimates, as the discard rate was very small: 0.011% and 0.014% for resampling of individuals and for simultaneous resampling of individuals and populations, respectively. However, these discards must have altered the precision of the estimates and of their variances, in comparison with those that would be obtained with no discards. The discards are due to the estimation method used, since estimates were obtained as variance ratios and, in certain combinations, these ratios may have zero values in the denominator. In these cases, the software automatically discarded the bootstrap sample. This procedure was applied at all resampling levels.

The Shapiro-Wilk normality test was non-significant (P ≥ 0.05) in all analyses of the real data sets. This was not the case for the simulated data for F and θ.

Nevertheless, the regression residuals were normally distributed when a logarithmic transformation was applied to the simulated data sets for these two parameters.

In all situations, the linear regression model fitted the data well. In all the real cases, the estimates of b were significant, and the intercept estimates ($\hat{a}$) did not differ from zero. Furthermore, in all regressions, the hypothesis b = 1 was not rejected, since all confidence intervals obtained for b included the value 1. Deviations from regression were negligible, since all R² values were greater than 99% (Table 3, Figure 1 i to iii).


Results obtained for the three parameters indicated that the corresponding variance estimates, obtained by resampling individuals and populations separately, can be summed to obtain the total variance due to these two sources of variation jointly, confirming the additivity of the variances. Therefore, the regression model reduces to Y = X + e. The same behavior was observed when the simulated data sets were analyzed. In this case, the observed and expected values of the total variance estimates of $\hat{F}$, $\hat{f}$ and $\hat{\theta}$ were even more similar (Table 4). Such an outcome is probably due to the large number of populations and individuals used in each data set. No bootstrap discards took place.

Results of the simulated data confirmed those obtained with the real data. In all cases the null hypotheses H0: a = 0 and H0: b = 1 were not rejected. All R² values were greater than 99%, so deviations from regression were negligible (Table 5, Figure 1 iv to vi). This additivity is advantageous, as additive models are much simpler to work with. Another practical advantage is that simultaneous resampling of individuals and populations is not needed to obtain the variance estimate due to these two levels jointly. Summing the bootstrap variance estimates of the different sources of variation is an adequate procedure for obtaining the total variance. Nevertheless, if the total confidence interval of the parameter, due to individuals and populations simultaneously, is of interest, the concomitant resampling of these two sources of variation becomes necessary whenever the distribution of the estimates $\hat{F}$, $\hat{f}$ and $\hat{\theta}$ is unknown. However, to investigate whether the parameter differs from zero, an alternative approach is to analyze jointly the confidence intervals obtained for each resampled level. Carlini-Garcia et al. (2001) proposed that, if at least one of the confidence intervals for a given parameter includes zero, the parameter should be considered null. Under this criterion, the hypothesis that the parameter is null is rejected only when none of the confidence intervals contains zero. The reference value zero is adequate for F, f and θ, but can be different for other parameters.
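The joint decision rule attributed above to Carlini-Garcia et al. (2001) can be sketched in a few lines of Python. The percentile interval is an assumption made for the illustration, since the text does not state how the confidence intervals were built, and the function names and interval values are hypothetical.

```python
import math

def percentile_interval(values, level=0.95):
    """Percentile bootstrap interval (assumed method) from a list of estimates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lower = ordered[math.floor((1 - level) / 2 * last)]
    upper = ordered[math.ceil((1 + level) / 2 * last)]
    return lower, upper

def parameter_differs_from_reference(intervals, reference=0.0):
    """Reject the null value only when no interval contains the reference."""
    return all(not (low <= reference <= high) for low, high in intervals)

# Hypothetical intervals for one parameter at the I and P resampling levels.
interval_i = (0.02, 0.09)
interval_p = (-0.01, 0.12)
print(parameter_differs_from_reference([interval_i, interval_p]))
```

Here the population-level interval includes zero, so under this criterion the parameter would be treated as null even though the individual-level interval excludes zero.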

As mentioned in the methodology, different combinations of selfing rates and numbers of generations of divergence were considered (Table 1). The variances of $\hat{F}$ and $\hat{\theta}$ tend to increase with divergence, as expected (Table 4). Even so, the property of additivity was maintained.

Probably the main advantage of this additivity is the possibility of obtaining the relative contribution of the different sources of variation to the total variation. This has implications for sample planning, as the source that contributes most to the total variance should receive greater attention in the design of future sampling strategies. By knowing these relative contributions and verifying trends across several similar studies, it is possible to organize sampling strategies (number of populations and of individuals per population) to minimize the error in the population parameter estimates.
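One simple way to express these relative contributions is sketched below. The symbols $V_I$, $V_P$ and $V_{I,P}$ are introduced here only for illustration and denote the bootstrap variances of a given estimate obtained by resampling individuals, populations, and both jointly.

```latex
\begin{align}
V_{I,P} &\approx V_I + V_P \\
c_I &= \frac{V_I}{V_I + V_P}, \qquad
c_P = \frac{V_P}{V_I + V_P} = 1 - c_I
\end{align}
```

The shares $c_I$ and $c_P$ describe the resampling levels and not pure variance components, since the variance obtained by resampling populations appears to carry an intrapopulation component as well.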

Obtaining expressions for the intrapopulation (among individuals) and interpopulation variance components that contribute to the bootstrap variances is an interesting area for future research. This is especially true when individuals and populations are resampled simultaneously. Petit & Pons (1998), considering Nei's diversity measures with haploid loci, verified that the variance obtained by the simultaneous resampling of individuals and populations contains the total (intra and interpopulation) and the intrapopulation variances, and not only the total variance. This could not be verified in the present study, because explicit expressions for the bootstrap variance components of the evaluated parameters would be necessary.

Considering the real data sets, in approximately 78% of the comparisons the variance estimates due to populations were greater than the variance estimates due to individuals (Table 2). With the simulated data sets, this percentage increased to 91% (Table 4). These results seem to confirm what Petit & Pons (1998) obtained, i.e., that the variance estimates obtained when populations are resampled contain both the intra and interpopulation variance components. This behavior is expected, since when populations are resampled, the individuals belonging to these populations are also resampled. This suggests that variances derived from the bootstrap should be considered as mean squares and not as variance components associated with the levels under resampling. Therefore, the number of individuals influences the amount by which the variance component due to individuals contributes to the variance due to populations.

Including the source of variation due to loci in this study would require knowing not only the total bootstrap variance based on the hierarchical resampled levels (populations and individuals), but also the component due to loci. As loci lead to a crossed data structure, variance components due to interactions between loci and the other resampled levels are expected. Therefore, mean squares are not expected to be additive when loci are resampled together with individuals and populations. Determining the bootstrap variance components based on the crossed structure due to loci, in addition to those due to the hierarchical structure, is necessary. This is required to verify the property of additivity when loci are resampled together with individuals and populations, or with any other possible hierarchical levels.

Received January 4, 2002

  • AULER, N.M.F. Caracterização da estrutura genética de populações naturais de Araucaria angustifolia (Bert) O. Ktze. no Estado de Santa Catarina. Florianópolis, 2000. 93p. Dissertação (Mestrado) - Universidade Federal de Santa Catarina.
  • CARLINI-GARCIA, L.A.; VENCOVSKY, R.; COELHO, A.S.G. Método bootstrap aplicado em diferentes níveis de reamostragem na estimação de parâmetros genéticos populacionais. Scientia Agricola, v.58, p.785-793, 2001.
  • CIAMPI, A.Y. Desenvolvimento e utilização de marcadores microsatélites, AFLP e seqüênciamento de cpDNA, no estudo da estrutura genética e parentesco em populações de copaíba (Copaifera langsdorffii) em matas de galeria no cerrado. Botucatu, 1999. 204p. Tese (Doutorado) Instituto de Biociências, Universidade Estadual Paulista "Júlio de Mesquita Filho".
  • COCKERHAM, C.C. Variance of gene frequencies. Evolution, v.23, p.72-84, 1969.
  • COCKERHAM, C.C. Analysis of gene frequencies. Genetics, v.74, p.679-700, 1973.
  • COELHO, A.S.G. Programa EG: Análise de estrutura genética de populações pelo método da análise de variância (software). Goiânia: Universidade Federal de Goiânia, Instituto de Ciências Biológicas, Departamento de Biologia Geral, 2000.
  • EFRON, B.; TIBSHIRANI, R.J. An introduction to the bootstrap. New York: Chapman & Hall, 1993. 436p.
  • MANLY, B.F.J. Randomization, bootstrap and monte carlo methods in biology. 2. ed. London: Chapman & Hall, 1997. 399p.
  • NETER, J.; WASSERMAN, W.; KUTNER, M.H. Applied linear regression models. 3. ed. Homewood: Irwin, 1990. 1181p.
  • PETIT, R.J.; PONS, O. Bootstrap variance of diversity and differentiation estimators in a subdivided population. Heredity, v.80, p.56-61, 1998.
  • PONS, O.; PETIT, R.J. Estimation variance and optimal sampling of gene diversity. I. Haploid locus. Theoretical and Applied Genetics, v.90, p.462-470, 1995.
  • REIS, M.S.; VENCOVSKY, R.; KAGEYAMA, P.Y.; GUIMARÃES, E.; FANTINI, A.C.; NODARI, R.O.; MANTOVANI, A. Variação genética em populações naturais de palmiteiro (Euterpe edulis Martius Arecaceae) na floresta ombrófila densa. Sellowia, v.49-52, p.131-149, 2000.
  • SEBBENN, A.M.; SEOANE, C.E.S.; KAGEYAMA, P.Y.; LACERDA, C.M.B. Estrutura genética em populações de Tabebuia cassinoides: implicações para o manejo florestal e a conservação genética. Revista do Instituto Florestal, v.13, p.99-113, 2001.
  • SEOANE, C.E.S.; KAGEYAMA, P.Y.; SEBBENN, A.M. Efeitos da fragmentação florestal na estrutura genética de populações de Esenbeckia leiocarpa Engl. (guarantã). Scientia Forestalis, n.57, p.123-139, 2000.
  • SOKAL, R.R.; ROHLF, F.J. Biometry: the principles and practice of statistics in biological research. 3. ed. New York: W.H. Freeman, 1995. 887p.
  • TELLES, M.P.C.; COELHO, A.S.G. Caracterização genética de populações naturais de araticum (Annona crassiflora). Genetics and Molecular Biology, v.21, p.199, 1998. Supplement. /Apresentado ao 44. Congresso Nacional de Genética, Águas de Lindóia, 1998 Resumo/
  • VAN DONGEN, S. How should we bootstrap allozyme data? Heredity, v.74, p.445-447, 1995.
  • VENCOVSKY, R.; DIAS, C.T.S.; DEMÉTRIO, C.G.B.; LEANDRO, R.A.; PIEDADE, S.M.S. Reamostragem por "bootstrap" na estimação de parâmetros baseados em marcadores genéticos. In: ENCONTRO SOBRE TEMAS DE GENÉTICA E MELHORAMENTO, 14., Piracicaba, 1997. Anais. Piracicaba: ESALQ, Depto. de Genética, 1997. p.59-72.
  • WEIR, B.S. Genetic data analysis II. 2. ed. Sunderland: Sinauer Associates, 1996. 445p.
  • WEIR, B.S.; COCKERHAM, C.C. Estimating F-statistics for the analysis of population structure. Evolution, v.38, p.1358-1370, 1984.
  • WRIGHT, S. The genetical structure of population. Annals of Eugenics, v.15, p.323-354, 1951.
  • WRIGHT, S. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution, v.19, p.395-420, 1965.

Publication dates: published in this collection 01 Apr 2003; date of issue Feb 2003; received 04 Jan 2002.

    Escola Superior de Agricultura "Luiz de Queiroz" USP/ESALQ - Scientia Agricola, Av. Pádua Dias, 11, 13418-900 Piracicaba SP Brazil, Phone: +55 19 3429-4401 / 3429-4486 - Piracicaba - SP - Brazil
    E-mail: scientia@usp.br