Acessibilidade / Reportar erro

Be-Breeder - an application for analysis of genomic data in plant breeding

Abstract

Be-Breeder is an application directed toward genetic breeding of plants, developed through the Shiny package of the R software, which allows different phenotype and molecular (marker) analysis to be undertaken. The section for analysis of molecular data of the Be-Breeder application makes it possible to achieve quality control of genotyping data, to obtain genomic kinship matrices, and to analyze genome selection, genome association, and genetic diversity in a simple manner on line. This application is available for use in a network through the site of the Allogamous Plant Breeding Laboratory of ESALQ-USP (http://www.genetica.esalq.usp.br/alogamas/R.html).

Key words:
Quality control; kinship matrix; genome selection; genetic diversity; SNP

INTRODUCTION

The application of molecular information in genetic breeding of plants can lead to savings in time and money and increase precision and intensity of selection for programs that use this information (Moose and Mumm 2008Moose SP, Mumm RH (2008) Molecular plant breeding as the foundation for 21st century crop improvement. Plant Physiology 147: 969-977. , Varshney et al. 2009Varshney RK, Nayak SN, May GD and Jackson SA (2009) Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends in Biotechnology 27: 522-530. ). In addition, this information has allowed advances in studies in population genetics and species evolution through facilitating observation and comparison of existing genetic diversity measured by molecular markers (Govindaraj et al. 2015Govindaraj M, Vetriventhan M and Srinivasan M (2015) Importance of genetic diversity assessment in crop plants and its recent advances: an overview of its analytical perspectives. Genetics Research International 2015: 1-14. ).

Due to the complexity of molecular markers, as well as the size of data sets generated, different statistical and computational approaches are necessary for researchers to extract all the information available for analysis (Cole et al. 2012Cole J, Newman S, Foertter F, Aguilar I and Coffey M (2012) Breeding and genetics symposium: Really big data: Processing and analysis of very large data sets. Journal of Animal Science 90: 723-733. ). For that reason, software programs with different computational approaches have been developed to undertake molecular analysis in specific areas of genetics and plant breeding. In this context, the aim of this study was to develop an easily accessible application through scripts and packages of the R software, the molecular section of app Be-Breeder (Fritsche-Neto and Matias 2016Fritsche-Neto R, Matias FI (2016) Be-Breeder-Learning: a new tool for teaching and learning plant breeding principles. Crop Breeding and Applied Biotechnology 16: 240-245. ) which simultaneously includes different molecular analysis as tools for hanalysisandling and transformation of datasets and matrices, methodologies of genomic selection and association, and mechanisms for studies of genetic diversity.

MATERIAL AND METHODS

The Be-Breeder application was constructed through the Shiny package of the R software (Chang et al. 2015Chang W, Cheng J, Allaire J, Xie Y and McPherson J (2015) Shiny: web application framework for R. R package version 0.11 1. ) with a click point interface that allows the user to perform different analysis with molecular (marker) data. For that reason, scripts in R language were developed and implemented in the “molecular breeding” section by the team of the Allogamous Plant Breeding Laboratory of the Genetics Department of the Escola Superior de Agricultura “Luiz de Queiroz” (ESALQ) of the Universidade de São Paulo (USP). This is available for use at http://www.genetica.esalq.usp.br/alogamas/R.html.

In each one of the tabs to be presented below, there are practical examples and explanatory tutorials in the application itself regarding data entry and details of analysis (Help).

Genotyping data

Modifying and constructing matrices that are in agreement with the presupposed statistics or even those that are suitable for entry in analytical software is one of the main challenges encountered by researchers. In this respect, this section of the application provides the user with the tabs “Quality Control” and “Kinship Matrix”.

The first tab uses the “raw.data” function of the snpReady package of R (under construction). This allows the user to control the quality of the genomic data. Data can be uploaded as matrix or columns, codified in nitrogen bases, as a function of the structure of the initial dataset received from genotyping (Table 1).

Table 1
Example of data entry in the column format for the “Quality Control” section

- MAF (Minor Allele Frequency): consists of removal of markers with minor allele frequency, that is, lower than the threshold defined by the user.

- Call Rate: call rate parameter, which refers to the quality of genotyping in relation to lost data. Thus, markers with a Call Rate lower than the threshold defined by the user will be eliminated.

- Sweep sample: refers to the quality of genotyping in relation to the data lost in the individuals. Thus, individuals with a Sweep sample lower than the threshold defined by the user will be eliminated.

This tab also allows data imputation, which is carried out from the combined probability of the alleles of a determined SNP(i) (frequencies of pi and qi ) and the level of homozygosity of the individual that has a marker to be imputed. The output generated by the function is a “clean” matrix, in which the output format can also be defined by the user, with the options of counting a reference allele for each locus (0,1,2), this matrix centered in zero (-1,0,1), or even the matrix in the appropriate format for entry in the Structure software. Thus, the resulting matrix is in the adequate format to be used in other software or packages or, moreover, to proceed with other analysis in the Be-Breeder application itself.

The “Kinship Matrix” tab uses the “G.matrix” function of the snpReady package and allows construction of different kinship matrices, as indicated by the formulas below:

- UAR: unified additive kinship matrix (Yang et al. 2010Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG and Montgomery GW (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42: 565-569.)

- UARadj: adjusted unified additive kinship matrix (Yang et al. 2010Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG and Montgomery GW (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42: 565-569.)

- WW: kinship matrix of VanRaden (VanRaden 2008VanRaden P (2008) Efficient methods to compute genomic predictions. Journal of Dairy Science 91: 4414-4423. )

Thus, in this tab, in addition to the possibility of generating these three types of matrices, it is possible to change the format of the generated output. The formats available are the traditional kinship matrix (complete matrix) or an alternative format, in which the inverse of the kinship matrix is generated and then organized in columns for use in the package ASREML-R (Butler et al. 2009Butler DG, Cullis BR, Gilmour AR and Gogel BJ (2009) ASReml-R reference manual. Department of Primary Industries and Fisheries, The State of Queensland, 149p. ):

Genome selection (GS)

Genome selection is a potent tool in genetic plant breeding, the basis of which is the use of powerful statistical models to carry out prediction using thousands of markers. These models estimate the effects of molecular markers to predict characteristics of the individuals of this population without the need for direct phenotype observation (Meuwissen et al. 2001Meuwissen THE, Hayes BJ and Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819-1829. ). This results in time and resources saved in the selection process (Heffner et al. 2010Heffner EL, Lorenz AJ, Jannink J-L and Sorrells ME (2010) Plant breeding with genomic selection: gain per unit time and cost. Crop Science 50: 1681. ). For that reason, the statistical model must first be trained and validated and then use the effects of the markers as a predictive tool. The Be-Breeder application uses the RR-BLUP method, implemented in the rrBLUP package (Endelman 2011Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4: 250-255. ). Thus, it uses a mixed models approach (1), adopting the effects of the markers as random (2) through the REML (Restricted Maximum-Likelihood) approach.

In the “Prediction and Selection” tab, the user must enter with the marker matrix, in the format centered in zero (-1,0,1) to carry out analysis, in which the individuals are in the lines and the markers in the columns (Table 2). The phenotype dataset is also necessary (Table 3). In this tab, there is also the possibility of using the effects of the markers generated and the matrix of markers of non-phenotyped individuals to predict the genetic values of the individuals.

Table 2
Example of genotype data entry for the “Genomic Selection(GS) section

Table 3
Example of phenotype data entry for the “Genomic Selection(GS) section

Genome association (GWAS)

Determination of regions in the gene related to a trait (QTL) through molecular markers is an initial step and of extreme importance for understanding the regulation and genetic structure of a phenotype of interest (Korte and Farlow 2013Korte A, Farlow A (2013) The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9: 1. ). This information allows the user to direct other bioinformation approaches, such as verification of genes associated with this QTL, to try to correlate the different transcripts with the phenotypes (Kamatani 2016Kamatani Y (2016) Genome wide association study: its theory and methodological review. Clinical Calcium 26: 525. ). To do so, an important tool in breeding is genome association (GWAS), which uses mixed models associated with multiple regression, and by means of a threshold (LOD), determined by the user, which allows molecular markers to be found that exhibit greater correlation with the phenotypic variability of the trait under study. For that reason, in the “GWAS” section of the Be-Breeder, the user supplies the phenotype (Table 3) and genome (Table 4) data. From these data and through the “GWAS” function of the rrBLUP package (Endelman 2011Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4: 250-255. ), analysis of association between the markers and the QTL for the trait under study is carried out, considering the following model:

Table 4
Example of genotype data entry for the “Genomic Association(GWAS) section

in which y is the phenotype vector, β is the fixed effects vector, g is the genetic effects vector given as random, τ is the additive effects vector of the SNP being considered as fixed, and ε is the residue vector. X, Z, and S are incidence matrices of the model.

In addition to the estimate of the marker effect, the application creates a Manhattan plot graph with all the chromosomes, markers, and their respective effects.

Analysis of genetic diversity

To study genetic diversity among individuals and populations, molecular markers are quite useful and highly used currently (Govindaraj et al. 2015Govindaraj M, Vetriventhan M and Srinivasan M (2015) Importance of genetic diversity assessment in crop plants and its recent advances: an overview of its analytical perspectives. Genetics Research International 2015: 1-14. ). In this respect, Be-Breeder provides the user with two options.

In the first tab, “Genetic Diversity”, the user can enter files in text format (.txt) generated by the Genpop (Raymond and Rousset 1995Raymond M, Rousset F (1995) GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. Journal of Heredity 86: 248-249. ) (Table 5) or Structure (Pritchard et al. 2000Pritchard JK, Stephens M and Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945-959. ) software (Table 6), as indicated in the Help section and in the example files. These analysis were constructed through scripts that associate functions of the “ape” (Paradis et al. 2004Paradis E, Claude J and Strimmer K (2004) APE: analysis of phylogenetics and evolution in R language. Bioinformatics 20: 289-290. ), “poppr” (Kamvar et al. 2014Kamvar ZN, Tabima JF and Grünwald NJ (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2: 281. ), and “adegenet” (Jombart 2008Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24: 1403-1405. ) packages of the R software. As output, the user has a general summary with information on the populations and markers. From these estimates, observed heterozygosity (Ho) is obtained through “Ho = number of heterozygotes in each locus / number of individuals” and expected heterozygosity (He) , for p in reference to the frequency of the i-th allele (Nei 1978Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583-590. ). The F of Wright, for its part, is estimated by F=(He−Ho)/He (Wright 1965Wright S (1965) The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution: 395-420. ). Graphic outputs that involve multivariate principal component analysis, population structure, and dendrograms are obtained based on the Nei distance (Nei 1972Nei M (1972) Genetic distance between populations. American Naturalist: 283-292. )

Table 5
Example of the .txt file of genome data generated by the Genpop software to be used in the “Genetic Diversity” section

Table 6
Example of the .txt file of genomic data generated by the Structure software to be used in the “Genetic Diversity” tab

The second tab “Population genetics” uses the “popgen” function of the snpReady package and allows information to be obtained in reference to allele frequency per marker (p and q), minor allele frequency (MAF), expected heterozygosity (He) estimated through the formula He=2∗p∗q, observed heterozygosity (Ho), genetic diversity (DG), obtained by DG=−1− p 2 − q 2 , and polymorphic information content (PIC) estimated by PIC=−1− p 2 + q 2 −(2∗ p 2 ∗ q 2 ) (Hartl and Clark 2010Hartl DL, Clark A (2010) Princípios de genética de populações. Artmed, Porto Alegre, 660p. ). The function furthermore allows an argument for attributions of individuals in subpopulations, if there are any and they are known a priori. To carry out analysis, the user must provide the marker matrix parametrized for allele count (0,1,2) (Table 7) and, as an option, a vector with the information of subgroups within the population (subgroups) (Table 8). Thus, it will be possible to identify each one of these parameters for each subpopulation, as well as the existence of exclusive, absent, and fixed alleles in the subpopulations. If this last piece of information is not supplied, the application will consider only a single population.

Table 7
Example of genotype data entry for the “Population Genetics” section

Table 8
Example of information entry of subgroups within the population, for use in the “Population Genetics” section

RESULTS AND DISCUSSION

The Be-Breeder application allows the use of important tools of molecular breeding (GS and GWAS), as well as studies related to genetic diversity, in an easy way on line. It is expected that the practicality of the application allows the use of more complex analysis, superseding the need for mastery of R software programming. This application will assist researchers in providing access to quality results in a dynamic and interactive manner, integrating genome data to the routine of plant breeding programs.

REFERENCES

  • Butler DG, Cullis BR, Gilmour AR and Gogel BJ (2009) ASReml-R reference manual. Department of Primary Industries and Fisheries, The State of Queensland, 149p.
  • Chang W, Cheng J, Allaire J, Xie Y and McPherson J (2015) Shiny: web application framework for R. R package version 0.11 1.
  • Cole J, Newman S, Foertter F, Aguilar I and Coffey M (2012) Breeding and genetics symposium: Really big data: Processing and analysis of very large data sets. Journal of Animal Science 90: 723-733.
  • Endelman JB (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4: 250-255.
  • Fritsche-Neto R, Matias FI (2016) Be-Breeder-Learning: a new tool for teaching and learning plant breeding principles. Crop Breeding and Applied Biotechnology 16: 240-245.
  • Govindaraj M, Vetriventhan M and Srinivasan M (2015) Importance of genetic diversity assessment in crop plants and its recent advances: an overview of its analytical perspectives. Genetics Research International 2015: 1-14.
  • Hartl DL, Clark A (2010) Princípios de genética de populações. Artmed, Porto Alegre, 660p.
  • Heffner EL, Lorenz AJ, Jannink J-L and Sorrells ME (2010) Plant breeding with genomic selection: gain per unit time and cost. Crop Science 50: 1681.
  • Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24: 1403-1405.
  • Kamatani Y (2016) Genome wide association study: its theory and methodological review. Clinical Calcium 26: 525.
  • Kamvar ZN, Tabima JF and Grünwald NJ (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2: 281.
  • Korte A, Farlow A (2013) The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9: 1.
  • Meuwissen THE, Hayes BJ and Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819-1829.
  • Moose SP, Mumm RH (2008) Molecular plant breeding as the foundation for 21st century crop improvement. Plant Physiology 147: 969-977.
  • Nei M (1972) Genetic distance between populations. American Naturalist: 283-292.
  • Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583-590.
  • Paradis E, Claude J and Strimmer K (2004) APE: analysis of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.
  • Pritchard JK, Stephens M and Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945-959.
  • Raymond M, Rousset F (1995) GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. Journal of Heredity 86: 248-249.
  • VanRaden P (2008) Efficient methods to compute genomic predictions. Journal of Dairy Science 91: 4414-4423.
  • Varshney RK, Nayak SN, May GD and Jackson SA (2009) Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends in Biotechnology 27: 522-530.
  • Wright S (1965) The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution: 395-420.
  • Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG and Montgomery GW (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42: 565-569.

Publication Dates

  • Publication in this collection
    Mar 2017

History

  • Received
    09 Aug 2016
  • Accepted
    30 Nov 2016
Crop Breeding and Applied Biotechnology Universidade Federal de Viçosa, Departamento de Fitotecnia, 36570-000 Viçosa - Minas Gerais/Brasil, Tel.: (55 31)3899-2611, Fax: (55 31)3899-2611 - Viçosa - MG - Brazil
E-mail: cbab@ufv.br