Be-Breeder – an application for analysis of genomic data in plant breeding

Be-Breeder is an application directed toward genetic breeding of plants, developed through the Shiny package of the R software, which allows different phenotype and molecular (marker) analysis to be undertaken. The section for analysis of molecular data of the Be-Breeder application makes it possible to achieve quality control of genotyping data, to obtain genomic kinship matrices, and to analyze genomic selection, genome association, and genetic diversity in a simple manner on line. This application is available for use in a network through the site of the Allogamous Plant Breeding Laboratory of ESALQ-USP (http://www.genetica.esalq.usp.br/alogamas/R.html).


INTRODUCTION
The application of molecular information in genetic breeding of plants can lead to savings in time and money and increase precision and intensity of selection for programs that use this information (Moose andMumm 2008, Varshney et al. 2009).In addition, this information has allowed advances in studies in population genetics and species evolution through facilitating observation and comparison of existing genetic diversity measured by molecular markers (Govindaraj et al. 2015).
Due to the complexity of molecular markers, as well as the size of data sets generated, different statistical and computational approaches are necessary for researchers to extract all the information available for analysis (Cole et al. 2012).For that reason, software programs with different computational approaches have been developed to undertake molecular analysis in specific areas of genetics and plant breeding.In this context, the aim of this study was to develop an easily accessible application through scripts and packages of the R software, the molecular section of app Be-Breeder (Fritsche-Neto and Matias 2016) which simultaneously includes different molecular analysis as tools for hanalysisandling and transformation of datasets and matrices, methodologies of genomic selection and association, and mechanisms for studies of genetic diversity.

MATERIAL AND METHODS
The Be-Breeder application was constructed through the Shiny package of the R software (Chang et al. 2015) with a click point interface that allows the user to perform different analysis with molecular (marker) data.For that reason, scripts in R language were developed and implemented in the "molecular breeding" section by the team of the Allogamous Plant Breeding Laboratory of the Genetics Department of the Escola Superior de Agricultura "Luiz de Queiroz" (ESALQ) of the Universidade de São Paulo (USP).This is available for use at http://www.genetica.esalq.usp.br/alogamas/R.html.
In each one of the tabs to be presented below, there are practical examples and explanatory tutorials in the application itself regarding data entry and details of analysis (Help).

Genotyping data
Modifying and constructing matrices that are in agreement with the presupposed statistics or even those that are suitable for entry in analytical software is one of the main challenges encountered by researchers.In this respect, this section of the application provides the user with the tabs "Quality Control" and "Kinship Matrix".
The first tab uses the "raw.data"function of the snpReady package of R (under construction).This allows the user to control the quality of the genomic data.Data can be uploaded as matrix or columns, codified in nitrogen bases, as a function of the structure of the initial dataset received from genotyping (Table 1).
The present parameters of quality control are: -MAF (Minor Allele Frequency): consists of removal of markers with minor allele frequency, that is, lower than the threshold defined by the user.
-Call Rate: call rate parameter, which refers to the quality of genotyping in relation to lost data.Thus, markers with a Call Rate lower than the threshold defined by the user will be eliminated.
-Sweep sample: refers to the quality of genotyping in relation to the data lost in the individuals.Thus, individuals with a Sweep sample lower than the threshold defined by the user will be eliminated.This tab also allows data imputation, which is carried out from the combined probability of the alleles of a determined SNP (i) (frequencies of p i and q i ) and the level of homozygosity of the individual that has a marker to be imputed.The output generated by the function is a "clean" matrix, in which the output format can also be defined by the user, with the options of counting a reference allele for each locus (0,1,2), this matrix centered in zero (-1,0,1), or even the matrix in the appropriate format for entry in the Structure software.Thus, the resulting matrix is in the adequate format to be used in other software or packages or, moreover, to proceed with other analysis in the Be-Breeder application itself.
The "Kinship Matrix" tab uses the "G.matrix" function of the snpReady package and allows construction of different kinship matrices, as indicated by the formulas below: -UAR: unified additive kinship matrix (Yang et al. 2010) -UARadj: adjusted unified additive kinship matrix (Yang et al. 2010) FI Matias et al.
-WW: kinship matrix of VanRaden (VanRaden 2008) Thus, in this tab, in addition to the possibility of generating these three types of matrices, it is possible to change the format of the generated output.The formats available are the traditional kinship matrix (complete matrix) or an alternative format, in which the inverse of the kinship matrix is generated and then organized in columns for use in the package ASREML-R (Butler et al. 2009):

Genomic selection (GS)
Genomic selection is a potent tool in genetic plant breeding, the basis of which is the use of powerful statistical models to carry out prediction using thousands of markers.These models estimate the effects of molecular markers to predict characteristics of the individuals of this population without the need for direct phenotype observation (Meuwissen et al. 2001).This results in time and resources saved in the selection process (Heffner et al. 2010).For that reason, the statistical model must first be trained and validated and then use the effects of the markers as a predictive tool.The Be-Breeder application uses the RR-BLUP method, implemented in the rrBLUP package (Endelman 2011).Thus, it uses a mixed models approach (1), adopting the effects of the markers as random (2) through the REML (Restricted Maximum-Likelihood) approach.
In the "Prediction and Selection" tab, the user must enter with the marker matrix, in the format centered in zero (-1,0,1) to carry out analysis, in which the individuals are in the lines and the markers in the columns (Table 2).The phenotype dataset is also necessary (Table 3).In this tab, there is also the possibility of using the effects of the markers generated and the matrix of markers of non-phenotyped individuals to predict the genetic values of the individuals.

Genome association (GWAS)
Determination of regions in the gene related to a trait (QTL) through molecular markers is an initial step and of extreme importance for understanding the regulation and genetic structure of a phenotype of interest (Korte and Farlow 2013).This information allows the user to direct other bioinformation approaches, such as verification of genes associated with this QTL, to try to correlate the different transcripts with the phenotypes (Kamatani 2016).To do so, an important tool in breeding is genome association (GWAS), which uses mixed models associated with multiple regression, and by means of a threshold (LOD), determined by the user, which allows molecular markers to be found that exhibit greater correlation with the phenotypic variability of the trait under study.For that reason, in the "GWAS" section of the Be-Breeder, the user supplies the phenotype (Table 3) and genome (Table 4) data.From these data and through the "GWAS" function of the rrBLUP package (Endelman 2011), analysis of association between the markers and the QTL for the trait under study is carried out, considering the following model: in which y is the phenotype vector, β is the fixed effects vector, g is the genetic effects vector given as random, τ is the additive effects vector of the SNP being considered as fixed, and ε is the residue vector.X, Z, and S are incidence matrices of the model.
In addition to the estimate of the marker effect, the application creates a Manhattan plot graph with all the chromosomes, markers, and their respective effects.

Analysis of genetic diversity
To study genetic diversity among individuals and populations, molecular markers are quite useful and highly used currently (Govindaraj et al. 2015).In this respect, Be-Breeder provides the user with two options.
In the first tab, "Genetic Diversity", the user can enter files in text format (.txt) generated by the Genpop (Raymond and Rousset 1995) (Table 5) or Structure (Pritchard et al. 2000) software (Table 6), as indicated in the Help section and in the example files.These analysis were constructed through scripts that associate functions of the "ape" (Paradis et al. 2004), "poppr" (Kamvar al. 2014), and "adegenet" (Jombart 2008) packages of the R software.As output, the user has a general summary with information on the populations and markers.From these estimates, observed heterozygosity (Ho) is obtained through "Ho = number of heterozygotes in each locus / number of individuals" and expected heterozygosity (He) by He = 1 -Σ ≠alleles i =1 p 2 i , for in reference to the frequency of the i-th allele (Nei 1978).The F of Wright, for its part, is estimated by (Wright 1965).Graphic outputs that involve multivariate principal component analysis, population structure, and dendrograms are obtained based on the Nei distance (Nei 1972).
The second tab "Population genetics" uses the "popgen" function of the snpReady package and allows information to be obtained in reference to allele frequency per marker (p and q), minor allele frequency (MAF), expected heterozygosity (He) estimated through the formula He = 2 *p*q , observed heterozygosity (Ho), genetic diversity (DG), obtained by DG = -1 -p 2 -q 2 , and polymorphic information content (PIC) estimated by PIC = -1 -(p 2 + q 2 ) -(2 * p 2 * q 2 ) (Hartl and Clark 2010).The function furthermore allows an argument for attributions of individuals in subpopulations, if there are any and they are known a priori.To carry out analysis, the user must provide the marker matrix parametrized for allele count (0,1,2) (Table 7) and, as an option, a vector with the information of subgroups within the population (subgroups) (Table 8).Thus, it will be possible to identify each one of these parameters for each subpopulation, as well as the existence of exclusive,

Table 1 .
Example of data entry in the column format for the "Quality Control" section

Table 2 .
Example of genotype data entry for the "Genomic Selection" (GS) section

Table 3 .
Example of phenotype data entry for the "GenomicSelection" (GS) section

Table 4 .
Example of genotype data entry for the "Genomic Association" (GWAS) section

Table 5 .
Example of the .txtfile of genome data generated by the Genpop software to be used in the "Genetic Diversity" section

Table 6 .
Example of the .txtfile of genomic data generated by the Structure software to be used in the "Genetic Diversity" tab