‘Gametes Simulator’: a multilocus genotype simulator to analyze genetic structure in outbreeding diploid species

: Bulk sampling and subsequent DNA fingerprinting are applied to obtain estimated allelic frequency data and unveil population structure for outbreeding species. We developed ‘Gametes Simulator’, a routine to simulate gametes from allelic frequencies of unlinked loci, which enables the analysis of population structure through Bayesian analysis. Based on the allelic frequen cies in the populations, the software simulates the alleles per population in the right proportions and assigns one allele per marker to each gamete following Mendel’s laws that govern gametogenesis. multiplying an allelic frequency (Af) by the number of gametes to simulate (Gs), (Af.Gs); if the decimal part of the result is equal to or higher than 0.5 alleles, the software will round the value up to the next integer number, and if it is less than 0.5, it will return the integer number without the decimal part. For example, if Af.Gs ≥ 0.5 but < 1.5, the program will simulate 1 allele, and if Af.Gs is ≥ 1.5 and < than 2.5, it will simulate 2 alleles due to the evident fact that it is not possible to simulate decimal parts of an allele or any decimal part of a gamete (and it is also impossible for this to happen in nature). To downsize the errors due to the use of a small sample size (Gs), we strongly recommend simulate a number of gametes that make possible that the lowest allelic frequency in the input matrix then will be represented at least with one allele in the simulation. To that end, the researcher should take into account that the allele with the lowest frequency in the input matrix will be represented in the population’s sample of gametes at least once only if the sample size (Gs: number of gametes to be simulated) is as large as the quotient (q) that represents the inverse of the lowest allelic frequency ( lowest Af) in the input (q = 1/ lowest Af). To avoid inaccuracies due to the sample size, a 2q number of gametes can be simulated.


INTRODUCTION
Knowledge of the genetic structure of plant populations and germplasm collections is a crucial factor in the development of conservation strategies (Mutegi et al. 2015) and in the exploitation of the breeding potential of domesticated species (Crystian et al. 2018, Ferreira et al. 2018, Guimarães et al. 2019 and their wild relatives (Dempewolf et al. 2017). In animal populations knowledge of the genetic structure of the populations is also crucial to design efficient conservation strategies (Contina et al. 2019). Genetic structure in outbreeding species is determined by multiple factors, such as assortative mating, bottlenecks, genetic drift, kinship, and natural selection (Thornhill 1993), as well as artificial selection in the case of domesticated species (McCouch 2004). The nonexistence of Hardy-Weinberg equilibrium in a population could be the result of the action of one or more factors and occurs frequently in populations of wild and domesticated species (Waples 2014).
Genetic diversity analysis in outbreeding species requires the sampling of a high number of individuals per population (a minimum of twenty-five to thirty individuals) to accurately estimate allelic frequencies within each population (Hale et al. 2012). A high number of individuals per population increase the associated costs, and these costs increase even more when multiple populations must be analyzed. To solve this resource limitation, bulk DNA fingerprinting can be applied, reducing the costs and time involved as a result of bulking the DNA from fifteen or more individuals into a population DNA sample (Reif et al. 2005, Dubreuil et al. 2006, Porta et al. 2018. From these bulk samples, allelic frequencies are precisely estimated (Reif et al. B Porta et al. 2005, Dubreuil et al. 2006, Bedoya et al. 2017 but the ability to recognize the genotype of each of the individuals belonging to the bulk sample is lost. Bayesian analysis using the Structure algorithm (Pritchard et al. 2000) can be used to detect underlying populational genetic structure, which is difficult to detect using phenotypic traits but significant in terms of population genetic differentiation and evolution. This approach estimates the best-fitting number of subpopulations in a sample and predicts the probability that a certain genotype belongs to a subpopulation. Nevertheless, it requires knowledge of the individual profile of diploid or haploid genotypes (Falush et al. 2003).
In the case of bulk sampling, where allele frequencies are estimated precisely (Dubreuil et al. 2006), the simulation of individual gametes based on observed allelic frequencies arises as a suitable solution (Porta et al. 2018). The simulated haploid gametes can then be used as input into the Structure program (Falush et al. 2003) to detect the genetic structure of the populations.
The right assumptions and data selection procedure for the simulation of gametes or individuals are crucial to obtain nonbiased results. Whenever diploid individuals are simulated, on the basis of allelic frequencies, is made the assumption that the population is in Hardy-Weinberg equilibrium. This fact might lead to nonrealistic simulations since deviations from Hardy-Weinberg equilibrium are highly common in natural populations as a result of a variety of biological and non-biological reasons (Waples 2014): population size, assortative mating, selection, migration, etc. By the contrary, the simulation of haploid gametes according to the allelic frequency data is independent of the state of the population with respect to the Hardy-Weinberg equilibrium. Diploid species generally produce haploid gametes (Mable and Otto 1998). Gametes simulation based on the allele frequencies of unlinked loci does not imply any assumption regarding the phase of the alleles in the genotypes, the proportions of homozygous and heterozygous combinations or the existence of linkage disequilibrium, therefore preventing bias in the result.
We developed 'Gametes Simulator', a computer routine to simulate gametes from the allelic frequencies of unlinked loci, which enables the analysis of population structure through Bayesian analysis in Structure software. The objective is to make this routine available to other scientists working on the genetic structure of outbreeding diploid species.

Scope and theoretical assumptions
'Gametes Simulator' software simulates haploid gametes produced by diploid organisms, which implies that the gametes are produced due to the gametogenesis process through meiosis I and II. The 1st and 2nd laws of Mendel are the direct descriptions of chromosome movement during meiosis. Furthermore, chromosome movement during meiosis is responsible for allele assortment in the gametes. Consequently, 'Gametes Simulator' software adopts the following theoretical assumptions: • Mendel´s first law or law of segregation. Each gamete will have only one of the two parental gene copies. The gene copies are randomly allocated to the gametes due to: 1) the random alignment in metaphase I of the homologous chromosomes and its migration towards opposite poles in anaphase I and 2) the orientation and separation of the sister chromatids in anaphase II during the gametogenesis process.
• Mendel´s second law or law of independent assortment.This implies that the alleles of two or n different unlinked genes are assorted independently into gametes. There is no influence of an allele received by a gene on an allele received by another gene and vice versa. The law of independent assortment has its physical basis during meiosis I in the gametogenesis process, when the homologous chromosome pairs line up in random orientations in the middle of the cell during metaphase I. Consequently, the gametes will have different combinations of maternal and paternal homologues, which also determine the alleles carried in each case. This law is applicable in the case of unlinked loci, with the loci in linkage equilibrium, because they are either in different chromosomes or their recombination frequency is 50%, or the map distance is equal to or higher than 50 cM. This element is crucial to the right usage of the software: the selected loci should be known to be unlinked.
In summary, two mechanisms are responsible for the assortment of alleles during gametogenesis: 1) the random alignment of the bivalents in metaphase I, which results in a random combination of maternal and paternal chromosomes allocated to each gamete and therefore the alleles they carry, and 2) the crossing over or recombination during prophase I of meiosis among homologous chromosomes, which leads to the physical assortment of alleles in the genes along a chromosome. The homology between the gametogenesis process and the algorithm followed by 'Gametes Simulator' software is presented in Table 2 as well as the outputs generated per algorithmic step and the biological meaning.

Practical procedures
This Java program uses the matrix of estimated allelic frequencies in populations genotyped through bulk DNA sampling -or other methodologies -as input ( Table 1). The matrix contains the allelic frequencies, with the studied populations as columns and each allele as a row. The alleles are identified using the first two columns: 1) the marker name or loci used to genotype the populations and 2) the name of the specific allele. This matrix is an ASCII-type file from Excel. The matrix could be generated by the program FreqsR, specifically designed to determine allelic frequencies in bulk-sampling populations of maize genotyped by microsatellites (Franco et al. 2005), or by any other program able to generate output containing all the information specified in Table 1.
To install 'Gametes Simulator' software, see the instructions in File 1 -'Instructions to install and use Gametes Simulator Software.docx' at <https://github.com/pfernandezgr/gamete-simulator/>, and read and follow them carefully. Once the program is installed, direct access opens the window shown in ( Figure 1A). The input file (the Excel file with the matrix of allelic frequencies) should be browsed and selected. Then, the number of multilocus haploid genotypes (gametes) per population to be simulated must be indicated, filling in the corresponding box 'Gametes per population' ( Figure 1A). Finally, this program provides the option to indicate the type or number required for missing data in the output. It is strongly recommended to use the same type or number used for empty cells in the next program in which this output will be used as an input. For example, -9 is used in Structure for empty cells, so it is very useful that the output generated by the 'Gametes Simulator' program has already incorporated the -9 into the empty cells. Once everything is completed, press 'Start'.
After running the program, two Excel files are generated consecutively as output. The first output Excel file allows us to visualize the process by which the gametes within a population are generated per marker according to the allele frequencies within the populations. The information per marker is provided in individual Excel sheets. The first column provides the names of the populations, the second column enumerates the gametes simulated within each population, Table 1. Generalized matrix of allelic frequencies found in populations genotyped with molecular markers. Empty cells represent populations not genotyped with the corresponding molecular marker. To see a practical example access to <https://github.com/ pfernandezgr/gamete-simulator/> and look at Figure 1  the third column contains a binary code assigned to each type of allele within a population, and in the fourth column, each allele within a population is repeated according to its allelic frequency in the population multiplied by the number of gametes simulated within each population. Finally, in the fifth column, the alleles within each population (shown in the fourth column) are ordered randomly within its corresponding population.
The second and final output Excel file uses all the information from the previous Excel file and contains the information required for the input matrix for Structure v.2.3.4 ( Figure 1B).To be used in Structure v.2.3.4, this matrix should be transformed according to the instructions given in the document 'Instructions to install and use Gametes Simulator software' in the paragraphs under the title 'Final step'.

History used for the development of 'Gametes Simulator'
This program was created to analyze the genetic structure of populations (landraces) of maize characterized through bulk DNA fingerprinting. A previously designed program called FtoL (Franco et al. 2007) was developed in the R platform  Table 1. Before running the simulation, the user must provide the number of gametes to be simulated within each population and the text that should be used to fill empty spaces as required for the output. B. Final output generated by 'Gametes Simulator'software. In the first column is the name of the population repeated as many times as Gs (gametes simulated per population).The second column corresponds to the ordinal number of the population (rows have the same number if they correspond to gametes simulated for the same population). The third column enumerates the gametes simulated within each population. From the fourth column on, the alleles present in each gamete for each molecular marker are detailed in such a way that each row represents the multilocus genotype of a gamete. to solve the same problem. However, FtoL assumes that populations are in Hardy-Weinberg equilibrium, and diploid individuals are simulated using the allelic frequencies in each population. Equilibrium might not be a common situation in natural populations or landraces due to many causes: natural selection, selection made by farmers, genetic drift due to small population size, inbreeding due to assortative mating or not random mating (Ramos et al. 2019), or exogamy with other neighbor populations, especially in outbreeding species (Porta et al. 2018). The simulation of diploid individuals in landraces or natural populations assuming Hardy-Weinberg equilibrium could certainly induce bias. To avoid bias, we simulate haploid gametes using the allelic frequencies in the population. The simulation of gametes allows the analysis of the genetic structure of the population but does not reveal evolutionary processes acting in populations or the reproductive success of these gametes in the next generation. Therefore, simulated gametes represent a snapshot of the genetic composition of a population but cannot be used to predict the genetic frequencies of the next generation unless the population is assumed to be in Hardy-Weinberg equilibrium.

Performance characteristics
The 'Gametes Simulator' program uses as input the matrix of allelic frequencies per population in the format shown in Table 1. The program simulates the alleles within each population in the right proportions according to the allelic frequencies within the populations. The software orders the alleles within the populations at random. Each gamete is represented in a row and is built by the union of the alleles in the consecutive columns. Each column represents a different marker used to characterize the populations. Simulated gametes are haploid,with only one allele from each molecular marker. Tens of gametes per population can be easily simulated, but the simulation of hundreds of gametes could be hampered. The maximum number of gametes per population that can be simulated with Gametes Simulator software will depend on the matrix size (number of markers, alleles, and populations) as well as the hardware characteristics, particularly the available RAM (Random Access Memory).
The accuracy of the simulated gametes has been checked and can be checked by any researcher using this software with the reverse procedure: calculating the allelic frequencies from the alleles simulated per marker within populations (final output of the 'Gametes Simulator' software); the result obtained is the same as the data fed into the program in the input matrix. It must be taken into account that in certain cases a non-integer number of gametes is achieved by Table2. Summary of 'Gametes Simulator' algorithm, outputs, their homology with the gametogenesis process and the biological meaning

Algorithm
Output Homology with gametogenesis Biological meaning 1. Takes the matrix of allelic frequencies per marker and per population as input.
The fraction of all chromosomes in the diploid population that carry each allele 2. Translates the genetic frequencies of each allele in each population into repeated alleles as many times as Af # × Gs* 4th column in each marker Excel sheet in the file 'First output of the software' Real number of allele copies expected in a sample of gametes taken from each population Size of the sample: Gs* 3.Takes the output from 2 and sorts at random the alleles per marker within each population, allocating one allele randomly to each gamete 5th column in each marker Excel sheet in the file 'First output of the software' Random allocation of each allele per marker (locus) to each future gamete in the gametogenesis process (1st Mendel's law) and independently among unlinked markers (2nd Mendel's law). Also independent among populations because the allelic frequencies per population reported in the input matrix have no influence on one another Genetic variability among gametes within populations determined by random and by the allelic frequencies present in each population 4.Multilocus haplotype assemblage (gamete genotype): takes the output from the previous step (5th column of each Excel marker sheet) and assembles it considering all the markers. Gametes Simulator does so by gathering the genetic information for each locus per gamete after the randomization process.
Final matrix that has the list of simulated gametes´ haplotypes in each population as rows and all the markers as columns Chromatid assignment to each future gamete during anaphase II in meiosis II to give place to the gametes. The alleles present in each chromatid were determined previously by random recombination events (in prophase I) and the random alignment and splitting up of homologues (in metaphase I; anaphase I) and sister chromatids (in metaphase II; anaphase II) in meiosis I and II, respectively.
Multilocus haplotype of gametes belonging to a certain population # Af: allelic frequencies reported in the input matrix; *Gs: number of gametes to be simulated per population. Haplotype assemblage: assignment of individual alleles (following Mendel's 1st and 2nd laws using the alleles present in the population) to sets that correspond to each of the different simulated gametes.

B Porta et al.
multiplying an allelic frequency (Af) by the number of gametes to simulate (Gs), (Af.Gs); if the decimal part of the result is equal to or higher than 0.5 alleles, the software will round the value up to the next integer number, and if it is less than 0.5, it will return the integer number without the decimal part. For example, if Af.Gs ≥ 0.5 but < 1.5, the program will simulate 1 allele, and if Af.Gs is ≥ 1.5 and < than 2.5, it will simulate 2 alleles due to the evident fact that it is not possible to simulate decimal parts of an allele or any decimal part of a gamete (and it is also impossible for this to happen in nature). To downsize the errors due to the use of a small sample size (Gs), we strongly recommend simulate a number of gametes that make possible that the lowest allelic frequency in the input matrix then will be represented at least with one allele in the simulation. To that end, the researcher should take into account that the allele with the lowest frequency in the input matrix will be represented in the population's sample of gametes at least once only if the sample size (Gs: number of gametes to be simulated) is as large as the quotient (q) that represents the inverse of the lowest allelic frequency ( lowest Af) in the input (q = 1/ lowest Af). To avoid inaccuracies due to the sample size, a 2q number of gametes can be simulated.

Applications
This program is applicable to analyze genetic diversity in outbreeding species (animal and allogamous plant species) with populations constituted of diploid individuals. It enables the posterior analysis of the genetic structure of populations with a Bayesian approach using the Structure program. The resultant output is the simulation of multilocus gametes genotypes from allelic frequencies of unlinked loci in populations characterized through bulk sampling or other methodologies. This output acts as input for Structure (Pritchard et al. 2000, Falush et al. 2003).

Forms of access
The software was developed using Java ® and can be downloaded from <https://github.com/pfernandezgr/gamete-simulator/>