Genetic-molecular characterization in guava full-sib progeny

Brazil is one of the world's largest producers of guava (Psidium guajava L.), a very promising fruit in the northern region of the state of Rio de Janeiro. Despite this, no guava cultivar has been developed for the region. Thus, this study proposed to examine a population of guava full sibs using microsatellite markers and to identify which genotypes are the most divergent for future crosses, to select cultivars better adapted to the soil and climatic conditions of northern Rio de Janeiro. Ninety-six superior genotypes were selected according to their agronomic traits and characterized using 45 microsatellite markers. The genetic distance between the analyzed genotypes, their clustering pattern and the genetic structure of the population were estimated. Hierarchical cluster analysis by the neighbor joining method indicated the formation of three distinct groups. The use of molecular information revealed the existence of moderate genetic variability between the genotypes of the full-sib families. Bayesian analysis separated the genotypes into only two groups, as the individuals shared most of the analyzed genomic regions. The most genetically divergent guava genotypes, that is, those allocated to different groups, such as genotypes 5 and 85, should be recommended for future crosses to obtain segregating populations, thus giving continuity to the guava breeding program.


INTRODUCTION
Guava (Psidium guajava L.) is a fruit tree native to South America. Brazil is among the world's largest producers of guava, having large areas where soil-climatic conditions are favorable to the production of the fruit (Almeida et al. 2009). Guava growing is a promising activity in the northern region of the state of Rio de Janeiro, which, in addition to being close to port facilities, holds the potential to boost the local economy through fruit farming (Gomes Filho et al. 2010).
Despite this, Brazilian producers currently face a problem: the low number of available cultivars adapted to producing regions. Only 18 cultivars are registered in the National Cultivar Registry (Registro Nacional de Cultivares-RNC), and no guava cultivar has been developed or recommended for the state of Rio de Janeiro so far. One of the goals of the breeder is to obtain productive cultivars adapted to their producing region (Ramalho et al. 2010), besides acceptance by the consumer market.
In this respect, guava has high genetic diversity, which is favored by cross-pollination (Silva et al. 2017). Knowing the genetic variability of cultivars is an essential factor for breeding programs, as it allows the optimization of the breeding strategy to be applied.
Molecular markers are a very important tool for understanding the genetic variability used in breeding programs. Microsatellite markers (SSR) are highly polymorphic, providing a large amount of genetic information per locus. They are abundant in the genome, multiallelic and easy to automate, in addition to being affordable (Turchetto-Zolete et al. 2017). SSR are also widely used in different countries as an efficient tool for the characterization of germplasm and in the study of genetic diversity in different Psidium species (Kareem et al. 2018).
https://doi.org/10.1590/1678-4499.20210267 PLANT BREEDING Article
Understanding the genetic structure of the population is essential for plant breeding. With a well-structured population, it is possible to select genotypes with desired and complementary characteristics. When the sharing of alleles between individuals is known, we are able to select more divergent genotypes for crosses, in order to generate a greater genetic variability and to look for a heterosis effect, which is useful in breeding programs (Bezerra et al. 2020).
This study aimed to characterize 96 pre-selected guava genotypes through 45 microsatellite loci, to estimate genetic variability in the population, and to identify and indicate the best crosses between genotypes with greater genetic distance.

Evaluated population
The evaluated population was established in the experimental area at the Antônio Sarlo School of Agriculture, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), located in Campos dos Goytacazes, in the northern region of the state of Rio de Janeiro, Brazil (21°08'02'' S and 41°40'47'' W, 88 m above sea level). The climate in the area is the Aw type (tropical subhumid and dry), with average annual temperature ranging from 22 to 25°C and average annual precipitation between 1,200 and 1,300 mm. The study was developed at the Plant Breeding Laboratory (Laboratório de Melhoramento Genético Vegetal-LMGV) at UENF. The 96 individuals genotyped in this study were indicated by Paiva et al. (2016), who selected the best genotypes in 17 full-sib families of guava using the restricted maximum likelihood/best linear unbiased prediction (REML/BLUP) method (Table 1).

Genomic DNA extraction and polymerase chain reaction
Samples of young leaves of the selected genotypes were collected in the experimental area and forwarded to the LMGV at UENF. The samples were macerated in liquid nitrogen, and genomic DNA was extracted using the procedure proposed by Doyle and Doyle (1990), with adaptations (Supplementary Material).
After extraction, the DNA was quantified by analysis on 1% agarose gel with 1X TAE buffer (Tris, sodium acetate, EDTA, pH 8) using the 100-bp (100 ng) Lambda (λ) marker (100 ng·μL⁻¹) (Invitrogen, Carlsbad, CA, United States of America). The DNA samples were stained using a mixture of GelRed™ and Blue Juice (1:1), and the image was captured by the Loccus L-PIX EX gel documentation system. Based on the obtained images, the DNA concentration was estimated relative to the 100-bp marker, and the DNA samples were diluted to a working concentration of 10 ng·μL⁻¹.
Amplification products were separated on 4% MetaPhor agarose gel, immersed in TAE buffer [90 mM Tris-acetate (pH 8) + 10 mM EDTA], stained with GelRed™ and Blue Juice (1:1), visualized by the Loccus L-PIX EX gel documentation system and compared with the 100-bp High DNA Mass Ladder (0.5 μL⁻¹) marker (Invitrogen, Carlsbad, CA, United States of America) during the runs to determine the size of the amplified fragments.
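The dilution to the 10 ng·μL⁻¹ working concentration follows the usual C1·V1 = C2·V2 relation. The sketch below illustrates the arithmetic only; the stock concentration and final volume are hypothetical values, not figures from this study.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul, final_volume_ul):
    """Return (stock volume, diluent volume) in microliters from C1*V1 = C2*V2."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already more dilute than the target")
    stock_volume = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_volume, final_volume_ul - stock_volume


# Hypothetical sample: a 50 ng/uL stock brought to 10 ng/uL in 100 uL.
stock_ul, diluent_ul = dilution_volumes(50.0, 10.0, 100.0)
print(stock_ul, diluent_ul)
```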

Statistical analysis
The data obtained from the amplification of 45 SSR were converted into numerical code for each allele per locus. The numerical matrix was developed by assigning values from 1 to the maximum number of alleles per locus, as described next: for a locus with three alleles, the representations of 11, 22 and 33 were used for the homozygous forms (A1A1, A2A2 and A3A3); and 12, 13 and 23 for the heterozygotes (A1A2, A1A3 and A2A3). From this numerical matrix, the genetic distance between the studied genotypes was calculated using GENES software (Cruz 2013).
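The coding scheme described above can be sketched in a few lines of Python. This is an illustration under our own assumptions (allele indices limited to 1 through 9, smaller index written first); the function names are hypothetical and are not part of the GENES software.

```python
def encode_genotype(allele_a, allele_b):
    """Encode two 1-based allele indices as a two-digit code, smaller index first."""
    low, high = sorted((allele_a, allele_b))
    if low < 1 or high > 9:
        raise ValueError("this sketch supports allele indices 1..9 only")
    return 10 * low + high


def decode_genotype(code):
    """Split a two-digit code back into its two allele indices."""
    return divmod(code, 10)


# A locus with three alleles: homozygotes give 11, 22, 33; heterozygotes 12, 13, 23.
codes = [encode_genotype(a, b) for a, b in [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]]
print(codes)
```

Sorting the pair first means that A2A1 and A1A2 receive the same code, so heterozygotes are recorded consistently regardless of scoring order.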
Three indices were tested to calculate the similarity between genotype pairs (Table 3). Three hierarchical clustering methods were also tested (Table 3): unweighted pair-group method with arithmetic mean (UPGMA), which uses arithmetic means of dissimilarity measures, avoiding characterization through extreme values (Cruz et al. 2011); the neighbor joining method, proposed by Saitou and Nei (1987), which groups the closest individuals with data from the distance matrix; and the Ward method, in which the similarity measure used in the cluster is the sum of squares between two clusters (Hair et al. 2009).
The use of different clustering methods for the same goal, without indicating the choice criterion, can make it difficult to compare results, since they are influenced by the method selected for the construction of the clustering matrix (Cerqueira-Silva et al. 2009).
In this study, the neighbor joining method was selected due to its high cophenetic correlation coefficient (CCC) and the similarity of its results to those obtained in Bayesian analysis.
The weighted index was chosen because it showed the highest cophenetic correlation. It is estimated by Eq. 1:

$$S_{ii'} = \frac{1}{2} \sum_{j=1}^{L} p_j c_j \qquad (1)$$

in which: L = number of loci analyzed; p_j = weight associated with locus j, determined by Eq. 2:

$$p_j = \frac{a_j}{A} \qquad (2)$$

a_j = total number of alleles at locus j; A = total number of alleles studied, in which (Eq. 3):

$$\sum_{j=1}^{L} p_j = 1 \qquad (3)$$

and c_j = the number of common alleles between the pairs of accessions i and i'.
The index deals with similarity measures, and in cluster analysis, it is recommended to use dissimilarity measures, defined by Eq. 4:

$$D_{ii'} = 1 - S_{ii'} \qquad (4)$$

in which S_{ii'} is the weighted similarity index between accessions i and i'.
After the generation of the distance matrix, the cluster analysis of individuals was performed via dendrogram, by applying the neighbor joining method, using Mega software version 6 (Kumar et al. 2008).
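To make the weighted index concrete, the sketch below computes it for two hypothetical individuals, assuming the index takes the form S = ½ Σ p_j c_j with p_j = a_j / A, as the symbol definitions above indicate, and that dissimilarity is its complement (1 − S). The genotypes, the allele counts per locus and the function names are illustrative assumptions; this is not the GENES implementation.

```python
from collections import Counter


def shared_alleles(genotype_1, genotype_2):
    """Number of alleles in common at one locus (0, 1 or 2 for diploids)."""
    return sum((Counter(genotype_1) & Counter(genotype_2)).values())


def weighted_similarity(individual_1, individual_2, alleles_per_locus):
    """Weighted index: 0.5 * sum over loci of p_j * c_j, with p_j = a_j / A."""
    total_alleles = sum(alleles_per_locus)  # A
    similarity = 0.0
    for g1, g2, a_j in zip(individual_1, individual_2, alleles_per_locus):
        similarity += (a_j / total_alleles) * shared_alleles(g1, g2)
    return 0.5 * similarity


# Hypothetical data: three loci with 2, 2 and 3 alleles (A = 7).
alleles_per_locus = [2, 2, 3]
individual_a = [(1, 1), (1, 2), (1, 3)]
individual_b = [(1, 1), (2, 2), (2, 3)]

similarity = weighted_similarity(individual_a, individual_b, alleles_per_locus)
dissimilarity = 1.0 - similarity
print(round(similarity, 4), round(dissimilarity, 4))
```

Because the weights sum to one and two identical diploid genotypes share two alleles at every locus, an individual compared with itself scores 1, which is why the factor ½ appears in the index.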
The diversity indices of the 96 genotypes were estimated using Genalex 6.5 software (Peakall and Smouse 2012), based on the following parameters: number of alleles per polymorphic locus (Na), number of effective alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), information index or Shannon index (I), and fixation index or inbreeding coefficient (F).
The Ne can be calculated by Eq. 5:

$$N_e = \frac{1}{\sum_{i} p_i^{2}} \qquad (5)$$

in which: p_i = frequency of allele i at the locus, so that Σ p_i² is the sum of the squared allele frequencies.
The information index, known as the Shannon index (I), is used to indicate diversity and can be calculated by Eq. 6:

$$I = -\sum_{i} p_i \ln p_i \qquad (6)$$

in which: p_i = allelic frequency for each of the alleles in question.
Ho is the proportion of heterozygous individuals observed in a studied population, calculated by Eq. 7:

$$H_o = \frac{N_x}{N} \qquad (7)$$

in which: Ho = observed heterozygosity; Nx = number of heterozygotes; N = total number of individuals in the sample.
He is the proportion of individuals expected to be heterozygous for a locus (Eq. 8):

$$H_e = 1 - \sum_{i} p_i^{2} \qquad (8)$$

in which: He = expected heterozygosity; p_i = frequency of allele i.
The fixation index (F), which can range from -1 to +1, estimates the mean coefficient of inbreeding, given by Eq. 9:

$$F = \frac{H_e - H_o}{H_e} = 1 - \frac{H_o}{H_e} \qquad (9)$$

in which: Ho = observed heterozygosity, which is the proportion of N samples that are heterozygous at a given locus; He = proportion of heterozygosity expected under random mating.
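As an illustration of how these per-locus parameters relate to one another, the following sketch computes them from a list of diploid genotypes using the standard definitions (Ne = 1/Σp², I = −Σ p ln p, Ho = heterozygotes/N, He = 1 − Σp², F = 1 − Ho/He). The genotype list is hypothetical and the function is our own simplification, not a call into Genalex.

```python
import math
from collections import Counter


def locus_diversity(genotypes):
    """Per-locus diversity parameters from a list of (allele, allele) tuples."""
    n = len(genotypes)
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    freqs = [count / (2 * n) for count in counts.values()]
    sum_sq = sum(p * p for p in freqs)
    ho = sum(1 for a, b in genotypes if a != b) / n
    he = 1.0 - sum_sq
    return {
        "Na": len(counts),
        "Ne": 1.0 / sum_sq,
        "I": -sum(p * math.log(p) for p in freqs),
        "Ho": ho,
        "He": he,
        # F is undefined for a monomorphic locus (He = 0).
        "F": (he - ho) / he if he > 0 else float("nan"),
    }


# Hypothetical biallelic locus with equal allele frequencies:
# 2 homozygotes 11, 4 heterozygotes 12, 2 homozygotes 22.
genotypes = [(1, 1)] * 2 + [(1, 2)] * 4 + [(2, 2)] * 2
result = locus_diversity(genotypes)
print(result)
```

For a locus with two equally frequent alleles, I equals ln 2 (about 0.69) and He equals 0.50, the upper bounds any biallelic locus can reach.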

Analysis of the genetic structure of the population
To assess the structure of the 96 genotypes, analyses were performed using the Bayesian method in Structure software version 2.3.4 (Pritchard et al. 2000).
Considering that the present study was carried out with a population of plants obtained from controlled crosses, we adopted the "no admixture" model, with correlated allelic frequencies (Cerqueira-Silva et al. 2014, Silva et al. 2016). The burn-in period and iteration number were set to 25,000 and 75,000, respectively, for each run. The number of groups (K) was varied systematically from 1 to 5, and 20 simulations were performed to estimate each K.
The ad hoc ΔK method described by Evanno et al. (2005), implemented in the online tool StructureHarvester (Earl and vonHoldt 2012), was used to estimate the most likely K for the population.
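The ΔK statistic can be sketched as follows. This is a simplified illustration, assuming ΔK is taken as the absolute second-order difference of the mean log-likelihood across K, divided by the standard deviation of the replicate log-likelihoods at K. The log-likelihood values are invented, and only three replicates per K are used for brevity instead of the 20 simulations run in this study.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Delta K for each K that has both neighbours K-1 and K+1 available."""
    averages = {k: mean(runs) for k, runs in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in averages or k + 1 not in averages:
            continue  # Delta K is undefined at the ends of the tested range
        second_diff = abs(averages[k + 1] - 2 * averages[k] + averages[k - 1])
        delta[k] = second_diff / stdev(lnp_by_k[k])
    return delta


# Hypothetical replicate log-likelihoods for K = 1..5.
lnp_by_k = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-885, -880, -890],
    5: [-882, -870, -894],
}
delta_k = evanno_delta_k(lnp_by_k)
best_k = max(delta_k, key=delta_k.get)
print(delta_k, best_k)
```

Note that ΔK cannot be evaluated at the smallest and largest K tested, so with K varying from 1 to 5 only K = 2, 3 and 4 are candidates.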
A threshold of 0.60 was adopted for the probability of association with a group. Based on the posterior probability of association (q) of a given genotype with a given group, relative to the total number of groups (K), we classified individuals with q > 0.60 as members of that cluster, whereas genotypes with q ≤ 0.60 for every cluster were classified as mixed (Cerqueira-Silva et al. 2014).
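The assignment rule amounts to a simple threshold on the largest membership probability. The sketch below is a hypothetical helper with made-up q values, shown only to make the rule explicit; it is not part of the Structure software.

```python
def assign_cluster(q_values, threshold=0.60):
    """Return the 1-based cluster whose q exceeds the threshold, else 'mixed'."""
    best = max(range(len(q_values)), key=lambda index: q_values[index])
    if q_values[best] > threshold:
        return best + 1
    return "mixed"


# Hypothetical membership probabilities for K = 2.
examples = [[0.80, 0.20], [0.38, 0.62], [0.55, 0.45], [0.60, 0.40]]
assignments = [assign_cluster(q) for q in examples]
print(assignments)
```

A genotype with q exactly equal to 0.60 is treated as mixed, since membership requires q strictly greater than the threshold.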

Diversity parameters via SSR
Genetic variability was detected between the genotypes evaluated for the 45 markers used. The number of alleles per locus ranged between two and three, averaging 2.13, with a total of 96 alleles for all evaluated loci.
In a genetic characterization study of guava accessions from different municipalities in Pakistan for germplasm formation, Kareem et al. (2018) found 85 alleles from 18 SSR primers, with an average of 4.7 alleles per locus, a high number when compared with the one found in this study. Costa and Santos (2013) analyzed genetic variability through 13 SSR in Psidium accessions from the Embrapa Semiárido germplasm bank and found a total of 183 alleles. The number of alleles per locus ranged from seven to 22, averaging 14.07.
Because the full-sib population evaluated in this study originates from previous selections, a low number of alleles is expected, and the high number of alleles observed in other studies is expected when evaluating accessions from germplasm banks.
The Ne ranged from 1.02 to 1.99 (Table 4). Ho values ranged from zero to 0.97, and He values ranged from 0.02 to 0.50 (Table 4). The observed mean heterozygosity (0.24) was lower than the expected mean heterozygosity (0.31), possibly suggesting the presence of null alleles. When a mutation occurs in the primer-binding site, preventing allele amplification, the number of supposed homozygotes in plants heterozygous for the allele increases (Carvalho et al. 2010).
The information index (I) was used to indicate the genetic diversity of the population and ranged from 0.07 (mPgCIR131) to 0.69 (mPgCIR106, mPgCIR149, mPgCIR204 and mPgCIR249), averaging 0.47 (Table 4). This result suggests moderate diversity in the population, and that primers mPgCIR106 and mPgCIR204 were the most efficient in discriminating genotypes with greater genetic diversity. These same primers obtained high Ho values, confirming the index information (Lacerda et al. 2001).
This index, also known as the Shannon index, started to be more commonly used in genetic analysis with the advent of bioinformatics (Sherwin et al. 2006). It varies from 0 to 1, with values closer to 0 denoting lower genetic diversity (Moura et al. 2005).
The fixation index (F), which corresponds to the inbreeding coefficient, was estimated for the entire population and averaged 0.42, ranging from -0.94 to 1 between loci (Table 4). Only 13 loci obtained negative values, which indicate excess heterozygosity. Substantial positive values indicate inbreeding or undetected null alleles. The presence of null alleles is a problem in microsatellite data analysis, as they can lead to a false interpretation of results (Souza et al. 2008).
The loci with negative values are the same whose Ho was greater than expected, which indicates that the alleles for these loci are not being fixed by inbreeding. The remaining loci, with positive values, have excessive homozygosity, which may mean failure in allele amplification, since the population evaluated originates from cross-pollination, and negative fixation indices would be expected. However, it is important to emphasize that, since the population comes from crosses between relatives and from previous selections, positive values may be predominant.

Dissimilarity by neighbor joining
Genetic dissimilarity was detected between the studied genotypes, and genotype 5 was the most divergent. The total mean dissimilarity was 0.26.
Once the dissimilarity matrix was obtained, the evaluated genotypes were clustered into three distinct groups using the neighbor joining method. This method was chosen because it shows greater similarity with the Bayesian analysis than the UPGMA method, with both having a close and high CCC.
The cutoff point in the dendrogram was determined using the criterion of Mojena (1977), with cuts at 73 to 80% dissimilarity and k = 1.25. This is a statistical criterion in which the calculation is based on the relative size of the distance levels in the dendrogram, dispensing with prior knowledge of the conformation of the groups (Faria et al. 2012).
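The Mojena rule can be sketched as a threshold equal to the mean of the dendrogram fusion levels plus k times their standard deviation, with groups counted from the fusions that exceed it. The fusion levels below are hypothetical, and the use of the sample standard deviation is an assumption of this sketch rather than a detail stated in the text.

```python
from statistics import mean, stdev


def mojena_cutoff(fusion_levels, k=1.25):
    """Cutoff = mean of fusion levels + k * their (sample) standard deviation."""
    return mean(fusion_levels) + k * stdev(fusion_levels)


def number_of_groups(fusion_levels, k=1.25):
    """Each fusion above the cutoff is left unmerged, adding one group."""
    cutoff = mojena_cutoff(fusion_levels, k)
    return 1 + sum(1 for level in fusion_levels if level > cutoff)


# Hypothetical fusion levels for a dendrogram of ten individuals (nine fusions).
levels = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.65, 0.80]
cutoff = mojena_cutoff(levels)
groups = number_of_groups(levels)
print(round(cutoff, 3), groups)
```

In this toy example, the two highest fusions lie above the cutoff, so cutting the dendrogram there leaves three groups.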
Genotypes 5, 14, 31, 38, 44 and 64 were clustered in the first group, highlighted by the green color (Fig. 1), which was the most distant. The greatest dissimilarity found (0.83) was between genotypes 5 and 85, which had 15 alleles in common in 31 of the 45 analyzed loci. All genotypes in this group exhibited greater dissimilarity with genotype 85. Group II, in red (Fig. 1), consisted of genotypes 18, 51 and 91. Like the genotypes in group I, genotype 18 was the most distant from genotype 85, with 0.74 dissimilarity and 22 shared alleles. Genotypes 51 and 91 were the most dissimilar to genotype 5, with dissimilarity values of 0.81 and 0.72, respectively.
Group III, in blue (Fig. 1), contained the largest number of individuals, with 87 genotypes in total (90.6%). This number of genotypes in the same group indicates that these individuals share the greatest number of alleles for the evaluated loci. Individuals 67 and 68 were the least dissimilar, that is, the closest, with 0.93 similarity and 84 alleles in common.
Silva et al. (2021) evaluated this same population for the traits of soluble solids content, fruit weight, pulp weight, number of fruits per plant and yield per plant. The individual with the highest yield per plant was 53. With this information, we may recommend its cross with individual 5, which, in addition to being the most genetically divergent, was also one of the individuals with high yield value per plant. The dissimilarity between them is 0.73, with 24 alleles in common.

Population genetic structure
Bayesian analysis suggested the formation of two groups (Fig. 2). This is a more rigorous analysis that allows observing the population structure in more defined groups, as it is less subjective than hierarchical methods, such as the neighbor joining method. Based on Evanno et al. (2005), the optimal delta K was observed when K = 2, suggesting that maximum structuring was observed when the sample was divided into two well-structured groups (Fig. 3).
A minimum probability of adhesion of 60% was adopted for assigning a genotype to a group. Thus, the evaluated population was separated as follows: group I, in red (Fig. 3), was formed by the majority of genotypes, with 88 in total (91.7%). This group contained most of the genotypes that belonged to the same group (III) in the analysis by the neighbor joining method (Tables 5 and 6 of the Supplementary Material).
Most genotypes belonging to group I have 100% probability of adhesion. However, some genotypes have mixed probability. For example, genotype 20 has 80% probability of adhesion to the red group, but 20% adhesion to the green group, indicating alleles shared with the genotypes in this group (Fig. 3, Tables 5 and 6 of the Supplementary Material). Eight genotypes (8.3%) were allocated to group II, in green (Fig. 3), namely 5, 14, 18, 31, 38, 44, 64 and 91. Genotypes 5, 14, 18, 31, 38, 44 and 64 showed near 100% adhesion to the green group. The adhesion of genotype 91 to the green group was just above 60%, indicating alleles shared with the red group.
This structuring means that group I has a set of alleles that differentiates it from group II for the set of markers used, which are genomic markers that are not related to phenotypic traits.
A low number of groups was formed. This happens when individuals share most of the genomic regions analyzed, which can be explained by the genetic structure of the population. These individuals are related and structured as full-sib families, in addition to having been previously selected based on their superior agronomic traits.
Obtaining two groups is enough to help direct the next crosses between guava genotypes belonging to distinct groups. Thus, the 96 genotypes show a clearly defined genetic structure.
Considering the different types of analysis, preferential crosses are indicated for genotypes that are in different groups, as they are more genetically distant, which is expected to increase hybrid vigor. Additionally, the number of alleles evaluated and shared among them should also be observed, by selecting the most divergent ones.
Furthermore, for selection purposes, these results should be combined with the agronomic data of the evaluated population, by selecting the individuals with the greatest agronomic potential and, thereby, increasing gains and the efficiency of the selection process.

CONCLUSION
The SSR used in this study were efficient in discriminating guava full-sib genotypes. There was variability in the evaluated population, which was structured in three distinct groups by the neighbor joining hierarchical method and in two groups by Bayesian analysis.
The Shannon index indicated that the primers used in the study were efficient in estimating genetic divergence in the population, which was moderate.
It is possible to indicate the most genetically divergent genotypes, that is, those allocated to different groups, to be used in the guava breeding program to obtain segregating populations. Accordingly, crosses between individual 5 and genotypes 85, 89, 45, 51, 8, 16, 19 and 96, which were the most dissimilar to it, are recommended.

Adaptations to the Doyle and Doyle (1990) protocol for genomic DNA extraction from Psidium guajava L.
Young leaf samples of selected genotypes were collected and macerated in liquid nitrogen, and genomic DNA extraction was performed using the procedure proposed by Doyle and Doyle (1990), with some adaptations. The protocol was as follows: 900 μL of extraction buffer containing 2% CTAB, 2 mol·L⁻¹ NaCl, 20 mmol·L⁻¹ EDTA, and 100 mmol·L⁻¹ Tris-HCl (pH 8) were added to the tubes containing the macerated samples, as well as 2% PVP and 2% mercaptoethanol, the latter two necessary for the removal of phenolic compounds. The material was incubated at 65°C for 40 minutes and gently homogenized by inversion every 10 minutes.
After the samples reached room temperature, the tubes were centrifuged for 8 minutes at 14,000 rpm, and the supernatant was transferred to a new 2-mL tube. Then, 800 μL of chloroform:isoamyl alcohol (24:1) was added to carry out the deproteinization. This material underwent gentle inversions for approximately 10 minutes, until it became cloudy.
The organic phase was separated by centrifugation at 14,000 rpm for 8 minutes. The supernatant was transferred to a properly identified 2-mL tube. These steps were repeated three more times, with the volume of chloroform added in each step exceeding the volume of the supernatant by 100 μL. Nucleic acids were precipitated by adding ice-cold isopropanol at two-thirds of the supernatant volume (500 μL), followed by incubation for 30 minutes at -70°C or for 3 hours at -20°C.
The precipitate was sedimented by centrifugation at 14,000 rpm for 10 minutes. The supernatant was discarded, and the precipitate was washed twice with 500 μL of chilled 70% ethanol to remove the salt present and twice more with 500 μL of chilled 95% ethanol. After each wash, the material was centrifuged for 5 minutes at 14,000 rpm.
After discarding the last supernatant, the material was placed in a dry bath until all ethanol was removed.
Then, the material was resuspended in 100 μL of TE solution (Tris-EDTA: 10 mmol·L⁻¹ Tris-HCl, 1 mmol·L⁻¹ EDTA, pH 8) with RNase at a final concentration of 10 μg·mL⁻¹ and incubated in a water bath at 37°C for 40 minutes. The material was then stored at -20°C until use.

Table 6. Separation of genotypes and their respective families in each group suggested by Bayesian analysis.