Optimal use of SSR markers for varietal identification of upland cotton

The objective of this work was to identify polymorphic simple sequence repeat (SSR) markers for varietal identification of cotton and evaluation of the genetic distance among the varieties. Initially, 92 SSR markers were genotyped in 20 Brazilian cotton cultivars. Of this total, 38 loci were polymorphic, two of which were amplified by one primer pair; the mean number of alleles per locus was 2.2. The values of polymorphic information content (PIC) and discrimination power (DP) were, on average, 0.374 and 0.433, respectively. The mean genetic distance was 0.397 (minimum of 0.092 and maximum of 0.641). A panel of 96 varieties originating from different regions of the world was assessed by 21 polymorphic loci derived from 17 selected primer pairs. Among these varieties, the mean genetic distance was 0.387 (minimum of 0 and maximum of 0.786). The dendrograms generated by the unweighted pair group method with arithmetic average (UPGMA) did not reflect the regions of Brazil (20 genotypes) or around the world (96 genotypes), where the varieties or lines were selected. Bootstrap resampling shows that genotype identification is viable with 19 loci. The polymorphic markers evaluated are useful to perform varietal identification in a large panel of cotton varieties and may be applied in studies of the species diversity.


Introduction
The genetic diversity of the genus Gossypium is high among wild and domesticated cotton species.
Wild cotton comprises 45 diploid and 5 tetraploid species (Campbell et al., 2010), of which four have been independently domesticated in four different regions worldwide. Gossypium hirsutum L., one of the tetraploid species, is classified into seven different races: one wild and six domesticated (Lacape et al., 2007). The G. hirsutum race latifolium Hutch., or upland cotton, is economically the most important one and, besides being broadly adapted, is also the main fiber crop (Campbell et al., 2010). However, there is evidence that cotton genetic diversity has been declining in breeding programs (Paterson et al., 2004), which can also lead to a reduction in yield gain through breeding, since diversity is required for selection (Campbell et al., 2010).
Diversity can be increased by using wild genotypes in breeding. However, there are constraints related to crossing barriers (Pereira et al., 2012) and mainly to traits largely different from those required for commercial cotton production. For this reason, cotton breeding has usually been performed from a narrow genetic base. In this context, intraspecific polymorphic markers can assist breeders, by easily displaying relevant genetic diversity among cotton lineages from overlooked germplasm banks, for instance.
Moreover, the identification of cotton varieties is important during breeding and registration processes, and during seed production, trade, and inspection. The identification of intraspecific polymorphic markers for varietal and cultivar discrimination is necessary, considering the narrow cotton genetic base and consequent insufficiency of morphological descriptors (Zhu et al., 2014). The use of molecular markers serves as a modern and suitable approach to varietal and cultivar identification as it is more rapid and cost effective (Korir et al., 2013). In addition, breeding efforts are facilitated by information on the genetic diversity of available germplasm resources, including those from commercial seeds.
Currently, most of the available polymorphic molecular markers for cotton varieties are interspecific, and genetic diversity, as well as molecular mapping in cotton, has frequently been done with more than one species or race (Blenda et al., 2006; Menezes et al., 2008). Despite the high diversity of SSR markers in Gossypium genomes (Lacape et al., 2007), the documentation of markers that are intraspecifically polymorphic in G. hirsutum is still incipient.
The objective of this work was to identify polymorphic simple sequence repeat (SSR) markers for varietal identification of cotton and evaluation of the genetic distance among varieties of the species.

Materials and Methods
Two collections of plant material were used. The first one was composed of 20 Brazilian genotypes. Of these, 5 were commercial cultivars, identified by the BRS prefix, and 15 were inbred lines, identified by the prefix of the state in which they were selected: two from Bahia (BA), five from Goiás (GO), five from Mato Grosso (MT), and three from Paraíba, identified by CNPA. This collection represents genotypes of the breeding programs of Embrapa Algodão, developed for the main producing areas in the country. Genomic DNA was isolated from the endosperm of one seed for each genotype, placed in microtubes with sodium dodecyl sulfate extraction buffer (McDonald et al., 1994) and ground with beadbeater equipment (BioSpec Products, Inc., Bartlesville, OK, USA).
The second collection, also from the germplasm bank of Embrapa Algodão, was composed of 96 worldwide genotypes: 22 current Brazilian varieties, 51 obsolete varieties (17 from Brazil, 17 from the USA, 7 from Mexico, 2 from Argentina, 1 from Venezuela, 2 from India, 2 from China, 1 from Uzbekistan, and 2 from Africa), and 23 lineages of unknown origin selected or maintained for having special traits (mainly disease resistance and superior fiber traits). Genomic DNA was extracted from young leaves through the CTAB method (Plant..., 2014). After both extraction batches, DNA was quantified by comparison with known amounts of lambda phage DNA (Invitrogen, Carlsbad, CA, USA) in 0.8% (w/v) agarose gels stained with SYBR Gold (Invitrogen, Carlsbad, CA, USA).
Table 1. Information on repeat motifs, MgCl2 concentration (mmol L-1), annealing temperature (AT, °C), and chromosome location for all simple sequence repeat (SSR) markers used in the intraspecific polymorphism screening of the collection of 20 upland cotton (Gossypium hirsutum race latifolium Hutch.) genotypes(1).
(1)Data were collected from the CottonGen (Yu et al., 2014) and CottonDB (Cotton…, 2011) online databases and from the variety developers.
The amplification reaction for the BNL primers was performed with an initial denaturation at 95°C for 12 min, followed by 30 cycles at 93°C for 1 min, 55°C for 2 min, and 72°C for 3 min, with a final extension at 72°C for 7 min. For the CIR, JESPR, and NAU primers, the initial denaturation was at 94°C for 5 min, followed by 35 cycles at 94°C for 30 s, the annealing temperature recommended for each primer pair (Table 1) for 1 min, and 72°C for 1 min, with a final extension at 72°C for 8 min. PCR products were electrophoresed on 6% (w/v) polyacrylamide gels and stained with silver nitrate. SSR data were scored visually, and fragment size estimates were based on their mobility relative to a 50-bp ladder size standard (Invitrogen, Carlsbad, CA, USA).
Two measures of marker informativeness were obtained for each polymorphic SSR locus in the collection of 20 genotypes. Polymorphic information content (PIC) and discrimination power (DP) values were calculated as proposed by Botstein et al. (1980) and Tessier et al. (1999), respectively. Furthermore, individual observed heterozygosity (Hi), i.e., the percentage of heterozygous SSR loci, was calculated for each G. hirsutum cultivar.
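The two informativeness measures can be illustrated with a minimal Python sketch. It assumes the usual forms of the cited formulas: PIC = 1 - Σp_i² - ΣΣ 2p_i²p_j², with p_i the allele frequencies, and DP = 1 - Σ p_i(Np_i - 1)/(N - 1), with p_i the frequency of each banding pattern among N genotypes. The function names and the example locus are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """Allele frequencies at one locus; each genotype is a tuple of two alleles."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def pic(genotypes):
    """Polymorphic information content, assuming the Botstein et al. (1980) form."""
    freqs = list(allele_frequencies(genotypes).values())
    homozygosity = sum(p ** 2 for p in freqs)
    cross_term = sum(2 * p ** 2 * q ** 2 for p, q in combinations(freqs, 2))
    return 1 - homozygosity - cross_term


def discrimination_power(genotypes):
    """Discrimination power D = 1 - C, assuming the Tessier et al. (1999) form.

    C is the probability that two genotypes drawn without replacement
    show the same banding pattern at this locus.
    """
    n = len(genotypes)
    patterns = Counter(tuple(sorted(g)) for g in genotypes)
    confusion = sum((c / n) * (c - 1) / (n - 1) for c in patterns.values())
    return 1 - confusion


# Hypothetical biallelic locus scored in 20 homozygous genotypes (allele sizes in bp).
example_locus = [(150, 150)] * 10 + [(154, 154)] * 10
print(round(pic(example_locus), 3), round(discrimination_power(example_locus), 3))
```

In this form, DP depends on the number of genotypes evaluated, whereas PIC depends only on allele frequencies.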
For varietal identification, the number of markers required was estimated by performing 1,000 bootstrap resamplings over an increasing number of polymorphic loci using the GenClone software, version 2.0 (Arnaud-Haond & Belkhir, 2007). Afterwards, genotyping was carried out in a panel of 96 genotypes for varietal identification. Amplification was performed according to the recommendations of the respective developers, as described before, multiplexing up to five primer pairs with similar annealing temperatures and amplifying fragments of contrasting molecular weights. One primer of each pair was labeled with either 6-FAM, HEX, or NED (Applied Biosystems Inc., Foster City, CA, USA) and was combined with others in batches. Initial denaturation was at 95°C for 15 min, followed by 40 cycles at 94°C for 1 min, annealing at 51 or 55°C for 90 s (depending on the primer combination), and 72°C for 90 s, with a final extension at 72°C for 8 min. The obtained PCR products were run on an ABI 3500xL sequencer and scored using the GeneMapper software (Applied Biosystems Inc., Foster City, CA, USA).
Subsequently, the GenClone software, version 2.0 (Arnaud-Haond & Belkhir, 2007), was used to repeat the 1,000 bootstrap resamplings over the markers scored in the collection of 96 genotypes. This was done in order to verify the reliability of this number of markers in discriminating the so-called multilocus genotypes (MLG).
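The resampling idea can be sketched in Python as follows. This is an illustration only, not the GenClone implementation: it assumes that, for each number of loci k, random subsets of k loci are drawn without replacement and the distinct multilocus genotypes are counted. The function names and the toy genotype table are hypothetical.

```python
import random


def count_mlg(genotype_table, loci):
    """Count distinct multilocus genotypes using only the given locus indices."""
    return len({tuple(row[i] for i in loci) for row in genotype_table})


def mlg_resampling(genotype_table, n_resamples=1000, seed=0):
    """Return {k: (min, mean, max)} distinct MLG over random subsets of k loci."""
    rng = random.Random(seed)
    n_loci = len(genotype_table[0])
    summary = {}
    for k in range(1, n_loci + 1):
        counts = [
            count_mlg(genotype_table, rng.sample(range(n_loci), k))
            for _ in range(n_resamples)
        ]
        summary[k] = (min(counts), sum(counts) / len(counts), max(counts))
    return summary


# Hypothetical table: 4 genotypes (rows) scored at 3 loci (columns).
toy_table = [
    [(1, 1), (2, 2), (3, 3)],
    [(1, 1), (2, 2), (3, 4)],
    [(1, 1), (2, 3), (3, 3)],
    [(1, 2), (2, 2), (3, 3)],
]
for k, (low, mean, high) in mlg_resampling(toy_table, n_resamples=100).items():
    print(k, low, mean, high)
```

The number of loci at which the minimum count reaches the number of genotypes in the sample gives an estimate of how many markers are needed to discriminate all of them.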
For both collections of genotypes, genetic distances, based on the proportion of shared alleles (Bowcock et al., 1994), were calculated as means over 1,000 bootstrap resamplings in the Microsat software, version 1.5 (Stanford University, Stanford, CA, USA). Cluster analysis was done using the unweighted pair group method with arithmetic average (UPGMA) in the Mega software, version 5.05 (Tamura et al., 2011).
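As an illustration of these two steps, the sketch below computes a shared-allele distance and a naive UPGMA clustering in Python. It assumes the distance is taken as one minus the proportion of shared alleles averaged over loci, and it omits the bootstrap averaging. It does not reproduce the Microsat or Mega implementations, and all names and toy genotypes are hypothetical.

```python
from collections import Counter


def shared_allele_distance(ind_a, ind_b):
    """1 - proportion of shared alleles between two diploid multilocus genotypes."""
    shared = 0
    for locus_a, locus_b in zip(ind_a, ind_b):
        # Multiset intersection counts the alleles the two genotypes have in common.
        shared += sum((Counter(locus_a) & Counter(locus_b)).values())
    return 1 - shared / (2 * len(ind_a))


def upgma(labels, distance):
    """Naive UPGMA: repeatedly merge the two clusters with the smallest
    average pairwise distance between their members.

    Returns the merges as (members_a, members_b, average_distance).
    """
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                avg = sum(distance(a, b) for a, b in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        merges.append((clusters[i], clusters[j], avg))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical genotypes scored at two loci.
toy = {
    "A": [(1, 1), (2, 2)],
    "B": [(1, 1), (2, 2)],
    "C": [(1, 1), (2, 3)],
    "D": [(4, 4), (5, 5)],
}
merges = upgma(list(toy), lambda a, b: shared_allele_distance(toy[a], toy[b]))
for members_a, members_b, height in merges:
    print(members_a, members_b, height)
```

For larger panels, an optimized routine such as `scipy.cluster.hierarchy.linkage` with `method="average"` performs the same average-linkage clustering.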

Results and Discussion
Ninety-two primer pairs amplified 93 loci in the collection of 20 genotypes. From this total, 38 loci were polymorphic, totaling 40.8% (Table 2). The polymorphic loci presented a total of 84 alleles, with 2.2 alleles per locus on average, ranging from two to four alleles. The most polymorphic marker was BNL1551, which presented four different alleles. The highest PIC values (≥0.5) were those calculated for the CIR055, CIR165, and CIR249 loci, and the PIC average was 0.374. However, BNL1551, CIR249, and JESPR153b showed the highest DP values (>0.9), and the DP average was 0.433. The PIC values show the degree of marker informativeness within the latifolium race, which was lower than that obtained using different cotton species (Liu et al., 2000; Lacape et al., 2007). However, DP seems to be more useful than PIC to select primer pairs for varietal identification, since it considers the number of evaluated individuals and shows the probability of randomly selected genotypes being discriminated by each marker.
Exclusive alleles indicate a certain differentiation and can be used as a direct tool for varietal identification, as well as to check genetic purity (Schuster et al., 2006) and hybridization for that specific genotype (Selvakumar et al., 2010). Two loci, BNL786 and JESPR153b, revealed exclusive alleles for the CNPA 5052 inbred line, whereas two other loci, BNL1551 and CIR246, revealed exclusive alleles for the BRS Cedro commercial variety. The commercial variety BRS Seridó, the only one in the collection developed to be planted in Northeast Brazil, showed exclusive alleles revealed by BNL3994 and CIR105.
In addition, JESPR153 amplified two polymorphic loci, each one presenting sets of alleles with different sizes, both identified on previous cotton linkage maps (Ali et al., 2009), as a consequence of cotton allopolyploidization and the presence of homeologous loci from the A and D genomes (Lacape et al., 2009).
The individual observed heterozygosity of the collection of 20 genotypes was relatively high (Table 3), with a mean of 4.8%. The rate expected for advanced lines, Hi = 0.0%, was observed in five genotypes, whereas two genotypes unexpectedly showed Hi = 16.2%. The individual observed heterozygosity of 15 plants was relatively high, i.e., Hi > 2% (Table 3), which was not expected since the plants are lineages and, therefore, supposed to be derived from successive self-pollinations. The occurrence of heterozygotes among advanced lines may be explained by a limited number of self-pollinations, by the release as varieties of genotypes selected in the first steps of the breeding program (Lacape et al., 2007), or by pollen contamination in seed production fields. Heterozygotes should facilitate individual identification, although they were not expected to be found very often within lineages or cultivars.
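A brief Python sketch shows how Hi can be computed and why few selfing generations leave residual heterozygosity. It assumes the textbook expectation that heterozygosity halves with each generation of self-pollination, without selection or outcrossing. The function names, the treatment of missing data, and the example values are hypothetical.

```python
def observed_heterozygosity(individual):
    """Percentage of heterozygous loci for one individual.

    Each locus is a tuple of two alleles; None marks a locus that was not
    scored (how missing data were handled in the study is not stated, so
    skipping them here is an assumption).
    """
    scored = [genotype for genotype in individual if genotype is not None]
    heterozygous = sum(1 for a, b in scored if a != b)
    return 100 * heterozygous / len(scored)


def expected_heterozygosity_after_selfing(h0, generations):
    """Expected heterozygosity after successive selfing, halving each generation."""
    return h0 * 0.5 ** generations


# Hypothetical line scored at four loci, one of them heterozygous.
line = [(150, 150), (200, 204), (98, 98), (310, 310)]
print(observed_heterozygosity(line))

# Starting from a fully heterozygous F1 (100%), five selfing generations.
print(expected_heterozygosity_after_selfing(100, 5))
```

Under this expectation, a line fixed after only a few selfing generations can still carry a few percent of heterozygous loci.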
The number of markers required for varietal discrimination was 19, estimated by resampling 38 polymorphic loci from the collection of 20 genotypes; as the number of loci approaches 19, asymptotic behavior is observed (Figure 1A). The resampling strategy allowed testing random combinations of increasing numbers of markers. This generated the minimum, average, and maximum number of discriminated MLG for each number of sampled loci, ensuring that the chosen set of loci allows for a good estimate of the real number of MLG in the analyzed sample. Because loci are resampled at random, in theory, any subset of the 38 polymorphic loci could be used for exclusive identification of the collection of 20 genotypes.
The number of loci needed for varietal discrimination was confirmed through 17 selected markers evaluated in the collection of 96 cotton genotypes (Figure 1B). Out of these 17 markers, the following 12 were selected based on the polymorphism screening performed in the present study, considering the highest PIC and DP values: BNL2499, BNL3482, CIR030, CIR055, CIR081, CIR099, CIR165, CIR170, CIR249, CIR373, JESPR153, and JESPR292. Moreover, five SSRs evaluated in previous studies were also included in the present study due to shared specific interests: BNL3661, for resistance to the root-knot nematode, Meloidogyne incognita (Gutiérrez et al., 2010); CIR316, for resistance to M. incognita (Ulloa et al., 2010); JESPR101, related to four production or fiber traits (Zhang et al., 2013); JESPR110, related to fiber traits (Wang et al., 2011); and JESPR304, for resistance to Fusarium oxysporum wilt (Wang et al., 2009). Considering that four primer pairs (BNL3661, CIR165, CIR316, and JESPR101) amplified two distinct loci each in the collection of 96 genotypes, 21 loci were produced for varietal identification, i.e., two more than recommended for the collection of 20 genotypes. Bootstrap resampling over the 21 polymorphic loci from the collection of 96 genotypes indicated that these were sufficient to identify each variety (Figure 1B). For 19 loci, a minimum of 94 genotypes could be [...]
For the collection of 20 genotypes, the genetic distances measured by the 38 polymorphic loci ranged from 0.092 to 0.641, with an average of 0.397 (Figure 2). The smallest genetic distance was observed between the GO 8022 and CNPA 2571 lines, and the largest between the BA33 line and the BRS Buriti commercial variety. For the collection of 96 genotypes, the average genetic distance was 0.387, ranging from 0 to 0.786. The most divergent genotype pair was Hopiacala (USA) and Silvermine (unknown origin), whereas the least divergent ones were Roella (unknown origin) and SA 2628 (USA), and Coodetec 403 and Epamig 4 (both from Brazil). The analysis of the calculated genetic distance distributions of all genotype pairs showed a higher frequency of distances in the classes from 0.4 to 0.5 (Figure 2), representing 78 and 64% of the results for the collections of 20 and 96 genotypes, respectively.
Varieties selected in the same region were not placed in separate clusters in the grouping pattern shown by the UPGMA dendrograms (Figures 3 and 4). This may be explained by the genealogy of the lines, since some of them originated from the same crossings in the same breeding program and were then distributed to be evaluated in various regions. Another factor to be considered in this case is the similarity of selection parameters among the regions, as well as the lack of association between those selection parameters and the SSR markers. Since distances were small, all genotypes from both collections could be considered a single group; for the collection of 20 genotypes, lineages could be subgrouped into two clusters based on a distance of 0.22, one including 4 genotypes and the other the 16 remaining genotypes (Figure 3).
The recent discovery and advances in the analysis of polymorphism in cotton SSRs (Blenda et al., 2006; Chen & Du, 2006; Kebede et al., 2007; Lacape et al., 2007) led to the choice of highly polymorphic markers. Microsatellites may be chosen instead of markers based on random PCR amplification because they are relatively easy to reproduce and their location in the genome can also be determined (Blenda et al., 2006). Furthermore, microsatellites are easy to use and cost-effective, in comparison to SNP high-throughput technologies, because they are multiallelic (Gupta et al., 2005) and only a small number of markers are required for the analyses. Genomic SSRs are also more recommended than genic-derived SSRs, because gene regions tend to be less polymorphic (Kalia et al., 2011).
The usefulness of the markers is shown by the relatively high polymorphism level obtained (40.8%). Polymorphic primer pairs were distributed along 22 of the 26 amphidiploid cotton chromosomes; these markers were equally distributed on the chromosomes of both the A and D genomes, with 18 and 20 polymorphic loci, respectively. An equivalent distribution in diversity among the A and D genomes was also observed by Lacape et al. (2007), and seems to contradict a previous belief that the D genome would be more diverse than the A genome (Adams & Wendel, 2004).
Figure 4. Clustering assessment obtained by the unweighted pair group method with arithmetic average (UPGMA), based on the genetic distances among 96 worldwide cotton genotypes.
The cultivated cotton race, G. hirsutum race latifolium, is less diverse than the other races (Bertini et al., 2006). Lacape et al. (2007) found a relatively low dissimilarity value of 0.20 within latifolium genotypes, but greater than those measured in other DNA marker studies, including those with RFLP (Brubaker & Wendel, 1994) or SSRs (Rungis et al., 2005; Tyagi et al., 2014). A decrease in the diversity of cultivated genotypes, when compared to the wild ones, is mostly related to the selection for crop domestication, which may be accompanied by a dispersal bottleneck (Van de Wouw et al., 2010). For cotton, an additional bottleneck occurred when just a few genotypes were transported from Mexico to the United States during the 19th century, from which the genotypes currently cultivated were derived (Paterson et al., 2004).
The intraspecific polymorphism of the markers may be significant for marker-assisted selection, since breeding programs might have to use some form of monitoring of allelic richness. The genetic base of cultivated cotton is narrow, but can be broadened by the introduction of landraces or exotic germplasm (Van de Wouw et al., 2010). The markers selected in the present study may be used to monitor genetic diversity among Brazilian or foreign genotypes and their crosses, as well as to select the most distant parents for crosses that could foster genetic variance and, consequently, genetic gains, as shown for cotton by Gutiérrez et al. (2002).
The 19 SSRs chosen by permutation and resampling, combined with loci informativeness measures (Arnaud-Haond & Belkhir, 2007), can be used to discriminate upland cotton genotypes, with or without additional trait-linked markers. Genotypic discrimination should be used in germplasm banks and breeding programs to monitor the germplasm bank, to support breeders in variety protection or in monitoring the genetic variability of the genotypes used for crossings in the breeding program, and to understand reports of increased disease susceptibility in crops, as observed in a wheat variety (Simpfendorfer et al., 2013). Furthermore, the use of distinct multilocus genotypes should ensure variety protection in the world seed market and can be extended to experimental or commercial breeding when maximum genetic distances are required, as in the selection of parents for mapping or for other crossing purposes.
2. The geographical region where a commercial genotype is obtained by breeding does not influence clustering by the unweighted pair group method with arithmetic average (UPGMA) in cotton.

Figure 1. Minimum, average, and maximum numbers of distinct multilocus genotypes (MLG) identified by bootstrap resampling of 38 (A) and 21 (B) polymorphic microsatellite loci genotyped in the collections of 20 and 96 upland cotton (Gossypium hirsutum race latifolium Hutch.) genotypes, respectively. Nineteen markers can be successfully used for varietal identification in cotton.

Figure 3. Clustering assessment obtained by the unweighted pair group method with arithmetic average (UPGMA), based on the genetic distances among 20 Brazilian cotton genotypes.