Frequency and distribution of microsatellites from ESTs of citrus

Nearly 65,000 citrus ESTs (Expressed Sequence Tags) were investigated using the CitEST project database. Microsatellites were investigated in the unigene sequences from Citrus spp. and Poncirus trifoliata. Approximately 35% of these non-redundant ESTs contained SSRs. The frequencies of different SSR motifs were similar between Citrus spp. and trifoliate orange. In general, mononucleotide repeats appeared to be the most abundant SSRs in the CitEST database, but we also identified di-, tri-, tetra-, penta- and hexanucleotide repeats. AG/CT and AAG/CTT were the most common dinucleotide and trinucleotide motifs, with frequencies of 54.4% and 25.2%, respectively. Primer sequences flanking SSR motifs were successfully designed and synthesized. After in silico polymorphism analysis, a subset of sixty-eight primer pairs was validated in different Citrus spp. and Poncirus trifoliata. PCR amplification revealed polymorphism in citrus with all tested primer pairs and showed the potential of these markers for linkage mapping. Our study showed that the CitEST database can be exploited for the development of SSR markers that amplify in Citrus spp. and related genera for comparative mapping and other genetic analyses.


Introduction
Microsatellites, or simple sequence repeats (SSR), are arrays of hypervariable short (1-5 bp) repeat motifs that can be found in both coding and non-coding DNA sequences of higher organisms. These single-locus markers are mainly characterized by high frequency, Mendelian inheritance and codominance. During the last decade, microsatellites have proven to be the marker of choice in plant genetics and breeding research because of their variability, ease of use, accessibility of detection and reproducibility (Zane et al., 2002). There are now many well-known examples of initiatives using microsatellites for different plant species, including Citrus sp. (Kijas et al., 1994; Holton et al., 2002; Kantety et al., 2002; Cristofani et al., 2003).
Microsatellite markers are Polymerase Chain Reaction (PCR) based, requiring previous sequence identification, primer design for the conserved flanking regions and amplification of the target repeat. Initially, they were expensive to develop, and library enrichment protocols were widely used to reduce investments. However, new sources of microsatellites, based on large genome sequencing projects, have since been utilized. This was initially limited to species for which databanks existed, but the increase in available DNA sequence information, particularly ESTs (expressed sequence tags), has provided new opportunities for the development of molecular markers for several annual and perennial plant species. Examples are available for Arabidopsis (Delseny et al., 1997), grape (Scott et al., 2000), cereals (Kantety et al., 2002), eucalyptus (Ceresini et al., 2005) and others. More recently in citrus, microsatellites were investigated and characterized from public EST databases (Chen et al., 2006; Dong et al., 2006).
Microsatellites based on EST libraries (EST-SSRs) are powerful tools for research on genetic variation, gene tagging and evolution, mapping and analysis of quantitative traits (Cato et al., 2001; Scott, 2001; Holton et al., 2002). In addition, microsatellites can also be used across species (Scott et al., 2000). EST-derived microsatellites have been observed to have highly conserved flanking sequences among related species. This characteristic can be used to build comparative maps, identify orthologous loci and map genes of known function, such as genes controlling agronomic traits of interest (Kantety et al., 2002; Varshney et al., 2002). Although EST-derived SSRs have been shown to be less polymorphic than those derived from genomic sequences, they have some inherent advantages: they are quickly obtained by electronic sorting, unbiased in their repeat type, present in gene-rich regions of the genome, and normally abundant (Scott, 2001).
Our laboratory, 'Centro APTA Citros Sylvio Moreira', has developed a large citrus EST (CitEST) database, using libraries that represent different physiological conditions and citrus species. Because of the large number of sequences available from the CitEST project, this database was used to search for hypervariable motifs, such as microsatellites. Although most citrus types exhibit clear morphological variation among them, mainly within genus and species, many agronomical traits are difficult to select by conventional techniques, making assistance by molecular markers highly desirable. A potential reason is that most of the desired traits are apparently quantitatively inherited (Cristofani et al., 2003; Novelli et al., 2006). Therefore, in this study, we mined the CitEST database for microsatellites, using an in silico approach for marker development and an in vivo validation of candidate polymorphic markers. We developed a bioinformatic tool, named MarkerXplorer, which uses several publicly available software programs to retrieve and characterize microsatellite loci from the database. Different citrus genotypes, including a zygotic progeny from Rangpur lime (Citrus limonia Osbeck) vs. Swingle citrumelo (Citrus paradisi Macf. x P. trifoliata), were used to validate a set of markers developed in this study.

The CitEST database
The 'Centro APTA Citros Sylvio Moreira' has a project to create and maintain a databank based on ESTs (CitEST) from different physiological conditions and also from different genera and species of citrus. The total number of sequences available is constantly changing, but up to the date of this study,

Sorting sequences through a pipeline
The MarkerXplorer pipeline uses several scripts and executable programs, such as MISA (MIcroSAtellite identification tool; Thiel et al., 2003), Primer3 (Rozen and Skaletsky, 2000) and e-PCR (Schuler, 1997), together with Perl scripts, to identify repeated sequences, to design specific primer pairs and to evaluate, a priori, potential polymorphic markers. To avoid redundancy in further analyses, a clustering analysis with the CAP3 software (Huang and Madan, 1999) was performed beforehand, and the MarkerXplorer pipeline was then run on a multi-FASTA formatted file containing 64,726 assembled sequences representing 54,492 kb.

Microsatellite search
The adjusted parameters for our pipeline looked for mononucleotide motifs larger than 10 repeat units and all other repeats (di-, tri-, tetra-, penta- and hexanucleotides) larger than five repeat units. To identify the relative position of an SSR within a given sequence, we used the strategy adapted by Ceresini et al. (2005), which categorizes microsatellites as initial (I, close to the 5' end), middle (M) or end (E, close to the 3' end). Mononucleotides were excluded from this analysis.
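The search criteria above can be illustrated with a minimal Python sketch. It is not the MISA or MarkerXplorer implementation: the function names are hypothetical, the thresholds are read as minimum repeat counts (10 for mononucleotides, five for all other motifs), and the I/M/E assignment by thirds of the sequence is an assumption, since the exact boundaries are not stated here.

```python
import re

# Assumed reading of the thresholds in the text: minimum number of repeat
# units per motif length (1 = mononucleotide ... 6 = hexanucleotide).
MIN_REPEATS = {1: 10, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5}


def is_periodic(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'AA', 'AGAG')."""
    return motif in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_periodic(motif):
                continue  # already reported under its shorter unit
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def position_class(start, length, seq_len):
    """Assign I, M or E from the SSR midpoint (thirds are an assumption)."""
    midpoint = start + length / 2
    if midpoint < seq_len / 3:
        return "I"
    if midpoint < 2 * seq_len / 3:
        return "M"
    return "E"


example = "TTGC" + "AG" * 6 + "CCGTACGTTAGC" + "A" * 11 + "GT"
ssrs = find_ssrs(example)
classes = [position_class(s, len(m) * n, len(example)) for s, m, n in ssrs]
```

This sketch only detects perfect repeats and does not handle every case of adjacent or overlapping repeats; compound and imperfect microsatellites would need additional logic.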

Primer design
One of the MarkerXplorer output files was a tab-delimited file listing all primer pairs flanking the microsatellites. The length of the amplicons was set to 100-350 bp. Oligonucleotide parameters for Primer3 were set to a length of 18-27 bp with an optimum of 20 bp, a GC content of 20%-80% with an optimum of 50%, a melting temperature (Tm) of 57-63 °C with an optimum of 60 °C, and a maximum Tm difference between primers of 1 °C. From a list of nearly 2,000 potential EST-SSR markers, 68 primer pairs were selected randomly, synthesized by IDT (Coralville, IA, USA) and tested. These oligonucleotides were resuspended to 100 μM and tested on all the different genotypes described below.
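As an illustration, the settings above could be written as a Primer3 input record along the following lines. This is a sketch, not the MarkerXplorer code: the tag names follow the Boulder-IO conventions of Primer3 2.x and are an assumption here, since the version used in this study may have used different names, and `boulder_record` is a hypothetical helper.

```python
# Primer3 settings taken from the text; tag names assume Primer3 2.x.
PRIMER3_SETTINGS = {
    "PRIMER_PRODUCT_SIZE_RANGE": "100-350",  # amplicon length (bp)
    "PRIMER_MIN_SIZE": 18,
    "PRIMER_OPT_SIZE": 20,
    "PRIMER_MAX_SIZE": 27,
    "PRIMER_MIN_GC": 20.0,
    "PRIMER_OPT_GC_PERCENT": 50.0,
    "PRIMER_MAX_GC": 80.0,
    "PRIMER_MIN_TM": 57.0,
    "PRIMER_OPT_TM": 60.0,
    "PRIMER_MAX_TM": 63.0,
    "PRIMER_PAIR_MAX_DIFF_TM": 1.0,  # maximum Tm difference within a pair
}


def boulder_record(seq_id, sequence, ssr_start, ssr_length,
                   settings=PRIMER3_SETTINGS):
    """Build one Boulder-IO record asking for primers that flank the SSR.

    How ssr_start is interpreted depends on Primer3's base-index setting.
    """
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={sequence}",
        f"SEQUENCE_TARGET={ssr_start},{ssr_length}",
    ]
    lines += [f"{tag}={value}" for tag, value in settings.items()]
    lines.append("=")  # record terminator
    return "\n".join(lines) + "\n"


record = boulder_record("est_example", "ACGT" * 50, 80, 12)
```

The resulting text could then be passed to `primer3_core` on standard input; the sequence identifier and coordinates shown are placeholders.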

Functional characterization
Functional annotation of Citrus markers was obtained from GenBank using the blastX algorithm against the nr database (Altschul et al., 1997), and the sequences were further classified by gene ontology (Ashburner et al., 2000). GO terms were extracted from the best homologous hit. The AmiGO term browser was used to find molecular function, cellular component and biological process ontologies for these sequences.

Frequency and distribution of EST-SSRs in the CitEST database
Using the MarkerXplorer pipeline, we obtained a detailed analysis of the frequency and distribution of all mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats from the six Citrus species and one Poncirus species. A set of 64,726 clustered sequences, with average lengths ranging from 586 bp (CL) to 891 bp (CS), was screened, and 21,584 sequences (33.3%) containing 27,656 non-redundant SSRs were identified (Table 1). Considering that approximately 54.5 Mb were analyzed, we detected a frequency of at least 1 SSR per 1.97 kb in the expressed fraction of the citrus genome. Among the SSR-containing sequences, an average of 22.3% contained more than 1 SSR, with C. sinensis and P. trifoliata showing the highest frequencies, 24.0 and 23.4%, respectively (Table 1).
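The two summary figures above follow directly from the counts reported in the text, as the short check below shows. The variable names are illustrative only; the 54,492 kb total is the value given in the pipeline description.

```python
# Counts reported in the text.
total_sequences = 64_726   # clustered sequences screened
ssr_sequences = 21_584     # sequences containing at least one SSR
total_ssrs = 27_656        # non-redundant SSRs identified
total_kb = 54_492          # total sequence length analyzed (kb)

# Share of sequences with at least one SSR, in percent.
ssr_sequence_share = 100 * ssr_sequences / total_sequences

# Average sequence length per SSR, in kb.
kb_per_ssr = total_kb / total_ssrs

print(f"{ssr_sequence_share:.1f}% of sequences contain an SSR")
print(f"1 SSR per {kb_per_ssr:.2f} kb")
```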
Of those EST-SSRs identified by the MarkerXplorer pipeline, 3,030 (10.9%) were represented by compound microsatellites. Despite the differences in the total number of examined sequences among the seven species, frequencies of compound microsatellites were very similar, ranging from 9.7% in C. aurantium to 11.6% in C. sinensis (Table 1).
The most frequent microsatellite types were mononucleotide repeats in all seven studied species. Among the other repeats, di- and trinucleotides showed the highest frequencies for all species, ranging from 4.9% in C. limonia to 22.2% in C. sinensis and 2.2% in C. limonia to 11.4% in C. sinensis, respectively. Tetra-, penta- and hexanucleotide repeats were represented in proportions of 0.3 to 0.9%, 0.05 to 0.2% and 0.15 to 0.3%, respectively (Table 2).
To obtain a more detailed analysis of the SSR structure, we used the categorization proposed by Weber (1990) in three classes: pure or perfect repeats, e.g., (AT)n or (CTG)n; imperfect repeats, e.g., (TG)n(N)x(TG)m or (GGC)n(N)x(GGC)m; and compound repeats, e.g., (GT)n(AT)m, (ATC)n(GCG)m or (CG)n(AAT)m. For the seven citrus species, we observed a proportion of 89.1% of perfect, 7.2% of compound and 3.7% of imperfect repeats. When considering only perfect repeats, we observed variations in the number of repeat units per microsatellite type and species. The maximal length of the six SSR types ranged from 77 (C. latifolia) to 148 (P. trifoliata) units for mononucleotides; 20 (C. limonia) to 34 (C. sinensis) units for dinucleotides; 11 (C. aurantium) to 28 (C. sinensis) units for trinucleotides; six (C. limonia) to 11 units for tetranucleotides; five (C. reticulata) to nine (C. sinensis) units for pentanucleotides; and six (C. aurantium, C. limonia and P. trifoliata) to 13 (C. reticulata) units for hexanucleotides.
We also examined and labeled the relative position of the di-, tri-, tetra-, penta- and hexanucleotide repeats identified by MarkerXplorer in the seven species. From this analysis we found that about 58.0% of all microsatellites are located closer to the 5' end (I), while 23.0% are positioned near the 3' end (E) and 19.0% are distributed along the middle region (M). Among the seven citrus species screened here, C. limonia and C. reticulata showed the most discrepant values, with 39.7 and 62.8% of the microsatellites categorized as I, 28.7 and 17.6% as M, and 36.6 and 19.6% as E, respectively.
According to their potential as genetic markers, Temnykh et al. (2001) classified microsatellites of different sizes and types into two categories or groups: those larger than 20 repeated units (Class I), and those equal to or larger than 12 but smaller than 20 repeated units (Class II). In this study, 31.0% of the EST-SSRs identified were classified as Class I, while 69.0% were Class II. Sweet orange showed the highest number of di- and trinucleotides in both classes (Table 3).
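The two-class scheme can be expressed as a small helper, sketched below under stated assumptions. The thresholds are applied to whatever length measure the classification uses (the text above states them in repeated units). Because "larger than 20" leaves a length of exactly 20 unassigned, this sketch places it in Class I, which is an assumption. The function name is hypothetical.

```python
def temnykh_class(length):
    """Classify an SSR by length following the thresholds given in the text.

    Assumption: a length of exactly 20 is treated as Class I.
    Returns None for SSRs shorter than the Class II lower bound.
    """
    if length >= 20:
        return "Class I"
    if length >= 12:
        return "Class II"
    return None


# Tally the classes for a few illustrative (made-up) lengths.
example_lengths = [25, 12, 19, 20, 11]
labels = [temnykh_class(n) for n in example_lengths]
```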

Marker development and in silico polymorphism detection
A list of 1,918 primer pairs was obtained from the MarkerXplorer pipeline. Of these, 758 showed in silico polymorphism ranging from 2 to 10 bp when analyzed against sequences deposited in the CitEST database.
To validate microsatellite markers obtained from the CitEST database using the MarkerXplorer pipeline, we synthesized 68 primer pairs, which were used successfully to amplify PCR products of the expected size in accessions of Citrus spp. and the related genus Poncirus. Polymorphism was revealed by all primer pairs in the citrus genotypes tested. To analyze the potential of these markers for mapping studies, all functional loci were used in a Rangpur Lime vs. Citrumelo Swingle progeny. Twenty-two of these primer pairs (32.0%) were able to reveal polymorphisms with allelic segregation for mapping (Figure 2).

Functional annotation of the EST sequences containing SSRs
Blast searches against the NCBI database were performed for each of the 68 clones for which primer pairs had been developed. Forty-three sequences (63.2%) matched several known proteins, while 19 (27.9%) had homology with expressed, hypothetical and unknown proteins from Arabidopsis thaliana, Cicer arietinum and Oryza sativa. The remaining six sequences (8.8%) produced no hits with any known protein (Table 4).
The gene ontology categorization of the 62 sequences that showed some degree of homology revealed that 51 (~82%) of them had a protein match. From these, 40 (82.3%) were homologous to proteins with molecular functions, mainly binding (DNA, RNA and ion), catalytic (protein kinase, lyase and hydrolase) and transporter activities (Figure 3). Forty-three (84.3%) were homologous to proteins involved with the cellular component, mainly nucleus, organelles (chloroplast and mitochondrion) and membrane (Figure 3). Finally, 42 (82.3%) were classified as involved in biological processes such as cellular metabolism (nucleotide-excision repair, protein biosynthesis, protein amino acid phosphorylation and proteolysis), transcription (mainly regulation), transport (sucrose and carbohydrate), response to stimulus (oxidative stress, water and desiccation) and cell organization and biogenesis (Figure 3).

Discussion
The CitEST project has generated a large set of EST sequences from citrus, an excellent resource for rapid discovery of SSRs. Our study clearly illustrates that ESTs are a useful source of new SSR markers for citrus that are polymorphic and can be transferred between species and related genera. A number of reports have demonstrated that EST databases are a very good source of polymorphic markers for many organisms, including plants (Delseny et al., 1997; Scott et al., 2000; Kantety et al., 2002; Ceresini et al., 2005).
The MarkerXplorer pipeline used here was able to screen the CitEST data for all SSR motifs as well as to identify, in silico, a large number of primer pairs with a high potential for germplasm characterization and genetic mapping studies. Bioinformatics approaches are increasingly being used for molecular marker development since the sequences from many genomes are made freely available in public databases (Kantety et al., 2002; Varshney et al., 2002). These sources are mined for SSRs using computational tools, thereby eliminating the need for costly, laborious and time-consuming marker development.
Our results show that SSRs in the CitEST database are highly abundant (33.3%) when compared with other crops (Scott et al., 2000; Kantety et al., 2002). In other citrus databases, a frequency of 10.6% of EST sequences with at least one SSR was observed (Chen et al., 2006), and a total of 21.7% was found in a citrus unigene analysis (Dong et al., 2006). Rangpur Lime was the species with the highest number of SSR motifs identified (64.7%). This result should facilitate its use in mapping experiments geared toward molecular breeding. Sweet orange had the highest number of sequences analyzed, but this was not reflected in a larger number of microsatellites detected.
The dinucleotides AG/CT were the most abundant microsatellites in EST sequences, which is consistent with previous surveys of SSR repeats in annual species, as described by Kantety et al. (2002) in a comparative analysis using publicly available EST databases for barley, maize, rice, sorghum and wheat. In fact, this motif was also observed in perennial crops, such as eucalyptus (Ceresini et al., 2005), apple (Newcomb et al., 2006), strawberry (Folta et al., 2005) and citrus (Chen et al., 2006; Dong et al., 2006; Novelli et al., 2006). In general, adenine-rich repeat motifs are the most common in SSRs and, in the CitEST database, these motifs (AAG, AAT, AAAT, AAAG, AAAAT) were also the most abundant (Figure 1). Similar results were reached in other analyses of EST-SSRs in citrus (Chen et al., 2006; Dong et al., 2006) and apple (Newcomb et al., 2006). Additionally, trinucleotide repeats were well represented among the different classes and showed the highest level of polymorphism in citrus species (data not shown).
The origin and functional role of microsatellites in expressed sequences are not well understood, but they presumably originate from single or multiple mutational events (Zane et al., 2002). Perfect SSRs were the most frequent type in our study, followed by compound and imperfect types. This suggests a certain degree of sequence stability and conservation in the recent evolution of citrus species. The exploration of perfect SSRs in the CitEST database may be a valuable tool to study the evolution of proteins in citrus, since this SSR type is common in many proteins (Katti et al., 2000). Additionally, microsatellite loci with a high number of perfect repeats are usually more polymorphic (Weber, 1990).
The position where an SSR motif occurs in the EST sequence (middle region, 5' or 3' end) can influence the level of polymorphism of a marker (Scott et al., 2000). In our analysis, the 5' end region (I) showed the highest number of repeats, but the level of polymorphism among the species was not estimated in this study. High-quality amplification products were generated for all EST-SSRs evaluated in citrus species, many of which were polymorphic for at least one species, showing that EST sequences are a useful source of markers for Citrus and related genera and could therefore be related to agronomically important traits in genetic mapping.
Specific primer pair design from EST sequences in the CitEST database was more efficient than primer design from the genomic libraries previously developed by our group (Novelli et al., 2000, 2006). However, we did not test the hypothesis that microsatellites derived from ESTs are less polymorphic than those derived from genomic libraries, as demonstrated by several reports (Scott et al., 2000; Thiel et al., 2003).
The screening of EST-SSRs for polymorphism assessment using an F1 progeny of the Rangpur Lime vs. Swingle Citrumelo cross showed that 22 of the 68 amplified fragments had informative segregation configurations (Figure 2). The markers could be mapped in Rangpur Lime and Swingle Citrumelo according to the pseudo-testcross strategy. This is an encouraging result considering that the number of informative SSR markers is still limited in citrus due to several characteristics related to the biology of these species (Chen et al., 2006; Novelli et al., 2006).
Although 82% of the EST-SSR markers presented herein showed homology with known gene products, no specific pattern of association with a specific component, process or function was detected. As observed in other works, the majority of transcripts detected represent enzymes of general metabolism (Ceresini et al., 2005; Folta et al., 2005; Newcomb et al., 2006). However, those transcripts related to biological processes such as response to biotic and abiotic stresses can now be readily mapped using existing populations.
In conclusion, our analysis revealed that the CitEST database is a valuable source for rapidly developing new SSR markers that are highly transferable for diversity and mapping studies in citrus species and related genera. These data could also be applied in citrus breeding programs for germplasm characterization, screening of zygotic and nucellar seedlings, and developing markers for marker-assisted selection.

Figure 1 - Frequency of the most common SSR motifs in CitEST database.

Figure 3 - Citrus EST-SSRs characterization as derived from Gene Ontology categories.

Table 1 - Abundance of EST-SSRs in the seven citrus species from CitEST database.

Table 2 - Frequency distribution (%) of EST-SSRs based on motif size for each species. Numbers within parentheses represent absolute number of microsatellites.

Table 3 - n (%) of the different SSR types (perfect repeats only) according to their potential as genetic markers. Definitions of these classes followed the pattern reported by Temnykh et al. (2001).

Table 4 - Description of 68 EST-SSRs from the CitEST database.