Identification and validation of novel EST-SSR markers in olives

The olive (Olea europaea L.) is a leading oil crop in the Mediterranean area. Limited information on the inheritance of agronomically significant traits hinders progress in olive breeding programs, which encourages the development of markers linked to these traits. In this study, we report on the development of 46 olive simple sequence repeat (SSR) markers, obtained from 577,025 expressed sequence tags (ESTs) from developing olive fruits, generated in the framework of the Slovenian national olive transcriptome project. Sequences were de novo assembled into 98,924 unigenes, which were then used as a source for microsatellite searching. We identified 923 unigenes that contained 984 SSRs, among which dinucleotide SSRs (36 %) were the most abundant, followed by tri- (33 %) and hexanucleotide (21 %) SSRs. GA (37 %) was the most common dinucleotide repeat motif, while GAA was the most abundant trinucleotide repeat motif (16 %). Gene ontology annotations could be assigned to 27 % of the unigenes. A hundred and ten expressed sequence tag-derived simple sequence repeats (EST-SSRs) with annotated genes were selected for primer design and, finally, 46 (42 %) polymorphic EST-SSRs were successfully amplified and used to validate genetic diversity among 24 olive varieties. The average number of alleles per locus, observed heterozygosity, expected heterozygosity, and polymorphic information content were 4.5, 0.649, 0.604 and 0.539, respectively. Twenty-seven EST-SSRs showed good diversity properties and were recommended for further olive genome investigation.


Introduction
Olive (Olea europaea L.) production is the most important agricultural branch in the Mediterranean basin, and olive oil is the main source of fats in the wealthy Mediterranean diet. In addition to high levels of monounsaturated fatty acids, it contains biologically active molecules, including biophenols, squalene, tocopherols, and phytosterols, which have many positive effects on human health (Ali Hashmi et al., 2015; Ghanbari et al., 2012) and are the major contributors to the unique taste of the oil (Cicerale et al., 2009). Consequently, the investigation of genes responsible for the synthesis of these molecules is essential to improve olive oil quality and develop new varieties.
Breeders and the olive industry are currently focused on identifying high-performance genotypes, while the increasing need for a diversity of varieties that are well adapted to changes in the environment, cultivation conditions and consumer requirements dictates the development of new olive varieties (Lavee et al., 2014). For these reasons, the use of new technologies, including new molecular markers, is essential for breeding success and for validating the authenticity and traceability of primary products entering the agro-food chain (Corrado, 2016; Pasqualone et al., 2016).
With the improvement and development of next generation sequencing technology, expressed sequence tags (ESTs) are publicly available and represent a very useful tool for gene and marker discovery, which is attractive for gene mapping, functional studies, genome annotation and comparative genomics (Ozgenturk et al., 2010; Rudd, 2003). Among them, genic microsatellites, or expressed sequence tag-derived simple sequence repeats (EST-SSRs), have found a special place in plant genetics and breeding of several agricultural plants (Kalia et al., 2011; Varshney et al., 2005). However, only a few transcriptome projects on the generation of ESTs in olive have been completed recently (Alagna et al., 2009; Muñoz-Mérida et al., 2013; Ozgenturk et al., 2010; Rešetič et al., 2013), and a limited number of EST-SSRs are currently available in olives (Adawy et al., 2015; De la Rosa et al., 2013; Essalouh et al., 2014). The main purpose of this study was to increase the number of validated EST-SSRs, for the benefit of all interested research groups, via the identification of simple sequence repeats (SSRs) from the transcripts of developing olive fruits of the variety "Istrska belica". The validation and putative functional annotation of a new set of EST-SSRs and their applicability in an olive diversity study are reported in this paper.

Identification of EST-SSRs and primer designing
Sci. Agric. v.74, n.3, p.215-225, May/June 2017
A total of 577,025 ESTs from developing fruits of the "Istrska belica" olive variety, with an average length of 241 bp, were generated using the 454 pyrosequencing methodology (Rešetič et al., 2013) and were assembled by iAssembler v.1.2.2 (Zheng et al., 2011) into 98,924 unigenes. All cleaned EST sequences (after removal of taxonomically nonspecific reads and reads shorter than 200 bp) have been deposited at the SRA database (http://www.ncbi.nlm.nih.gov/sra/SRX215662), and the Transcriptome Shotgun Assembly project has been deposited at DNA Data Bank of Japan/European Molecular Biology Laboratory (EMBL) Data Library/GenBank under the accession number GDUL00000000 (http://www.ncbi.nlm.nih.gov/nuccore/GDUL00000000). In the current study, all 98,924 assembled sequences were used as a source for microsatellite identification. The mining of microsatellites was performed using the Perl script MIcroSAtellite (MISA) (http://pgrc.ipk-gatersleben.de/misa). The main criterion for SSR identification was the minimum length, that is, eight repeat units for dinucleotide motifs, six repeat units for trinucleotide and tetranucleotide motifs, and four repeat units for pentanucleotide and other higher-order repeats. Sequences containing SSRs longer than 20 nucleotides were first reviewed with the use of Tablet (a next generation sequence assembly viewer) (Milne et al., 2010) in order to exclude all SSR-containing sequences that were inappropriate for primer design according to the following criteria: (a) a very short DNA sequence flanking the microsatellite (less than 30 bases) or (b) the microsatellite repeat was used by the assembler as the overlapping part between adjacent reads, in which case the contig may be chimeric. Sequences were then aligned against the National Center for Biotechnology Information (NCBI) non-redundant (nr) protein database using the BLASTX algorithm to determine the putative function (E < 1e-10). Only SSR-containing transcripts with annotated genes were identified as candidates for SSR marker development. The Primer3 v.4.0.0 tool (Koressaar and Remm, 2007; Untergasser et al., 2012) with default parameters was used to design 110 primer pairs. The single criterion for primer design was the length of the microsatellite sequence (150-200 bp). The shorter primer of each pair was extended at the 5' end with the 18 bp M13(-21) sequence (5'-TGTAAAACGACGGCCAGT-3') for economic fluorescent labelling (Schuelke, 2000). Integrated DNA Technologies (IDT) synthesized all primers. GenBank accession numbers, locus names, primer sequences, repeat motifs, SSR locations, annealing temperatures, size ranges and putative functions are listed in Table 1.
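The minimum-repeat criteria described above can be illustrated with a short regular-expression sketch. This is not the MISA script itself and does not reproduce its options; the function name `find_ssrs`, the restriction to perfect repeats, the upper motif length of six nucleotides, and the `flanks_ok` flag (both flanks at least 30 bases) are assumptions made for illustration only.

```python
import re

# Minimum repeat counts per motif length, following the criteria in the text:
# di >= 8, tri and tetra >= 6, penta and hexa >= 4 (hexa as upper limit is an assumption).
MIN_REPEATS = {2: 8, 3: 6, 4: 6, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS, min_flank=30):
    """Return perfect SSRs found in seq as a list of dicts sorted by start position."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one repeat unit; \1{n,} requires n further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "GAGA").
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            start, end = m.span()
            hits.append({
                "motif": motif,
                "repeats": (end - start) // unit_len,
                "start": start,
                "end": end,
                # Hypothetical flag: both flanking regions are at least min_flank bases.
                "flanks_ok": start >= min_flank and len(seq) - end >= min_flank,
            })
    return sorted(hits, key=lambda h: h["start"])
```

Compound and imperfect microsatellites, which MISA can also report, are deliberately left out of this sketch.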
All 110 developed primer pairs were used for initial screening in the genotyping procedure of eight olive varieties. The optimal annealing temperature (Ta) was experimentally determined for each locus, with five different Ta values tested. The highest annealing temperatures (Ta1) in touchdown polymerase chain reaction (PCR) were set at 60 °C, 58 °C, 55 °C, 53 °C and 50 °C. Loci with unstable PCR amplification were additionally optimized by raising the DNA concentration and by increasing the number of cycles in the second step of the amplification. Finally, the amplification of SSRs was performed in a total volume of 15 μL, containing 1 × supplied PCR buffer, 2 mM MgCl2, 0.2 mM of each deoxynucleotide (dNTP), 0.2 μM of each locus-specific primer, with one primer of each pair extended with the M13(-21) universal sequence (Schuelke, 2000), 0.25 μM of M13(-21) universal primer labelled with 6-FAM, VIC, PET or NED, 0.375 unit of Taq DNA polymerase (Thermo Fisher Scientific, Waltham, USA) and 40 ng of olive DNA. The amplification was performed in a thermal cycler, and the conditions of the two-step PCR were as follows: 94 °C for 5 min, then 5 cycles of 45 s at 94 °C, 30 s at Ta1 (Table 1), which was lowered by 1 °C in each cycle, and extension at 72 °C for 1 min 30 s. The second step of amplification consisted of 35 cycles of 30 s at 94 °C, 30 s at the lowest annealing temperature (Ta2) (Table 1) and 1 min 30 s at 72 °C, followed by a final extension at 72 °C for 10 min. Separation of amplified SSRs was performed in a 3130 Genetic Analyzer, using the 500 LIZ size standard. Data were analyzed with GeneMapper v.4.1 software.
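The two-step touchdown program described above can be written out as an explicit list of steps, which makes the cycle structure easier to check. The sketch below is illustrative only: the function name `touchdown_program` is hypothetical, and the default Ta2 (Ta1 minus one degree per touchdown cycle) is an assumption, since the actual Ta2 values are those given in Table 1.

```python
def touchdown_program(ta1, ta2=None, touchdown_cycles=5, main_cycles=35):
    """Build the two-step touchdown PCR program as (step, temperature in C, seconds) tuples."""
    if ta2 is None:
        # Assumption: Ta2 equals Ta1 minus one degree per touchdown cycle.
        ta2 = ta1 - touchdown_cycles
    program = [("initial denaturation", 94, 300)]
    # First step: annealing temperature starts at Ta1 and drops by 1 C per cycle.
    for cycle in range(touchdown_cycles):
        program += [
            ("denaturation", 94, 45),
            ("annealing", ta1 - cycle, 30),
            ("extension", 72, 90),
        ]
    # Second step: constant annealing at Ta2.
    for _ in range(main_cycles):
        program += [
            ("denaturation", 94, 30),
            ("annealing", ta2, 30),
            ("extension", 72, 90),
        ]
    program.append(("final extension", 72, 600))
    return program
```

For example, `touchdown_program(60)` anneals at 60, 59, 58, 57 and 56 °C in the first five cycles and, under the stated assumption, at 55 °C in the remaining 35 cycles.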
Genetic parameters were calculated for 24 olive varieties over the 46 EST-SSR loci that produced clear fragments after amplification. The observed (Ho) and expected (He) heterozygosity, polymorphic information content (PIC), and the frequency of null alleles (F(null)) were calculated using the CERVUS v.3.0.7 software. IDENTITY v.1.0 software was used to calculate the probability of identity (PI). The effective number of alleles (ne), the number of observed and all possible genotypes and deviations from the Hardy-Weinberg equilibrium (HWE) were calculated using the POPGENE v.1.32 software. The algorithm by Levene (1949) was used for calculating expected genotypic frequencies under random mating, and chi-square (χ²) tests were performed for HWE at each locus. Variety-specific alleles were identified using MICROSAT software. The AMaCAID script (Caroli et al., 2011) was used to determine the minimum number of markers required to distinguish all observed genotypes, including Model 3, and the fixed number of combinations was set at 50,000.
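For orientation, the diversity parameters named above can be computed for a single locus from the textbook formulas. The sketch below is not the CERVUS, IDENTITY or POPGENE implementation: it assumes diploid genotypes, uses the uncorrected He = 1 - Σp², the usual PIC definition, and a probability of identity of the form Σp⁴ + Σ(2pq)²; any small-sample corrections applied by those programs are omitted, and the function name `locus_stats` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes):
    """Diversity parameters for one locus; genotypes is a list of (allele, allele) tuples."""
    n = len(genotypes)
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    freqs = [c / (2 * n) for c in counts.values()]
    sum_p2 = sum(p * p for p in freqs)
    ho = sum(1 for a, b in genotypes if a != b) / n   # observed heterozygosity
    he = 1.0 - sum_p2                                 # expected heterozygosity (uncorrected)
    # PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    # Probability of identity: sum(p_i^4) + sum over i<j of (2 * p_i * p_j)^2
    pi = sum(p ** 4 for p in freqs) + sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return {
        "n_alleles": len(counts),
        "ne": 1.0 / sum_p2,                           # effective number of alleles
        "Ho": ho,
        "He": he,
        "PIC": pic,
        "PI": pi,
    }
```

With two equally frequent alleles, for instance, this gives He = 0.5, ne = 2 and PIC = 0.375, which shows why PIC is always somewhat lower than He at the same locus.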
Genetic distances between the 24 varieties were calculated using Jaccard's coefficient of similarity. A dendrogram was constructed from the similarity matrix using the unweighted pair-group method with arithmetic averages (UPGMA). For the dendrogram, the correlation coefficient between the distance matrix and the cophenetic values matrix was computed to test the goodness of fit of the cluster analysis using the MXCOMP module of the Mantel test (Mantel, 1967). All calculations were performed using the NTSYS v.2.02 software.
[...] cellular component. In addition, Blast2GO was used to assign Kyoto Encyclopedia of Genes and Genomes (KEGG) maps and an enzyme classification number (EC) (Kanehisa and Goto, 2000). For the developed EST-SSRs, the position of the SSR motif in the gene was determined, that is, whether the SSR was located in the coding sequence (CDS), the 3' untranslated region (3' UTR) or the 5' untranslated region (5' UTR).
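The similarity and clustering steps can be sketched in a few lines. This is an illustration rather than the NTSYS procedure: each variety is assumed to be encoded as a set of (locus, allele) pairs, the function names `jaccard` and `upgma` are hypothetical, and the cophenetic correlation test is not included.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of (locus, allele) pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def upgma(names, sim):
    """Average-linkage clustering on similarities.

    sim maps frozenset({x, y}) to the similarity of varieties x and y.
    Returns the merges in order as (cluster_a, cluster_b, mean similarity).
    """
    def avg(c1, c2):
        # Unweighted arithmetic mean over all pairs of members.
        return sum(sim[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges
```

A similarity table for `upgma` can be built with `{frozenset((x, y)): jaccard(profiles[x], profiles[y]) for x, y in combinations(profiles, 2)}`, where `profiles` is a hypothetical dictionary of allele sets per variety.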

Characterization of EST-SSRs and primer designing
A total of 98,924 ESTs (36.8 Mb) from the olive transcriptome assembly (Rešetič et al., 2013) were examined with the MISA tool for microsatellite identification. Of 98,924 ESTs, 923 sequences contained 984 microsatellites. On average, one microsatellite was found in every 37.4 kb of olive ESTs. Of 923 SSR-containing ESTs, 874 (95 %) ESTs contained only one SSR locus, while 49 (5 %) contained more than one SSR locus. Furthermore, 4 % of SSR loci (34 of all identified SSRs) were present in the compound formation.
Of 984 EST-SSRs, 551 EST-SSRs possessing microsatellite sequences longer than 20 bp were further reviewed by the Tablet program. A total of 343 EST-SSRs had flanking regions longer than 30 bases and were not located in the overlapping site within an EST contig. Among these, only 197 EST-SSRs containing di-, tri-, tetra-nucleotides and compound SSRs were selected and blasted against the NCBI nr protein database. A total of 119 successfully annotated EST-SSRs were subjected to primer designing. Finally, 110 EST-SSR primer pairs were designed for further amplification.
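The frequencies quoted above follow directly from the reported counts, as the short calculation below shows. Only the numbers given in the text are used; the variable names are illustrative.

```python
# Counts reported in the text.
assembly_bp = 36.8e6      # total length of the assembled ESTs (36.8 Mb)
n_ests = 98_924           # assembled sequences examined
n_ssr_ests = 923          # sequences containing at least one SSR
n_ssrs = 984              # SSRs identified
n_multi = 49              # sequences with more than one SSR
n_compound = 34           # SSRs in compound formation

kb_per_ssr = assembly_bp / n_ssrs / 1000          # about 37.4 kb per SSR
pct_ests_with_ssr = 100 * n_ssr_ests / n_ests     # about 0.93 % of sequences
pct_multi = 100 * n_multi / n_ssr_ests            # about 5 % of SSR-containing sequences
pct_compound = 100 * n_compound / n_ssrs          # about 3.5 %, reported as 4 %

print(f"{kb_per_ssr:.1f} kb per SSR, {pct_ests_with_ssr:.2f} % of ESTs with SSRs")
```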

Polymorphism and genetic diversity analysis
All 46 developed EST-SSRs were used to assess the polymorphism and genetic diversity of 24 olive varieties. All loci were successfully amplified and 205 different alleles were detected. The number of amplified alleles at each locus varied from two to eight. The average number of alleles per locus was 4.5 and the average number of effective alleles was 3.13. He ranged between 0.042 (OeUP-12) and 0.869 (OeUP-22), with an average of 0.604. The highest Ho (1.000) was found at locus OeUP-22, and the lowest (0.042) was observed at two loci (OeUP-12 and OeUP-43). PI values varied among loci from 0.076 (OeUP-22) to 0.922 (OeUP-12), while the combined PI value calculated for all loci was 2.10 × 10⁻²⁴. PIC values ranged from 0.040 (OeUP-12) to 0.833 (OeUP-22), with a mean value of 0.539. Based on the calculated PIC, 30 newly developed EST-SSRs were classified as informative markers (PIC > 0.5) and nine as suitable markers for gene mapping (PIC > 0.7) (Table 2).
Of 46 EST-SSR loci, 11 loci showed deviation from the HWE (Table 2). Most deviations can be assigned to the excess of one class and, more rarely, of two classes of homozygotes at the analyzed loci, as well as to an excess of one combination of identical alleles observed in a larger set of olive varieties. In nine loci, Ho was lower than He, and in six loci, the estimated frequencies of null alleles were higher than 0.2, the upper boundary below which microsatellite null alleles are considered uncommon to rare (Dakin and Avise, 2004).
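The HWE test referred to above compares observed genotype counts with those expected under random mating. The following sketch shows one common form of that comparison, using Levene's (1949) small-sample expectations, n_i(n_i - 1) / (2(2N - 1)) for homozygotes and n_i n_j / (2N - 1) for heterozygotes, where n_i is the count of allele i among 2N gene copies. It is an illustration and not the POPGENE code; the function name `hwe_chi_square` is hypothetical, and the p-value lookup is left out because it requires a chi-square distribution function.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotypes):
    """Chi-square statistic and degrees of freedom for HWE at one locus.

    genotypes is a list of (allele, allele) tuples for N diploid individuals.
    """
    n = len(genotypes)
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    observed = Counter(tuple(sorted(genotype)) for genotype in genotypes)
    chi2 = 0.0
    for a, b in combinations_with_replacement(sorted(allele_counts), 2):
        na, nb = allele_counts[a], allele_counts[b]
        if a == b:
            expected = na * (na - 1) / (2 * (2 * n - 1))   # homozygote class
        else:
            expected = na * nb / (2 * n - 1)               # heterozygote class
        if expected > 0:
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    k = len(allele_counts)
    return chi2, k * (k - 1) // 2
```

With 24 varieties and up to eight alleles per locus, many genotype classes have small expected counts, so a statistic obtained this way should be interpreted with caution.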
Among 205 alleles detected, 29 were specific to different olive varieties. The allelic polymorphisms allowed discriminating all analyzed varieties. The AMaCAID computer program was used to calculate the minimum number of markers required to distinguish all observed genotypes. All 24 olive genotypes could be differentiated by only two loci (OeUP-04 and OeUP-14).
Jaccard's similarity coefficient was used to calculate the genetic distances among pairwise combinations in a set of 24 olive varieties from the Slovenian olive collection. The highest genetic similarity value (0.72) was observed between the varieties "Leccino" and "Zelenjak". The average similarity coefficient was relatively low (0.39). A dendrogram (Figure 1) was constructed from genetic similarity data and clusters were tested for associations. The cophenetic coefficient was relatively high (0.84) and indicated a good fit of the original data to the clustering.
Olive varieties clustered into related groups in the microsatellite dendrogram. The first cluster, which showed higher genetic similarities with respect to the other groups, contained all Tuscan varieties for oil use ("Leccino", "Leccio del corno", "Pendolino", "Leccione", "Maurino", "Frantoio"). This cluster also contained the Italian variety "Coratina" as well as "Athena" and "Zelenjak", which were probably brought from central Italy to the northern Adriatic coast, as shown in a previous study (Bandelj et al., 2004). "Črnica" and "Štorta", which are known as Slovenian varieties for oil production and table use, respectively, were also close to the Tuscan group. "Ascolana tenera", "Grignan", "Buga", "Samo" and "Oblica" formed a well-defined group. These varieties shared more than 60 % of alleles, and two of them ("Samo" and "Buga") represent the local Slovenian germplasm. The varieties "Arbequina", "Cipressino", "Istrska belica" and "Moraiolo" were placed on the dendrogram at lower similarity values. For these varieties, two or more unique alleles were found.
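The idea of a minimum discriminating marker set can be illustrated with an exhaustive search over locus combinations of increasing size. This sketch is not the AMaCAID script and does not reproduce its models or its sampling of a fixed number of combinations; the function name `minimal_marker_sets` and the input layout (variety name mapped to a dictionary of locus genotypes) are assumptions for illustration.

```python
from itertools import combinations


def minimal_marker_sets(profiles, max_size=None):
    """Return all smallest locus combinations that give every variety a unique profile.

    profiles maps variety name -> {locus name: genotype}, with hashable genotypes.
    """
    loci = sorted(next(iter(profiles.values())))
    max_size = max_size or len(loci)
    for size in range(1, max_size + 1):
        found = []
        for combo in combinations(loci, size):
            # Multi-locus genotypes restricted to the loci in this combination.
            signatures = {tuple(p[locus] for locus in combo) for p in profiles.values()}
            if len(signatures) == len(profiles):
                found.append(combo)
        if found:
            return found
    return []
```

Exhaustive search is feasible for small sets such as two loci out of 46 (1,035 pairs), but the number of combinations grows quickly with set size, which is presumably why a cap on the number of combinations tested is useful in practice.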

Functional annotation of SSR-containing transcripts
In order to assign the putative function of all 923 ESTs containing SSRs, a functional annotation was carried out using the Blast2GO software. A total of 247 ESTs (27 %) were detected as having homology with known proteins, 138 (15 %) were homologous to expressed, hypothetical/unknown/unnamed proteins, while 538 (58 %) SSR-containing sequences showed no significant (E < 1e-10) hits in the BLASTX analysis. During the BLAST step, 109 ESTs were aligned to Sesamum indicum L., while only six sequences corresponded to the Olea europaea L. database.
The three gene ontology categories (molecular function, biological process, and cellular component) were then assigned to the 247 ESTs having homology with known proteins (Figure 2). A total of 265 terms were allocated to molecular function, 671 to biological process, and 233 to cellular component. Under the biological process category, the most abundant ESTs were involved in metabolic (22 %), cellular (21 %) and single-organism processes (18 %). In the molecular function category, most sequences showed functions related to binding (45 %), catalytic activity (39 %) and transporter activity (6 %). The classification of the sequences within the binding group of the molecular function category showed that most ESTs fell under ion binding (34 %) and adenosine triphosphate (ATP) binding (20 %). For the cellular component category, 54 % of the sequences were assigned to the term intracellular, followed by intrinsic component of the membrane (14 %) and cell periphery (9 %). Some of the newly developed and annotated EST-SSRs were related to a lipid biosynthetic process (OeUP-16), cellular response to oxidative stress and osmotic stress (OeUP-25), cell transport (OeUP-01, OeUP-07, OeUP-30, OeUP-42) and embryo development.
In addition, KEGG pathway visualization and EC annotations were also done for the 247 annotated ESTs. In total, 55 pathways, including numerous cellular metabolic and biosynthesis pathways, were fully represented. Since olive is an important oil crop, the research was focused on ESTs with particular relevance in fruit metabolism. Most ESTs encoded the following enzymes: transferases (37 %), hydrolases (33 %) and oxidoreductases (19 %). Specifically, ESTs encoded enzymes for biosynthesis of secondary metabolites and lipids, including fatty acid and steroid biosynthesis, as well as sphingolipid metabolism.

Discussion
The improvement of genetic resources of agricultural plants through molecular breeding programs requires efficient molecular markers in combination with linkage maps and genomics (Jiang, 2013). Recently, a vast number of EST datasets have been generated for many crop plants, which have offered an opportunity to identify and develop numerous functional molecular markers linked to genes or traits of interest. Although the olive tree is one of the most important oil crops in the Mediterranean, only a few tested EST-SSRs are available (Adawy et al., 2015; De la Rosa et al., 2013; Essalouh et al., 2014). This fact encouraged the Slovenian team to focus their research on the development of new olive EST-SSRs and provide high quality and informative markers to the research community.
For the development of EST-SSRs, the olive transcriptome from developing fruits of the variety "Istrska belica" was used. Transcripts were generated from the predominant local variety in Slovenia, which has special organoleptic properties due to high biophenol content and is also known as one of the oiliest varieties in the region. For SSR identification, 98,924 ESTs were examined. Microsatellites were found in approximately 1 % of olive ESTs. Considerably higher frequencies were reported for other fruit species, including 20 % in citrus fruits (Liu et al., 2013), 18 % in peaches (Vendramin et al., 2007) and 11 % in pomegranates (Jian et al., 2012). These differences in frequency and distribution of EST-SSRs can be attributed to the different criteria used to identify SSRs in the database mining, dataset size and database-mining tools (Varshney et al., 2005). Furthermore, variations in the SSR frequency distribution are taxon-specific (Toth et al., 2000) and may reflect differences in the selection and domestication processes (Zhang et al., 2013).
In this study, dinucleotide SSRs were the most abundant (36 %), which is in accordance with results reported for sesame (Zhang et al., 2012) and some Rosaceae species (Jung et al., 2005). However, in most other fruit species and crop plants, trinucleotide repeats have been observed at the highest frequency (Jian et al., 2012; Liu et al., 2013). In olive transcriptome data, the most abundant dinucleotide SSR motif was GA/TC, while the motif GAA/TTC was the most common among trinucleotide SSRs, which supports the findings of Adawy et al. (2015). Furthermore, similar results have also been obtained in studies on citrus fruits (Chen et al., 2006), grapes (Huang et al., 2011), and mangos (Dillon et al., 2014).
After the annotation and assignment of putative functions to olive EST-SSRs, primers were successfully designed for 110 loci. After the optimization of the PCR protocol in a set of eight olive varieties, quality amplicons were obtained for 46 of the developed primer pairs. The lengths of amplified microsatellites were in the expected range and no deviations were found. All 46 primer pairs were used to test their applicability in a diversity study in a set of 24 olive varieties. All 46 EST-SSR loci were polymorphic and 205 different alleles were detected, with an average of 4.5 alleles per locus. Averages for Ho and He were 0.649 and 0.604, respectively. Slightly higher values for observed and expected heterozygosity (Ho = 0.769, He = 0.705) were obtained in a study where 19 varieties, also used in this study, were tested with genomic SSRs (Bandelj et al., 2004). The comparable ability of genomic and EST-SSRs in the detection of genetic diversity in olives was also confirmed by De la Rosa et al. (2013). Furthermore, EST-SSRs may actually prove to be superior to genomic SSRs for diversity estimation and transferability (Gupta et al., 2003) and should be even more useful for developing linkage maps or tagging agronomically important traits (Huang et al., 2011).
A good discrimination power of the new genic markers was demonstrated with the selection of the minimum number of markers needed to distinguish all 24 olive varieties. Two loci (OeUP-04 and OeUP-14) were determined to be sufficient for the unambiguous discrimination of all samples. The high discrimination power of EST-SSRs was also confirmed by a large number of genotypes observed per locus, including unique genotypes, as well as 29 unique alleles detected.
Eleven out of 46 developed EST-SSRs showed a deviation from the HWE. In nine loci, the heterozygosity observed was lower than expected. Since deviations from the HWE at these loci can bias likelihood estimates, care should be taken when using them in analyses of identity and parentage as well as in population studies (Cipriani et al., 2010).
The average PIC value for these markers in the genotypes examined was 0.539, which is higher than those found in citrus genera (mean = 0.450), pomegranate (mean = 0.381) and carob trees (mean = 0.420) (Jian et al., 2012; La Malfa et al., 2014; Liu et al., 2013). Furthermore, 27 newly developed EST-SSRs were classified as informative markers (OeUP-01, OeUP-02, OeUP-04, OeUP-05, OeUP-06, OeUP-08, OeUP-09, ...), and are strongly recommended due to the low frequency of null alleles and no deviation observed from HWE. These 27 EST-SSRs can be used for olive diversity and population studies.
Comparing results of the clustering analysis with those obtained by genomic SSRs (Bandelj et al., 2004) revealed a similar distribution of olive varieties in related groups. A common genetic background was confirmed for all Tuscan olive varieties for oil use. The clustering analysis with new genic markers confirmed that the Slovenian varieties "Zelenjak" and "Črnica" are closely related to Tuscan olives, as previously noted (Bandelj et al., 2004). Both genomic and functional EST-SSRs also showed that the predominant local variety "Istrska belica" had very low genetic similarities with other varieties. These results confirmed the equal ability and functionality of genic SSRs for studies on genetic relationships in olives.
Alignment and functional annotation of all 923 ESTs containing microsatellites were performed using the Blast2GO tool. A total of 247 ESTs had homology with known proteins and were further annotated against the GO database. Ten newly developed EST-SSRs were associated with lipid biosynthetic process, embryo development, and cellular response to stress, while no GO terms were determined for other markers. However, other sequences (138, 15 %) were homologous to expressed, hypothetical/unknown/unnamed proteins, while 538 (58 %) SSR-containing sequences showed no hits.
A relatively high percentage of EST-SSRs without BLASTX hits could be attributed to the protein database used, since EST-SSRs are frequently positioned in UTRs. When reviewing the location of the 46 EST-SSRs, 65 % were positioned in UTRs, and only 35 % were located in the coding region. Liu et al. (2013) have also observed a higher density of di-, tri- and tetra-nucleotide SSRs in UTRs. It has been noted that untranslated regions of mRNAs have crucial roles in many aspects of gene regulation (Mignone et al., 2002). In a study on rice and Arabidopsis, Fujimori et al. (2003) concluded that microsatellites are located at high frequency in the 5'-flanking region of plant genes and potentially act as factors regulating gene expression. In contrast, microsatellite expansion in the 3' UTRs can cause transcription slippage and produce expanded mRNA (Li et al., 2004). An elevated percentage of sequences with no putative function may also be attributed to specifically evolved gene functions and characteristics exclusive to the O. europaea species, as reported by Alagna et al. (2009). The analysis of microsatellite motifs in olive UTRs showed that tri- and di-nucleotides occurred in 35 % and 24 %, respectively. In CDSs, trinucleotide motifs were predominant (33 %). This result is in accordance with previous studies, which showed that selection against frameshift mutations limits the expansion of non-triplet SSRs in CDS regions (Metzgar et al., 2002).
In this study, we have demonstrated that a novel set of 46 EST-SSRs has good diversity properties, and we are convinced that it will be helpful in olive breeding programs and in the construction of linkage maps, and that it will aid in elucidating some biochemical pathways and physiological processes in olives. Furthermore, due to their possible transferability to related species, EST-SSRs can also be used in the field of comparative genomics. All sequencing data and developed primer pairs for genic markers are available to the olive community and all interested research groups through public NCBI databases.

Figure 1 - Phylogenetic tree of 24 olive varieties based on Jaccard's coefficient and the unweighted pair-group method with arithmetic averages (UPGMA).

Figure 2 - Functional annotations of 247 expressed sequence tag-derived simple sequence repeats (EST-SSRs) in olives based on the Blast2GO analysis.

Table 1 - GenBank accession numbers, locus names, primer sequences, repeat motifs, simple sequence repeat (SSR) locations in untranslated region (UTR) or in coding sequence (CDS), the highest (Ta1) and the lowest (Ta2) primer annealing temperatures in touchdown polymerase chain reaction (PCR), size ranges and putative functions for newly developed expressed sequence tag-derived simple sequence repeats (EST-SSRs).

Table 2 - Parameters of genetic variability of each expressed sequence tag-derived simple sequence repeat (EST-SSR) obtained among 24 olive varieties. Observed (Ho) and expected (He) heterozygosity, number of alleles (n), effective number of alleles (ne), polymorphic information content (PIC), probability of identity (PI), probability for deviation from Hardy-Weinberg equilibrium (p-value), estimated frequency of null alleles (F(null)).