The libraries that made SUCEST

A large-scale sequencing of sugarcane expressed sequence tags (ESTs) was carried out as a first step in depicting the genome of this important tropical crop. Twenty-six unidirectional cDNA libraries were constructed from a variety of tissues sampled from thirteen different sugarcane cultivars. A total of 291,689 ESTs were generated by sequencing 259,325 cDNA clones at their 5' end and, for a subset of them, also at their 3' end. After trimming low-quality sequences and removing vector and ribosomal RNA sequences, 237,954 ESTs potentially derived from protein-encoding messenger RNA (mRNA) remained. The average insert size in all libraries was estimated to be 1,250 bp, with insert length varying from 500 to 5,000 bp. Clustering the 237,954 sugarcane ESTs resulted in 43,141 clusters, of which 38% had no matches with existing sequences in the public databases. Around 53% of the clusters were formed by ESTs expressed in at least two libraries, while 47% of the clusters were formed by ESTs expressed in only one library. A global analysis of the clusters indicated that around 33% contain cDNA clones with full-length inserts.


INTRODUCTION
Single-pass sequencing of cDNAs to generate "expressed sequence tags" (ESTs) has proven to be a powerful, economical and rapid approach to identify genes that are preferentially expressed in certain tissues or cell types of multicellular organisms (Adams et al., 1991, Hwang et al., 1997, Liew et al., 1994, Adams et al., 1995). Increasing importance has also been attributed to ESTs as a tool for annotating the complete genome sequences of mammals and plants. Unique ESTs have provided biological evidence for hundreds of predicted genes, newly discovered genes and transcript isoforms, leading to considerable advances in gene identification in multicellular organisms (Andrews et al., 2000). More than ten million ESTs are currently available through the dbEST division of GenBank (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_sumary.html); however, only 14% of dbEST release 022301 of February 23, 2001 corresponds to plant sequences.
ESTs are also useful for obtaining genetic information from species with complex genomes that are difficult to study by conventional genetics. This is the case of sugarcane, an important crop that is cultivated in the tropics for its high sucrose accumulation in the stalk. Among the cultivated crops, sugarcane possesses perhaps one of the most complex genomes (for a review see Grivet and Arruda, 2002). Modern sugarcane cultivars are hybrids derived from crosses between Saccharum officinarum, usually having 2n = 80 chromosomes, and Saccharum spontaneum, with 2n = 40-128 chromosomes. In view of the structural differences between the chromosomes of the two species, the hybrids possess different proportions of chromosomes, varying chromosome sets and complex recombinational events (Grivet and Arruda, 2002). This imposes tremendous difficulties in applying conventional plant breeding techniques to sugarcane.
As a first step in depicting the sugarcane genome, the ONSA consortium (Simpson and Perez 1998) launched in September of 1998 the Sugarcane Expressed Sequence Tag project (SUCEST), aiming at sequencing random ESTs and identifying around 50,000 unique genes (http://sucest.lad.ic.unicamp.br/en/).
To improve the probability of obtaining the maximum number of different ESTs, researchers have been using normalized and/or subtracted cDNA libraries, which bring the frequency of each clone in a cDNA library within a narrow range (Soares and Bonaldo 2000). However, normalization and/or subtraction procedures are in general laborious and tend to increase the proportion of small-insert clones. In the SUCEST project we implemented an efficient procedure for constructing conventional cDNA libraries suitable for large-scale EST generation from sugarcane. This paper describes the construction of these libraries, which represent all major organs harvested at different developmental stages and were used to generate one of the largest plant EST collections.

Plant material
Sugarcane tissues were obtained from commercial cultivars (Table I) grown at the Copersucar experimental station (Piracicaba, SP, Brazil), at the Universidade Federal de São Carlos experimental station (Serra do Ouro, AL, Brazil) and at the Centro de Biologia Molecular e Engenharia Genética (Campinas, SP, Brazil). After harvesting, tissues were frozen in liquid nitrogen and stored at -80 °C.

RNA isolation
Total RNA was isolated using Trizol (Invitrogen) according to the manufacturer's instructions. Due to the high carbohydrate content and the presence of phenolic compounds, total RNA from immature seeds was isolated according to the method described by Manning (1991).
Poly(A)+ mRNA was purified from total RNA using Oligotex-dT (Qiagen) according to the manufacturer's instructions. Purity and RNA integrity were assessed by absorbance at 260/280 nm and agarose gel electrophoresis.

cDNA library construction
Libraries were constructed using the SuperScript cDNA Synthesis and Plasmid Cloning Kit (Invitrogen) according to the manufacturer's protocols. One microgram of poly(A)+ mRNA was reverse-transcribed using a poly-dT primer containing the NotI site. The efficiency of cDNA synthesis was monitored with radioactive nucleotides. The second cDNA strand was then synthesized by replacing the RNA in the hybrids with DNA, using a combination of RNase H, DNA Polymerase I and DNA Ligase. After second-strand synthesis and ligation of SalI adapters, the cDNA was digested with NotI, generating cDNAs flanked by SalI sites at the 5' ends and NotI sites at the 3' ends. Excess adapters were removed and cDNAs were size-fractionated on a 40 cm long, 1 mm ID Sepharose CL-2B column. Fractions of 150 µL were collected, and an 8 µL aliquot of each fraction was electrophoresed in a 1.5% agarose gel to determine the size range of the cDNAs. Fractions containing cDNAs with a minimum size of 500 base pairs (bp) were pooled and ligated to the pSPORT1 vector (Invitrogen) predigested with SalI and NotI. The resulting plasmids were introduced into DH10B cells (Invitrogen) by electroporation. Unamplified libraries were plated, and individual colonies were picked and transferred to 96-well plates containing liquid Circle Grow (CG) medium (BIO 101) supplemented with 100 mg/L of ampicillin and 8% glycerol. Three copies of each cDNA clone were stored at -80 °C.

Template preparation and DNA sequencing
DNA template preparations and sequencing reactions were performed in a 96-well format. Plasmid templates were prepared using modified alkaline lysis (http://sucest.lad.ic.unicamp.br). Sequencing reactions were performed on plasmid templates using a quarter of the standard volume of the ABI Prism BigDye Terminator Sequencing Kit (Applied Biosystems) and either the T7 promoter primer (5'-TAATACGACTCACTATAGGG-3'), which hybridizes upstream of the SalI site in the pSPORT1 polylinker (5' end of the cDNA inserts), or the SP6 promoter primer (5'-ATTTAGGTGACACTATAG-3'), which hybridizes downstream of the NotI site (3' end of the cDNA inserts). Reaction products were precipitated with 95% ethanol using sodium acetate (3 M) and glycogen (1 g/L) as carriers and washed twice with 75% ethanol before drying under vacuum. The sequencing reaction products were analyzed on ABI 377-96 sequencers.

Sequence analysis
Sequencing of sugarcane ESTs was performed by 23 laboratories located in universities and research institutes of the State of São Paulo, and sequences were processed by the Bioinformatics Laboratory (LBI) at the Instituto de Computação, Universidade Estadual de Campinas. A detailed description of the methods used to receive, process, analyze and display the sequences, along with additional tools to help explore the sequence data, can be found in this issue (Telles et al., 2001, Telles and da Silva, 2001).

RESULTS AND DISCUSSION
The SUCEST strategy
EST programs to acquire information about the transcriptome have been carried out for hundreds of organisms, including plants and mammals. In most cases, unidirectionally cloned cDNA libraries have been prepared using bacterial or phage vectors, so that the 5' and/or 3' ends of the clones can be sequenced. Since single-pass reads yield on average ~350 high-quality nucleotides, sequencing 3' ends covers mainly the untranslated region of the transcript. Moreover, the 3' end of a cDNA clone contains a long poly-A tail that carries no biological information and in general introduces technical difficulties in the sequencing process. However, because the untranslated 3' end represents the least conserved region of a transcript, it is useful, for example, to avoid misassembly of reads coming from highly conserved sequences of gene family members. Sequencing the 5' ends of unidirectional cDNA clones, on the other hand, can be of great advantage for large-scale EST projects. Because the 5' untranslated region is shorter, a 5' read is likely to contain protein-coding sequence. In addition, because a large proportion of clones carry partial cDNA sequences, reads from different clones of the same transcript can provide enough information to assemble the full consensus sequence of that transcript, increasing the likelihood that database searches will result in the assignment of a putative function. Based on these assumptions, we decided to sequence the 5' end of the cDNA clones to build up the SUCEST database.

The libraries
Table I describes the libraries used in the SUCEST project. A variety of tissues were sampled from different cultivars in order to obtain transcript information on genes expressed in many biological systems. Two libraries, AD1 and HR1, were constructed using tissues from in vitro cultured plantlets infected with Gluconacetobacter diazotrophicus and Herbaspirillum diazotroficans. These are endophytic nitrogen-fixing bacteria that colonize sugarcane tissues (Lee et al., 2000). Sequencing from these libraries could lead to the discovery of genes involved in plant-bacteria interaction and in nitrogen assimilation in sugarcane. Libraries AM1, AM2, LB1 and LB2 were constructed using apical meristems of young plants and lateral buds from adult plants. These libraries should contribute genes expressed at the initial stages of organ differentiation. Calli produced from sugarcane meristems were used in an experiment devised to identify genes induced by cold and heat. Two-week-old calli were incubated at 4 °C or 37 °C for 12 h. Part of the tissue was maintained in the dark and part in continuous light. The CL6 library was prepared with a mixture of equal amounts of RNA extracted from these tissues, and it is expected to contribute genes induced by cold and heat. FL1, FL3, FL4, FL5 and FL8 are libraries constructed from flower tissues harvested at different developmental stages and may contribute genes expressed in this important plant organ. To obtain information on genes expressed in leaves, we constructed the LR1 and LR2 libraries from the leaf roll of adult plants and LV1 from etiolated leaves of plantlets grown in vitro. Roots, or tissues from which roots emerge, are represented by RT1, RT2 and RT3, which were constructed from roots sampled from plantlets grown in vitro or plants grown in the greenhouse, and by RZ1, RZ2 and RZ3, which were constructed from the root-to-shoot zone of young plants grown in the greenhouse. SB1 is a library constructed from the stalk bark of adult plants and may contribute genes involved in the synthesis of cell wall components, including waxes. SD1 and SD2 are libraries constructed from developing seeds. Finally, we constructed the libraries ST1 and ST3 from the first and fourth internodes of adult plants at the time of intense sucrose synthesis and accumulation.

Quality control
Large-scale sequencing demands care with the quality of biological materials and accurate performance at each step of the process, both to provide sequence data of the highest possible quality and to detect or avoid mistakes (Adams et al., 1995). At each step of the SUCEST project, from tissue sampling to sequence analysis, quality control and evaluation procedures were used to assess the accuracy of the data. The goal of the SUCEST project was to obtain cDNA libraries that contain all sequences present in the poly(A)+ mRNA population, which is useful for assessing expression profiles through electronic Northern; that are unidirectionally cloned, so that the orientation of each cDNA is known, facilitating subsequent sequence analysis; that include a large proportion of full-length inserts; and that show low levels of contamination with genomic DNA or ribosomal RNA. Table II shows the quality control steps used during cDNA library construction and sequencing. Tissues were quickly frozen in liquid nitrogen, RNA quality was analyzed by different methods, and the cDNAs were synthesized and size-selected using special gel filtration columns. cDNAs were unidirectionally cloned in the pSPORT1 plasmid vector and introduced into DH10B competent cells. Libraries with a titer below 1 × 10^4 were discarded. Colonies were placed into 96-well plates and stored at -80 °C. A sample of ~400 clones from each library was examined to evaluate library quality, including the percentage of clones with no inserts; the percentage of ESTs with exact matches to sequences derived from ribosomal RNA species, E. coli or bacteriophage lambda; the percentage of ESTs with no significant matches to any sequence in the public databases; and an estimate of the number of clusters that contain a full-length coding region sequence. Libraries selected for EST analysis typically exhibited a broad diversity of transcripts (no single gene or small group of genes dominating the distribution), a low percentage of clones with no insert, a low percentage of ribosomal RNA clones, and no evidence of contamination with sequences from other organisms. Libraries that did not meet these general criteria were discarded.
Sequencing in the SUCEST project was carried out using ABI 377 sequencers, which are prone to error during gel tracking. To minimize errors, the 8th row of each 96-well plate was used to build control plates that were resequenced. Computer analysis was then used to check that the addresses matched. This allowed the SUCEST project to keep the address error below 5%, so that a sequence in the computer corresponds, with high fidelity, to a clone in the freezer.

SUCEST data set
Table III summarizes the complete data set of the SUCEST project. A total of 259,325 cDNA clones were sequenced in their 5' end region, and 32,364 of them also had their 3' end region sequenced. Therefore, the project produced 291,689 ESTs. After trimming of low-quality sequences and removal of vector and ribosomal RNA ESTs, 237,954 ESTs potentially derived from protein-encoding messenger RNA (mRNA) remained. This represents a success index of 81.56%, which is comparable with other EST projects worldwide. Before entering the sequencing pipeline, each SUCEST cDNA library was evaluated for the average size of its cDNA inserts. cDNA libraries with an average insert size below 500 bp were discarded. The average insert size in all libraries was estimated to be 1,250 bp (n = 4,000) (Table III). Insert length ranged between 500 and 5,000 bp. In order to clone genes encoding low molecular weight proteins, we constructed some cDNA libraries (LR2, RZ2 and SD2; see Table IV) with an average insert size of 855 bp. The quality control procedures for each step in the EST process, with the specific points of evaluation or standards to be met, are listed in Table II.
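The totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts given in this section; the function name `success_index` is ours, and it assumes the index is simply the number of valid ESTs divided by the total number of ESTs, expressed as a percentage.

```python
def success_index(valid_reads: int, total_reads: int) -> float:
    """Percentage of sequenced reads that remain valid after trimming."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * valid_reads / total_reads


# Counts taken from the text of this section.
FIVE_PRIME_READS = 259_325   # clones sequenced at the 5' end
THREE_PRIME_READS = 32_364   # clones also sequenced at the 3' end
VALID_ESTS = 237_954         # ESTs remaining after trimming

total_ests = FIVE_PRIME_READS + THREE_PRIME_READS
print(total_ests)                                        # 291689
print(round(success_index(VALID_ESTS, total_ests), 1))   # 81.6
```

Under this assumption the ratio comes out close to the reported success index; a small difference in the second decimal place may reflect rounding or slightly different counts in the original calculation.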
After the trimming process, all new sequences were compared to the sequences that had already been deposited in the SUCEST database. Whenever an EST was similar to a sequence already in the database, both were grouped together in a cluster. As noted in Table V, the 237,954 valid sequences were assembled into 43,141 clusters.
Each cluster consensus sequence was compared against the non-redundant nucleotide and peptide databases (GenBank) using the programs BLASTN and BLASTX. Sequences that did not match these databases were further compared against dbEST. Using a BLAST E-value threshold (Altschul et al., 1997) of 1e-5 or lower, 26,525 (61.5%) of the 43,141 SUCEST clusters had matches with an existing sequence in GenBank (Table V). Therefore, 16,616 (38.5%) of the SUCEST clusters could potentially represent new genes. These values are comparable to those found for EST sequences from other organisms (Hwang et al. 2000; Adams et al. 1992; Claverie 1996). Ascribing functions to those anonymous sequences has therefore become one of the major bottlenecks in plant and animal genomics.
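The classification described above amounts to comparing each cluster's best BLAST hit against the E-value threshold. The following is a minimal sketch of that step, not the SUCEST pipeline itself: the function names, the dictionary layout and the toy cluster identifiers are hypothetical, and a cluster with no hit at all is represented by `None`.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def has_database_match(best_e_value, cutoff=E_VALUE_CUTOFF):
    """True when the best hit has an E-value at or below the cutoff.

    `best_e_value` is None when BLAST returned no hit for the cluster.
    """
    return best_e_value is not None and best_e_value <= cutoff


def summarize_matches(best_hits):
    """Count matched and unmatched clusters from {cluster_id: best E-value}."""
    total = len(best_hits)
    matched = sum(1 for e in best_hits.values() if has_database_match(e))
    percent = 100.0 * matched / total if total else 0.0
    return {"matched": matched,
            "unmatched": total - matched,
            "percent_matched": percent}


# Toy input with invented identifiers and E-values.
toy_hits = {"cluster_a": 1e-40, "cluster_b": 0.2,
            "cluster_c": None, "cluster_d": 1e-5}
print(summarize_matches(toy_hits))
```

Applied to the reported counts, 26,525 matched clusters out of 43,141 gives the 61.5% stated above.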
Tissue and cellular differentiation depend on specific patterns of gene expression. Therefore, in large-scale EST sequencing, sampling many different tissues under different physiological conditions increases the chance of picking up transcripts that are rare in one cell type but less rare in another. The SUCEST database was built with sequences derived from 26 libraries constructed from different tissues sampled at different developmental stages (Table I), and an average of 10,000 clones were sequenced from each library. Sequencing from many libraries resulted in a novelty ratio as good as the ratios found in other EST projects that used normalized libraries (Bonaldo et al., 1996).
Around 53.2% of the SUCEST clusters were formed by ESTs expressed in at least two libraries. This suggests that these genes are coordinately expressed in different tissues or that they are expressed in response to specific physiological conditions or developmental requirements. On the other hand, 46.8% of the clusters (Table VI, the sum of specific contributions) are formed by ESTs expressed in only one library. This suggests that these ESTs could correspond to genes expressed in a tissue- or time-specific fashion, varying with tissue and physiological conditions. Nonetheless, these data should be analyzed taking into account that 16,338 clusters (37.9%) are singletons and therefore represent rare transcripts. The uniformity in the number of singletons across the different libraries (Table VI) strengthens the value of the approach adopted.
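The quantities discussed here (clusters shared between libraries, clusters unique to one library, singletons, and the specific contribution of each library) can all be derived from a table of which library each read in a cluster came from. The sketch below illustrates that bookkeeping on invented data; the function name, the dictionary layout and the toy clusters are assumptions for illustration, and every cluster is assumed to contain at least one read.

```python
from collections import Counter


def library_statistics(clusters):
    """Summarize library membership for {cluster_id: [library code per read]}."""
    total = len(clusters)
    if total == 0:
        raise ValueError("no clusters given")
    unique_clusters = Counter()  # clusters formed by reads of one library only
    singletons = 0               # clusters formed by a single read
    for read_libraries in clusters.values():
        if len(read_libraries) == 1:
            singletons += 1
        libraries = set(read_libraries)
        if len(libraries) == 1:
            unique_clusters[next(iter(libraries))] += 1
    single_library = sum(unique_clusters.values())
    return {
        "total": total,
        "multi_library": total - single_library,
        "single_library": single_library,
        "singletons": singletons,
        "unique_clusters": dict(unique_clusters),
        # Unique clusters of each library as a percentage of all clusters.
        "specific_contribution": {lib: 100.0 * n / total
                                  for lib, n in unique_clusters.items()},
    }


# Toy data: library codes are real SUCEST names, cluster contents are invented.
toy_clusters = {
    "c1": ["RT1", "LR1", "RT1"],  # shared by two libraries
    "c2": ["RT1", "RT1"],         # unique to RT1
    "c3": ["SD1"],                # singleton, unique to SD1
    "c4": ["FL1", "SD1"],         # shared by two libraries
}
stats = library_statistics(toy_clusters)
print(stats["multi_library"], stats["single_library"], stats["singletons"])
print(stats["specific_contribution"])
```

Summing the specific contributions over all libraries gives the fraction of clusters found in only one library, which is how the 46.8% figure above is described.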
A global analysis of all SUCEST clusters indicated that around 33% contain cDNA clones with full-length inserts (Table V). This is in accordance with the results obtained in the mouse EST project (Marra et al., 1999). This collection of 237,954 ESTs provides a preliminary view of the gene expression profile of sugarcane. The identification of genes involved in different cellular processes suggests that the generation of large-scale ESTs should provide valuable insights into the molecular mechanisms of plant function and development. ESTs were clustered using the CAP3 assembler (Huang and Madan, 1999).
The E-value cutoff threshold for considering C or S as having homology to other proteins in the nr GenBank database using BLASTX was < 10^-5. Clones were considered as having a putative full-length insert when their sequences started within the first 15 amino acids of their hit in GenBank. C or S were considered as having a tentative full consensus sequence when their sequences started within the first 15 amino acids and finished within the last 15 amino acids of their hit in GenBank. The number of clusters that contain one or more reads from a specific library is indicated, as well as the clusters that were formed only by reads of a specific library (Unique Clusters). The number of clusters that were formed by only one read (Singleton) is also indicated. The specific contribution is calculated by dividing the Unique Clusters of each library by the total number of clusters (43,141).
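The two full-length criteria above can be expressed as simple tests on the alignment coordinates of the best BLASTX hit. This is a minimal sketch under stated assumptions: coordinates are 1-based positions on the hit protein, "within the first 15 amino acids" is read as a start position of 15 or less, and "within the last 15" as an end position among the final 15 residues. The function names and category labels are ours.

```python
TERMINAL_WINDOW = 15  # amino acids, as stated in the text


def starts_near_n_terminus(hit_start, window=TERMINAL_WINDOW):
    """Alignment begins within the first `window` residues of the hit."""
    return 1 <= hit_start <= window


def ends_near_c_terminus(hit_end, hit_length, window=TERMINAL_WINDOW):
    """Alignment finishes within the last `window` residues of the hit."""
    return hit_length - window < hit_end <= hit_length


def classify_coverage(hit_start, hit_end, hit_length):
    """Label a sequence by how much of its protein hit the alignment spans."""
    starts = starts_near_n_terminus(hit_start)
    ends = ends_near_c_terminus(hit_end, hit_length)
    if starts and ends:
        return "tentative full consensus"
    if starts:
        return "putative full-length start"
    return "partial"


# Invented coordinates on a 300-residue hit.
print(classify_coverage(3, 298, 300))   # tentative full consensus
print(classify_coverage(10, 150, 300))  # putative full-length start
print(classify_coverage(40, 300, 300))  # partial
```

The exact handling of the boundary positions is a choice made here for illustration; the text does not specify whether the limits are inclusive.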

Table I -
Description of the SUCEST Libraries.

Table II -
Quality control and evaluation of SUCEST libraries.

Table III -
Summary of SUCEST data. Numbers of sequenced cDNA clones and generated ESTs from 26 libraries constructed from different sugarcane tissues. 259,325 ESTs were generated by sequencing the 5' end of cDNA clones. Another 32,364 ESTs were generated by sequencing the 3' end of cDNA clones. The average insert size was calculated for 400 cDNA clones from each library. The EST length and the number of bases with Phred quality ≥ 20 were calculated from the total EST set.

Table IV -
Characteristics of the SUCEST libraries. The average insert size of each library was determined in a sample of 400 clones by gel electrophoresis of the clones digested with PvuII. Valid reads are defined as reads containing at least 140 bp with Phred quality ≥ 20. The success index is the number of valid reads in relation to the number of clones sequenced. Novelty represents the probability of finding a new sequence in the library.
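The valid-read definition and the per-library success index given in this caption can be sketched as follows. This is an illustration under stated assumptions, not the SUCEST trimming procedure: a read is represented as a list of per-base Phred scores, high-quality bases are counted anywhere in the read rather than as one contiguous stretch (the caption does not say which), and the function names are ours.

```python
MIN_QUALITY = 20        # Phred threshold from the caption
MIN_GOOD_BASES = 140    # minimum number of high-quality bases from the caption


def count_high_quality_bases(phred_scores, min_quality=MIN_QUALITY):
    """Number of bases whose Phred score is at or above the threshold."""
    return sum(1 for q in phred_scores if q >= min_quality)


def is_valid_read(phred_scores, min_bases=MIN_GOOD_BASES,
                  min_quality=MIN_QUALITY):
    """True when the read has at least `min_bases` high-quality bases."""
    return count_high_quality_bases(phred_scores, min_quality) >= min_bases


def library_success_index(reads):
    """Valid reads as a percentage of all reads sequenced from a library."""
    if not reads:
        raise ValueError("no reads given")
    valid = sum(1 for scores in reads if is_valid_read(scores))
    return 100.0 * valid / len(reads)


# Invented quality profiles for four reads.
toy_reads = [[30] * 200, [10] * 200, [25] * 140, [25] * 100]
print(library_success_index(toy_reads))  # 50.0
```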

Table V -
Statistics of EST clustering and contiging.