Classification, expression pattern and comparative analysis of sugarcane expressed sequence tags (ESTs) encoding glycine-rich proteins (GRPs)

Since the isolation of the first glycine-rich proteins (GRPs) in plants, a wealth of new GRPs has been identified. The highly specific but diverse expression pattern of grp genes, taken together with the distinct sub-cellular localization of some GRP groups, clearly indicates that these proteins are involved in several independent physiological processes. Notwithstanding the absence of a clear definition of the role of GRPs in plant cells, studies conducted with these proteins have provided new and interesting insights into the molecular biology and cell biology of plants. Complexly regulated promoters and distinct mechanisms for the regulation of gene expression have been demonstrated, and new protein targeting pathways, as well as the export of GRPs from different cell types, have been discovered. These data show that GRPs can be useful as markers and/or models to understand distinct aspects of plant biology. In this paper, the structural and functional features of these proteins in sugarcane (Saccharum officinarum L.) are summarized. Since this is the first description of GRPs in sugarcane, special emphasis has been given to the expression pattern of these GRP genes by studying their abundance and prevalence in the different cDNA libraries of the Sugarcane Expressed Sequence Tag (SUCEST) project. The comparison of sugarcane GRPs with GRPs from other species is also discussed.


INTRODUCTION
The occurrence of quasi-repetitive glycine-rich peptides has been reported in several different organisms. Glycine-rich regions are thought to be involved in protein-protein interactions in at least three families of mammalian proteins: keratins, loricrins and a series of single-stranded RNA-binding proteins (Steinert et al., 1991).
Glycine-rich proteins (GRPs) are distinguished by their characteristic primary structure, having a glycine content of 20 to 70% which forms glycine-rich domains predominantly arranged in quasi-repetitive (Gly-X)n motifs. In plants, the expression of genes encoding GRPs is developmentally regulated, and also induced by physical, chemical and biological factors such as wounding, virus infection, circadian rhythms, temperature, salinity, drought, flooding, light, salicylic acid, abscisic acid and ethylene. Besides having highly modulated expression, several GRPs also show tissue-specific expression patterns (reviewed in Sachetto-Martins et al., 2000).
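The glycine-content criterion described above can be illustrated with a minimal sketch. The function names and the toy peptide are hypothetical, and the 20 to 70% window is simply the range quoted in the text, not a formal classification rule.

```python
def glycine_fraction(seq: str) -> float:
    """Return the fraction of residues in `seq` that are glycine (G)."""
    seq = seq.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def in_grp_range(seq: str, low: float = 0.20, high: float = 0.70) -> bool:
    """Check whether the glycine fraction lies in the 20-70% window quoted in the text."""
    return low <= glycine_fraction(seq) <= high


# Hypothetical toy peptide (not a real GRP): 6 glycines out of 12 residues.
toy_peptide = "GAGSGYGHLKGG"
print(glycine_fraction(toy_peptide), in_grp_range(toy_peptide))
```

A glycine fraction inside this window is only a first filter; the quasi-repetitive arrangement of the glycines is what distinguishes GRPs from proteins that merely contain many glycine residues.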
The first clue for the existence of GRPs came from the observation that, in certain plant tissues, glycine is the major fraction of total nitrogen. Examples are soybean and the seed-coat of gourds, which contain 21% glycine, the cell wall of milkweed stem (31% glycine) and oat coleoptile cells (27% glycine) (Varner and Cassab, 1986). Glycine-rich proteins have indeed been isolated from pumpkin seed-coat (47% glycine) by Varner and Cassab (1986) and strawberry fruit (49% glycine) by Reedy and Poovaiah (1987). The first grp gene was isolated by Condit and Meagher (1986). The protein encoded by this gene (PtGRP-1) contains 67% glycine residues and has a predicted density very similar to the pumpkin protein isolated by Varner and Cassab (1986). These data, together with the identification of a eukaryotic signal peptide at the N-terminus of PtGRP-1, suggest that this protein is indeed a member of the GRP class of cell wall proteins.
Because the GRPs first described were cell wall located, a structural function has always been attributed to the protein encoded by every new grp gene described. However, the accumulated data clearly indicate that this concept is an oversimplification and GRPs may have very diverse locations and functions, with the only common feature among all the different GRPs being the presence of glycine-repeat domains (Steinert et al., 1991). As discussed by Steinert et al. (1991) in mammalian keratins, the glycine-repeat domains are highly flexible and may act as a "velcro" zipper in protein-protein interactions. This high flexibility makes GRPs good candidates to work in conjunction with other proteins and other macromolecules. GRPs do not necessarily perform structural functions, but it is possible that their glycine-rich regions allow these proteins to assume a structure necessary for their correct conformation and/or interactions with other proteins (Sachetto-Martins et al., 2000). It is quite possible that GRPs are components of different multi-molecular complexes, with the glycine-rich domain being required for the stabilization and flexibility of molecular interactions in such structures. Alberts (1998) introduced the idea of the cell as a collection of protein machines, and because multiple molecular systems play distinct roles in cell physiology they need to be differentially regulated, which could explain the highly complex gene expression pattern of GRPs.
The large spectrum of modulation and subcellular locations, together with the broad structural diversity of the GRPs (even in a plant with a compact genome such as Arabidopsis thaliana), indicates that GRPs do not represent a unique class of proteins. Instead, plant GRPs probably represent a diverse set of proteins which are not necessarily related (Sachetto-Martins et al., 2000).
In this paper we describe the isolation of sugarcane GRPs and summarize the structural and functional features of these proteins. During our research, special emphasis was placed on the expression pattern of these genes by studying their abundance and prevalence in the different cDNA-libraries of the Sugarcane Expressed Sequence Tag (SUCEST) project. The comparison of GRPs from sugarcane with GRPs from other species will also be discussed.

MATERIALS AND METHODS
Sequence data, alignment and phylogenetic analysis
A Basic Local Alignment Search Tool (BLAST) tBLASTn search (Altschul et al., 1997) was performed against the full SUCEST expressed sequence tag (EST) data bank, using as baits the GRP sequences listed in a recent review (Sachetto-Martins et al., 2000); clustering was done with the fragment assembly program Phrap (www.pharap.com). The Clustal W program (Thompson et al., 1994) was used to align the reference GRP sequences (standards) with the proteins deduced from the SUCEST clusters.
Phylogenetic analysis was performed using the Molecular Evolutionary Genetics Analysis (MEGA) software (Kumar et al., 2000). The neighbor-joining distance method was used with the pairwise deletion option for the treatment of amino acid gaps during the multiple alignment of sugarcane GRPs. For construction of the phylogenetic tree the confidence levels assigned at various nodes were determined with 5000 replications using the Interior Branch test (Sitnikova et al., 1995).
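To make the pairwise deletion option concrete, the sketch below computes a simple proportion-of-differences distance (p-distance) between aligned sequences, skipping gapped columns only for the pair being compared. This is not the MEGA implementation; the choice of p-distance, the function name and the toy alignment are assumptions made for illustration, since the text does not state which distance measure was used.

```python
def p_distance_pairwise_deletion(a: str, b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored for this pair only
    (pairwise deletion), rather than being removed from the whole alignment.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # drop this column for this pair only
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites shared by the two sequences")
    return differing / compared


# Hypothetical toy alignment (not SUCEST data).
s1 = "GGYG-GRG"
s2 = "GGYGAGRG"
s3 = "GG-GAGKG"
print(p_distance_pairwise_deletion(s1, s3))  # 1 difference over 6 shared sites
```

Under complete deletion, columns 3 and 5 would be removed for every pair; with pairwise deletion, s2 and s3 are still compared over seven sites, which matters for GRP alignments where gaps are frequent.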

SUCEST cDNA libraries
All sugarcane (Saccharum officinarum L.) sequences used during this work were obtained from the SUCEST project and derived from cDNA libraries specific to different sugarcane tissues, organs or conditions of growth (for detailed information see http://sucest.lad.ic.unicamp.br/cgi-bin/prod/BD/webobjects/LibraryList.pl).

RESULTS AND DISCUSSION
Distribution of glycine-rich protein genes in the SUCEST data bank
As the name of this collection of proteins suggests, glycine repeats are the major structural feature of GRPs. The glycine-rich domains of plant GRPs consist of sequence repeats that can be summarized by the formula (Gly)n-X. However, several other distinct motifs can also be identified (Table I and Figure 1). Among these, two frequently observed consensus sequences split GRPs into two major groups: group 1, with the consensus targeting the endoplasmic reticulum (signal peptide), which is present in most of the GRPs identified up to now, and group 2, with the RNA-binding consensus sequences RNP-1 and RNP-2. As discussed by Sachetto-Martins et al. (2000), other motifs also exist, e.g. the RNA-recognition motif (RRM), the oleosin-conserved domain, the cold-shock domain, CCHC zinc-fingers, the C-rich carboxy-terminus, the amphiphilic α-helix and H-rich, P-rich and T-rich sequences. Five different classes of GRPs are shown in Table I: three groups based on the pattern of the glycine-rich repeats (class I, GGGX; class II, GGXXXGG; class III, GXGX) and two other groups based on the type of functional conserved motif (class IV, the oleosin glycine-rich proteins, and class V, the RNA-binding GRPs).
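The three repeat-based classes can be illustrated by counting motif occurrences in a deduced protein sequence. The sketch below is only one possible reading of the patterns: it assumes that X stands for any non-glycine residue, counts overlapping occurrences, and uses hypothetical function names and toy peptides. The actual class assignment in Table I also relies on conserved functional motifs, which are not modelled here.

```python
import re

# Assumption: "X" in GGGX, GGXXXGG and GXGX means any residue other than glycine.
# Lookahead patterns are used so that overlapping repeats are all counted.
REPEAT_PATTERNS = {
    "class I (GGGX)": re.compile(r"(?=GGG[^G])"),
    "class II (GGXXXGG)": re.compile(r"(?=GG[^G]{3}GG)"),
    "class III (GXGX)": re.compile(r"(?=G[^G]G[^G])"),
}


def count_repeats(seq: str) -> dict:
    """Count (possibly overlapping) occurrences of each repeat pattern."""
    seq = seq.upper()
    return {name: len(pattern.findall(seq)) for name, pattern in REPEAT_PATTERNS.items()}


def dominant_repeat(seq: str):
    """Return the most frequent repeat pattern, or None if no pattern occurs."""
    counts = count_repeats(seq)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


# Hypothetical toy peptides, one per repeat type.
for peptide in ("GGGAGGGYGGGS", "GGYHRGGNQRGG", "GAGSGYGH"):
    print(peptide, dominant_repeat(peptide))
```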
To search the SUCEST database for genes encoding glycine-rich proteins, we used typical GRP protein sequences described for other species as well as the common motifs frequently found in these proteins. Different types of GRPs were found, a total of 150 different clusters being distributed among the general classes summarized in Table I, representing the first isolation of sugarcane GRP genes. These data show that the SUCEST sequences are equivalent to the large collection of GRPs from monocotyledonous plants already published. Even when compared in absolute terms, the number of GRP genes listed in this paper surpasses the number of GRPs available before, suggesting that our inventory might contain almost all the GRPs expressed in monocotyledonous plants.
Oleosin GRPs are specifically expressed in the anther tapetum layer of Brassica and Arabidopsis (both dicotyledonous plants). The presence of oleosin-conserved sequences in GRPs is indicative of the involvement of these GRPs in lipid stabilization during tapetal development, as upon tapetal degradation lipid-containing complexes are sorted to the pollen coat to become part of tryphine (Ting et al., 1998). Interestingly no member of the oleosin-GRP class was found in the SUCEST data bank. The absence of oleosin-GRPs in sugarcane and the fact that this kind of GRP has never been isolated from a monocotyledonous plant is evidence of the differences between the anthers and pollen grains of monocotyledonous and dicotyledonous plants.
The distribution of each EST sequence between the different SUCEST libraries was also analyzed. The SUCEST data bank comprises 291,689 EST sequences arranged in 43,141 clusters, with the reads coming from 37 different cDNA libraries constructed from different plant tissues under different culture conditions (see http://sucest.lad.ic.unicamp.br/cgi-bin/prod/BD/webobjects/LibraryList.pl). Since several GRP genes present tissue-specific expression in other plants, we analyzed the distribution of the reads from each cluster per library, considering as preferentially expressed all those clusters in which the reads came from two different libraries at most. Several clusters presented a putative tissue-specific expression pattern.
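The criterion for preferential expression (reads from at most two different libraries) can be written as a short sketch. The input format, the function names, and the cluster and library labels in the toy data are hypothetical; only the threshold of two libraries and the overall EST and cluster totals come from the text.

```python
from collections import defaultdict


def libraries_per_cluster(read_assignments):
    """Map each cluster to the set of libraries that contributed reads to it.

    `read_assignments` is an iterable of (cluster_id, library_id) pairs,
    one pair per read (an assumed input format).
    """
    libraries = defaultdict(set)
    for cluster_id, library_id in read_assignments:
        libraries[cluster_id].add(library_id)
    return dict(libraries)


def preferentially_expressed(read_assignments, max_libraries=2):
    """Return the clusters whose reads come from at most `max_libraries` libraries."""
    libraries = libraries_per_cluster(read_assignments)
    return sorted(c for c, libs in libraries.items() if len(libs) <= max_libraries)


# Hypothetical toy data: cluster_B has reads from three libraries and is excluded.
toy_reads = [
    ("cluster_A", "RT1"), ("cluster_A", "RT2"), ("cluster_A", "RT1"),
    ("cluster_B", "RT1"), ("cluster_B", "FL1"), ("cluster_B", "LB1"),
    ("cluster_C", "FL1"),
]
print(preferentially_expressed(toy_reads))

# Average cluster size implied by the totals quoted in the text.
mean_reads_per_cluster = 291689 / 43141
print(round(mean_reads_per_cluster, 2))
```

Because the average cluster contains fewer than seven reads, a cluster restricted to one or two libraries is suggestive rather than conclusive evidence of tissue specificity, which is why the text speaks of a putative expression pattern.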

Sugarcane ESTs encoding GRPs with GGGX repeats
The typical GGGX array is the most frequently observed array, and is generally present in GRPs displaying a signal peptide and high (40% to 70%) glycine content. However, it can also be observed in GRPs without a signal peptide but with RNA-binding sequences, in which case the repeats are usually GGGY (Sachetto-Martins et al., 2000).
The existence of a signal peptide consensus sequence in the N-terminal end of most GRPs suggests that these proteins are located in the cell wall or cell membrane. Studies with PvGRP-1.8 have shown that this GRP is localized in the cell wall of proto-xylem elements and, surprisingly, is exported to these cells rather than synthesized there (Ryser and Keller, 1992). The same pattern of exportation to a non-producing cell has been observed for PtGRP-1 (Condit, 1993).
By analogy with HRGPs, Sachetto-Martins et al. (2000) have proposed that cell wall GRPs have a structural function, probably acting as a scaffold or agglutinating agent for deposition of cell wall constituents. The detection of PvGRP-1.8 in tissues undergoing lignification might indicate some interaction between these two cell wall components. The presence of tyrosine residues in many plant cell wall GRPs, the progressive reduction in solubility of PvGRP-1.8 and soybean GRP during development and the detection of a higher molecular weight isoform of ZmGRP-4 in root cap mucilage may indicate that the association of these proteins with themselves or with other components of the extracellular matrix might occur via isodityrosine cross-links, as described for the extensins (Sachetto-Martins et al., 2000).
Thirty-seven clusters encoding GRPs with GGGX repeats were identified in the SUCEST database. These sequences were different from each other but related to different types of GRPs previously described from monocotyledonous and dicotyledonous plants (Table II). Among these 37 clusters, 28 were isolated from only one or two types of libraries, indicating that these genes could correspond to GRPs with a potential tissue-specific expression pattern. Four of these clusters (SCBFAD1094D08.g, SCEQHR1081H03.g, SCRLAD1102C04.g and SCSFHR1043H11.g) were specifically identified in the libraries prepared from sugarcane infected with the nitrogen-fixing endosymbiotic bacteria Herbaspirillum rubrisubalbicans (HR) and Gluconacetobacter diazotrophicus (AD), suggesting a possible role for these genes in the mechanism of plant-microbe interaction. Four leaf-specific clusters (SCACLB1046B05.g, SCCCLB1C04A01.g, SCJLLR1104A01.b and SCRFLR2021C05.g) and four flower-specific clusters (SCQSFL1127H07.g, SCSBFL1046H04.g, SCVPFL1069E01.g and SCRLFL3006C09.g) were also observed.
To analyze the similarities between the 37 sugarcane GRPs and the 20 genes previously published by Sachetto-Martins et al. (2000) we aligned these sequences, but, as expected due to the high variation in the structure of this type of GRP, the alignments obtained presented many gaps and regions with no sequence overlapping, which made the construction of a dendrogram impossible.

Sugarcane ESTs encoding GRPs with N- and C-terminal domains homologous to nodulins
The GGXXXGG repeat occurs in a group of GRPs in which some members share similarities with soybean nodulin 24 (Sandal et al., 1992), the tripeptide interspersed between the glycine residues being generally composed of Y, H, R, N or Q. Some of these GRPs have been proposed to be located at the interface between the host plant membrane and the matrix surrounding the endosymbiont (Sandal et al., 1992). Tripeptides with similar composition have been observed in non-GRP extracellular matrix proteins. We propose that this tripeptide region may be a ligand attachment region, in analogy to the RGD motif of the extracellular matrix adhesive proteins of mammals. The tripeptide repeats in plant GRPs may also represent interactive sites for association with other proteins or with other cellular structures (Sachetto-Martins et al., 2000).
The binding function of this type of GRP has been recently proved by the interaction of AtGRP-3 with the extracellular domain of cell wall associated kinases (WAKs). It was shown that phosphorylation modulates this interaction and that the cysteine-rich C-terminal domain of AtGRP-3 is responsible for this interaction in vitro. AtGRP-3 regulates Wak1 function through binding to its cell wall domain and the interaction of Wak1 with AtGRP-3 occurs in a pathogenesis-related process in planta (Park et al., 2001).
Searching the SUCEST data bank using the previously reported GRPs with N- and C-terminal domains homologous to nodulins, as well as the cysteine-rich domain, allowed us to identify 8 different GRPs (Table III). Although these 8 sequences are not very similar to each other, they appear to be related to the previously described GRP-1 and GRP-2 from barley. Four of these clusters were preferentially observed in one or two types of libraries. Alignments conducted with the sugarcane ESTs and other similar GRPs resulted in the unrooted tree shown in Figure 2. Two well-separated groups can be seen. All the sugarcane genes group together, being more closely related to the two barley GRPs and the several Arabidopsis AtGRP-3-like genes.

Sugarcane ESTs encoding GRPs with lower glycine content
This last pattern of glycine repeats, GXGX, is generally observed in GRPs with an average glycine content of 20%. Similar to the GGGX group (Table II), this GRP group shows a high degree of structural diversity and probably contains several different types of GRPs. Twenty different clusters were identified, of which eleven were isolated from only one or two library types (Table IV). Interestingly, six of these genes show preferential expression in the root system (SCACHR1035G08.g, SCCCHR1001B01.g, SCCCRT3001B10.g, SCCCRT3008C02.g, SCRURT2014C05.g and SCVPRT2080B02.g), and are also related to two previously identified root-specific GRPs of maize. As observed for the GGGX GRPs, the high variation in the structures of the GXGX GRPs did not allow the construction of a dendrogram.

Sugarcane ESTs encoding RNA-binding GRPs
Several GRPs with RNA-binding sequences have been identified in different plant species, most of them having an 80 to 100 amino acid N-terminus conserved region (the RNA-Recognition Motif (RRM) or Consensus Sequence-type RNA-Binding Domain (CS-RBD)) containing two conserved sequences (RNP-1 and RNP-2). Two plant GRPs (AtGRP-2 and NsGRP-2) have different RNA-binding sequences at their N-terminus. Instead of an RRM domain they have a cold-shock domain with only the RNP-1 sequence (Sachetto-Martins et al., 2000). Some of these GRPs are known to have in vitro RNA binding activity (Ludevid et al., 1992; Hirose et al., 1993; Freire et al., 1995; Hanano et al., 1996; Dunn et al., 1996) and also bind single-stranded DNA (Hirose et al., 1993; Dunn et al., 1996) and are phosphorylated in vitro (Dunn et al., 1996) and in vivo (Freire et al., 1995). Deletion assays conducted with the RZ-1 protein have shown that the RNA-binding domain and the C-terminal glycine-rich region are essential for RNA-binding activity (Hanano et al., 1996). Immuno-precipitation studies have shown that the MA16 protein interacts with RNAs through a complex association with several proteins (Freire et al., 1995). These results, together with the immuno-localization of SaGRP (Heintzen et al., 1994), MA16 (Albà et al., 1994) and RZ-1 proteins (Hanano et al., 1996) in the nucleus, have led to the hypothesis that these GRPs may be involved in RNA processing, maturation or the control of gene expression. The modulation of some RNA-binding GRPs by factors such as ABA (abscisic acid), salinity, wounding, cold-stress and circadian rhythm may reflect their involvement in the modulation of the pathways activated by these stimuli (Sachetto-Martins et al., 2000).
In addition to their glycine-rich and RNA-binding motifs, AtGRP-2 (de Oliveira et al., 1990), NsGRP-2 (Obokata et al., 1991) and RZ-1 (Hanano et al., 1996) contain one or two CCHC zinc-fingers. Similar domains have been observed in yeast and mammalian splicing factors, as well as in several retrovirus GAG proteins and in the human nucleic acid binding protein CNBP (Kingsley and Palis, 1994), and it is possible that these proteins represent components of the plant cell splicing machinery (Hanano et al., 1996).
Based on their structural features, the RNA-binding GRPs can be classified into three different subclasses. Proteins from the first subclass show an RRM conserved motif at the N-terminal end followed by a glycine-rich region with GGYGG repeats. GRPs from the second subclass show a similar organization, but present a CCHC zinc-finger inside their glycine-rich region. Proteins from the third subclass are organized with a cold-shock domain at the N-terminus and two copies of the CCHC zinc-finger in their glycine-rich region (Table I and Figure 1). Our analysis of the SUCEST database identified a novel group of RNA-binding GRPs (subclass IV, clusters SCCCCL3005A03.b and SCCCLR1C01G05.g), this new subclass having two copies of the RRM motif followed by a C-terminal glycine-rich region distinct from that of previous GRPs.
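The domain architectures that define the four subclasses can be summarized as a simple decision rule. The sketch below assumes that domain counts have already been obtained from some upstream motif search; the `DomainSummary` type, the function name and the use of exact counts (one CCHC finger for subclass II, two for subclass III) are illustrative assumptions based on the descriptions in the text, not a published classifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DomainSummary:
    """Hypothetical per-protein domain counts from an upstream motif search."""
    rrm: int = 0               # number of RNA-recognition motifs
    cold_shock: int = 0        # number of cold-shock domains
    cchc: int = 0              # number of CCHC zinc-fingers
    glycine_rich: bool = True  # presence of a glycine-rich region


def rna_binding_subclass(d: DomainSummary) -> Optional[str]:
    """Assign an RNA-binding GRP subclass from its domain architecture."""
    if not d.glycine_rich:
        return None
    if d.cold_shock >= 1 and d.rrm == 0 and d.cchc == 2:
        return "III"  # cold-shock domain plus two CCHC zinc-fingers
    if d.cold_shock == 0 and d.rrm == 2 and d.cchc == 0:
        return "IV"   # two RRM copies followed by a glycine-rich region
    if d.cold_shock == 0 and d.rrm == 1 and d.cchc == 1:
        return "II"   # one RRM plus a CCHC zinc-finger
    if d.cold_shock == 0 and d.rrm == 1 and d.cchc == 0:
        return "I"    # one RRM followed by a glycine-rich region
    return None       # architecture not covered by the four subclasses


print(rna_binding_subclass(DomainSummary(rrm=2)))
```

Architectures that fit none of the four descriptions are returned as unclassified rather than forced into the nearest subclass.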
The RNA-binding GRPs from subclasses I and III show a high degree of divergence, and these differences introduced a high number of gaps during the alignment of all the RNA-binding sequences and did not allow correlation between the sequences or construction of a dendrogram. To analyze the correlation between the four RNA-binding GRP subclasses, the alignments were constructed mutually excluding GRPs from subclasses I and III (Figures 3 and 4). The observation that subclass IV RNA-binding GRPs remain a separate group in both analyses justifies this treatment and shows the existence of a fourth subclass of RNA-binding GRPs.
Subclass I RNA-binding GRPs were subdivided into at least three subgroups, subgroups Ia and Ib being composed of sequences from monocotyledonous plants and subgroup Ic containing all the sequences from dicotyledonous plants (Figure 3). We found 62 SUCEST clusters encoding subclass I RNA-binding GRPs (Table I). These clusters are preferentially related to the CHEM2, MA16, S1 and S2 GRPs and present e-values ranging from 7e-39 to 9e-81. In general, these clusters were expressed in a large spectrum of libraries, being represented in virtually all the libraries analyzed. These results indicate that the transcripts of these genes accumulated at high levels in most of the organs investigated, suggesting that they may be constitutively synthesized and involved in fundamental cell processes. However, four clusters presented a more restricted expression pattern and seven (SCAGHR1015F04.g, SCCCHR1001D05.g, SCCCHR1001H05.g, SCEZAD1082F12.g, SCJLFL1048D02.g, SCJLHR1029C09.g, and SCSGFL4C07D05.g) were preferentially expressed in flower libraries. Some clusters were expressed during infection with nitrogen-fixing endosymbionts (Table V).
Eleven subclass II RNA-binding GRPs were detected in the SUCEST database, and five of them presented a preferential expression pattern (Table V). GRPs of this subclass have only recently been isolated from monocotyledonous plants (Ni et al., 2000), and the detection of sugarcane GRPs from this class may help to elucidate their biological function.
Subclass III GRPs have never before been isolated from monocotyledonous plants, but we identified ten clusters in the SUCEST data bank, six of them representing putative tissue-specific genes (Table V). The principal characteristics of these GRPs are the combination of two well-defined nucleic acid-binding motifs, a cold-shock domain and CCHC zinc-fingers.
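As a consistency check, the per-class cluster counts reported in the text can be summed and compared with the overall total of 150 clusters. The count of two for subclass IV is inferred from the two cluster names listed for that subclass and is therefore an assumption; the dictionary keys are descriptive labels chosen for this sketch.

```python
# Cluster counts as stated in the text; the subclass IV count is inferred
# from the two cluster names given for that subclass (an assumption).
clusters_per_group = {
    "GGGX repeats": 37,
    "nodulin-like N- and C-terminal domains (GGXXXGG)": 8,
    "lower glycine content (GXGX)": 20,
    "RNA-binding subclass I": 62,
    "RNA-binding subclass II": 11,
    "RNA-binding subclass III": 10,
    "RNA-binding subclass IV": 2,
}

total_clusters = sum(clusters_per_group.values())
rna_binding_clusters = sum(
    n for name, n in clusters_per_group.items() if name.startswith("RNA-binding")
)

print(total_clusters, rna_binding_clusters)
```

Under this assumption the counts add up to the 150 clusters reported, with the RNA-binding GRPs accounting for more than half of them.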
The diversity of the glycine-repeats in GRPs suggests that some of these genes may have originated from an ancestral gene by recombination events. Genes of GRPs organized in tandem on the genome have been isolated in Arabidopsis thaliana and bean. The high GC content of the grp genes makes them hot spots for recombination, and this mechanism is responsible for the genetic diversity of the mammalian keratin gene family in which evolutionary divergence is due to recombination at glycine-rich regions (Steinert et al., 1991). Because of its multi-species origin, sugarcane is thought to have one of the most complex plant genomes, with a variable number of chromosomes (generally 2n = 70-120) and a large DNA content. Hot-spot recombination could explain the high variability in the different classes of sugarcane GRPs.
Several GRPs were expressed in a tissue-specific manner, and the identification of the cell types in which GRPs are expressed opens the possibility of addressing questions on the function of these genes. In other plant species GRPs have been described that are preferentially expressed in xylem (Keller et al., 1989), phloem (Condit, 1993), epidermis (Sachetto-Martins et al., 1995), tapetum (Ferreira et al., 1997), embryo (Gòmez et al., 1988) and roots (Goddemeier et al., 1998). Tissue-specific genes have received a great deal of attention because of their biotechnological potential. Using the promoter of the tapetum-specific ta29 GRP, male-sterile tobacco and Brassica plants were obtained for hybrid production in crop improvement programs (Mariani et al., 1990). The identification of sugarcane GRP genes preferentially expressed in specific organs opens up the possibility of isolating tissue-specific promoters that can be used to drive the expression of exogenous genes in sugarcane as well as in other monocotyledonous plants.