Identification, classification and expression pattern analysis of sugarcane cysteine proteinases

Cysteine proteases are peptidyl hydrolases dependent on a cysteine residue at the active center. The physical and chemical properties of cysteine proteases have been extensively characterized, but their precise biological functions are not yet completely understood, although they are known to be involved in a number of events such as protein turnover, cancer, germination, programmed cell death and senescence. Protein sequences from different cysteine proteinases, classified as members of the E.C.3.4.22 sub-sub-class, were used to perform a T-BLAST-n search on the Brazilian Sugarcane Expressed Sequence Tags project (SUCEST) data bank. Sequence homology was found with 76 cluster sequences that corresponded to possible cysteine proteinases. The alignments of these SUCEST clusters with the sequences of cysteine proteinases of known origin provided important information about the classification and possible function of these sugarcane enzymes. Inferences about the expression pattern of each gene were made by direct correlation with the SUCEST cDNA libraries from which each cluster was derived. Since no previous reports of sugarcane cysteine proteinase genes exist, this study represents a first step in the study of new biochemical, physiological and biotechnological aspects of sugarcane cysteine proteases.


INTRODUCTION
Proteinases, or endopeptidases, are enzymes that catalyze the hydrolysis of peptide bonds within proteins. Based on their catalytic mechanisms, these enzymes are classified as serine, cysteine, aspartic or metallo-proteinases (Barrett, 1980). Cysteine or thiol proteinases (EC 3.4.22) are those that contain a cysteine residue in the active site. These proteases have been identified in phylogenetically diverse organisms, such as bacteria, eukaryotic micro-organisms, plants and animals (Rawlings and Barrett, 1994).
More than 30 families of peptidases, grouped in at least six clans (or superfamilies), make up the class of cysteine proteases. Members of the six major clans are defined according to the nature and linear organization of the catalytic residues along the primary sequence, as follows: Clan CA has the catalytic residues Cys, His and Asn or Asp ordered in sequence; Clan CD presents two catalytic residues, His and Cys, in sequence; Clan CE has a triad formed by His, Glu or Asp and Cys at the C-terminus; Clan CF also presents a catalytic triad, but ordered as Glu, Cys and His; Clan CG has a dyad of two cysteine residues; and Clan CH presents a Cys, Thr and His triad with the catalytic cysteine at the N-terminus (Rawlings and Barrett, 2000).
A feature common to all cysteine proteinases with known three-dimensional structure is a bi-lobed structure, with the catalytic site located in the cleft between the lobes (Rawlings and Barrett, 2000). The papain superfamily, or clan CA, comprises the best-known cysteine peptidases and has the catalytic residues Cys-25 and His-159 conserved in all of its members. They are synthesized as preproenzymes and are located in lysosomes or analogous organelles. The most studied cysteine proteinase is papain, from Carica papaya, which represents the typical member of this superfamily.
Sequence analysis has revealed that other higher plant cysteine proteinases and cathepsins B, H, L and S from mammalian lysosomes are members of the papain C1A family. In addition, bleomycin hydrolase (family C1B), calpains (family C2), streptopain (family C10) and viral proteases also belong to this superfamily (Rawlings and Barrett, 2000). The calpains are cytoplasmic, calcium-dependent cysteine proteases, which differ in requiring micro- or millimolar concentrations of Ca2+ for activity and have a highly conserved molecular structure (Croall and Demartino, 1991).
In mammals, cysteine proteases such as lysosomal cathepsins comprise a group of small proteases having Mr values of less than 30,000 and are active at acidic pH. They are synthesized as precursor molecules, which contain N-terminal signal peptides that are cleaved off during transport through the membrane of the endoplasmic reticulum. Cathepsin B is one of the well-characterized lysosomal cysteine proteinases. Human cathepsin B was the first lysosomal cysteine proteinase whose crystal structure was elucidated (Mort, 1998).
Cotyledonous legumes present an increasing peptidase activity during germination, corresponding to an atypical cysteine endopeptidase (legumain) with cleavage specificity for asparagine or aspartate residues in the P1 position of the peptide target (Ishii, 1994). Legumain has been shown to have sequence and functional similarity to plant vacuolar processing enzymes (VPEs) and also to hemoglobinase from Schistosoma mansoni (Chappell and Dresden, 1986). Plant VPEs (Hiraiwa et al., 1993) have been proposed to play a role in the degradation of seed storage proteins or, alternatively, in their limited proteolysis, which also occurs during maturation of these proteins (Kembhavi et al., 1993). Thus, it seems that similar enzymes are involved in two opposite processes, i.e. storage protein deposition and mobilization. In addition to its location in the vacuoles of developing seeds, this processing enzyme can be detected in other vegetative organs, such as hypocotyls, roots and mature leaves, suggesting that VPEs could be key enzymes in vacuolar metabolism (Hara-Nishimura et al., 1998).
In spite of their very well characterized physicochemical properties, the role of the cysteine proteases in vivo is not yet completely understood. The complex spatial and temporal regulation of the expression of some cysteine proteases suggests that they can perform diverse functions in cell metabolism. Moreover, in many cases the expression of multiple cysteine proteinases within a single organism is independently regulated (Watanabe et al., 1991; Koehler and Ho, 1990; Linthorst et al., 1993).
As part of the ongoing program to characterize sugarcane Expressed Sequence Tags (ESTs), we have identified sugarcane cysteine proteases, sub-sub-class 3.4.22, and correlated their localization and putative functions. Understanding the evolutionary relationships between cysteine proteases could help identify the function of individual proteases.

Sequence data, alignment and phylogenetic analysis
The cysteine proteinase (E.C.3.4.22 sub-sub-class) amino acid and deduced amino acid sequences were accessed from the SwissProt (SP) data bank (Bairoch, 2000). The official name, E.C. number, organism name and SP data bank accession number of the cysteine proteinases used are shown in Table I. A T-Blast-n search (Altschul et al., 1997) was performed using these bait sequences against the full SUCEST cDNA data bank.
The multiple alignment program (MAP) computes a multiple global alignment of sequences using a pairwise method. Its algorithm for aligning two sequences computes the best overlapping alignment between two sequences without penalizing terminal gaps. In addition, long internal gaps in short sequences are not heavily penalized. The MAP produces a consistent alignment even when some sequences have long terminal or internal gaps, and it is designed in a space-efficient manner, allowing long sequences to be aligned (Huang, 1994). This method was used to align the different cysteine proteinase standards from the E.C.3.4.22 sub-sub-class and also to align the proteins deduced from the SUCEST clusters with other cysteine proteinases from plants, vertebrates and invertebrates; the acronyms and accession numbers of these sequences are shown in Table II.
Phylogenetic analyses were performed using the Molecular Evolutionary Genetics Analysis (MEGA) software, version 2.0 (Kumar et al., 2000). The pair-wise deletion option was adopted for the treatment of amino acid gaps in the sugarcane cysteine protease multiple alignment. Trees were obtained by Neighbor-joining analysis based on the p-distance method. In the phylogenetic tree construction, the confidence levels assigned to the various nodes were determined after 5000 replications using the Interior Branch test (Sitnikova et al., 1995).

276 Correa et al.

Description of SUCEST cDNA libraries
All sugarcane sequences used in this work were obtained from the Brazilian SUCEST project (http://sucest.lad.dcc.unicamp.br/en/) and derived from cDNA libraries specific to different sugarcane tissues, organs or growth conditions.
The libraries are as follows: apical meristem from mature (AM1) and immature (AM2) plants; 1 cm (FL1) and 5 cm (FL3) flower base; 50 cm (FL4), 20 cm (FL5) and 10 cm (FL8) flower stem; lateral buds (LB1 and LB2); large (LR1) and small (LR2) leaf-root insert libraries; etiolated leaves (LV1); grouped data of two non-redundant libraries (NRn); grouped data of three root libraries (RTn); grouped data of three leaf-root transition zone libraries (RZn); stem bark (SB1); grouped data of two seed libraries of different insert sizes (SDn); grouped data of two stem libraries from the first and fourth internodes (STn); libraries derived from calli submitted to a 4-37°C temperature change and three (CL3), four (CL4), six (CL6) and eight (CL7) hours of a light/dark cycle; plants infected with the bacteria Gluconacetobacter diazotrophicus (AD1) and Herbaspirillum rubrisubalbicans (HR1).
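The distance measure named in the phylogenetic methods above (p-distance with pairwise deletion of gapped sites) can be illustrated with a short sketch. This is not the MEGA implementation; the function names and the toy aligned sequences are hypothetical and serve only to show how gapped positions are skipped for each pair rather than removed from the whole alignment.

```python
from itertools import combinations

GAP = "-"

def p_distance(seq_a, seq_b):
    """Proportion of differing residues over the sites where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            continue  # pairwise deletion: the site is ignored for this pair only
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("the two sequences share no ungapped sites")
    return differing / compared

def distance_matrix(alignment):
    """Return {(name_x, name_y): p-distance} for every pair of aligned sequences."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(names, 2)
    }

# Hypothetical toy alignment (not SUCEST data).
toy_alignment = {
    "seq1": "MKCGSCWAF-",
    "seq2": "MKCGACWAFS",
    "seq3": "--CGSCWSFS",
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(toy_alignment).items():
        print(pair, round(dist, 3))
```

A matrix of this kind is the input that a Neighbor-joining procedure would then cluster into a tree; with pairwise deletion, short partial clusters still contribute every site they share with each other sequence.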

Relationships among E.C.3.4.22. cysteine proteinase members
The evolutionary history of protease families may be regarded as the evolution from a single general-purpose ancestral protease to multiple and increasingly specific paralogous enzymes through a process of repeated gene duplication. There is biochemical evidence for this in relation to the papain superfamily in the trichomonads (North, 1991) and for cysteine-dependent proteases in Giardia (Parenti, 1989), two groups of protozoa that were among the earliest diverging eukaryotes (Knoll, 1992). It therefore appears that the papain superfamily originated early during eukaryote evolution, and may, indeed, have arisen before the divergence of prokaryotes and eukaryotes.
The analysis of the phylogenetic tree (Figure 1) clearly shows the clustering of all cysteine proteinase members from clan CA and family C1, except cathepsin B. In fact, this major cluster could be sub-divided into two groups: group I, containing the proteins more closely related to papain, the type member of the C1 family, and group II, formed by cathepsins H, K, L and S, which presents a weak internal branch support of 79%. Cathepsin B could be placed in group II because of its enzymatic similarities with this group, but it corresponds to a member that strongly diverged from the other members of the C1 family. The five other cysteine proteinases included in this analysis form a very heterogeneous cluster composed of two different clans (CA and CD) from five different families (C2, C13, C10, C11 and C14).

Identification of SUCEST cysteine proteinase homologous sequences
All clusters present in the SUCEST data bank were automatically submitted to a general BLAST search against DNA, cDNA and protein data banks worldwide. This type of analysis allowed the identification of correlated sequences with the lowest and most significant e-values. Sometimes, however, these homologous sequences were not yet characterized or their real biological functions still await confirmation, and for this reason we used the reverse approach to identify sugarcane cDNA clones with a real potential for presenting cysteine proteinase activity. In other words, the sequences of 24 well characterized and defined cysteine proteinases (17 of them presented in Figure 1) were used to find homologous SUCEST clusters having the lowest e-values.
The overall analysis, with a cut-off value of e-5, allowed the identification of 76 different clusters separated into 12 distinct groups (Table III). Each group corresponds to one of the standard cysteine proteinases used in the T-Blast-n search. Several clusters were identified by more than one E.C. standard sequence. In Table III, a cluster is placed in a specific E.C. group only when it produced its lowest and most significant e-value with the standard bait sequence of that group. The values may be considered parsimonious because some clusters present partial sequences, as evaluated by the relative size ratio between the number of amino acids deduced from the sequence of the cluster and the number present in the standard E.C. protease (Table I).
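The grouping rule described above (keep hits at or below the cut-off, then place each cluster in the group of the bait that gave its lowest e-value) can be sketched as follows. This is an illustrative sketch, not the procedure actually run on SUCEST: the cut-off is assumed here to mean 1e-5 in the usual BLAST notation, and the cluster identifiers, e-values and residue counts are hypothetical.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-5  # assumption: "e-5" read as 1e-5

def assign_clusters(hits, cutoff=E_VALUE_CUTOFF):
    """Map each cluster to the (bait, e_value) pair with its lowest e-value.

    hits: iterable of (cluster, bait, e_value) tuples, one per search hit.
    Hits with an e-value above the cut-off are ignored.
    """
    best = {}
    for cluster, bait, e_value in hits:
        if e_value > cutoff:
            continue
        if cluster not in best or e_value < best[cluster][1]:
            best[cluster] = (bait, e_value)
    return best

def group_sizes(assignments):
    """Count how many clusters fall into each bait group."""
    return Counter(bait for bait, _ in assignments.values())

def size_ratio(cluster_residues, bait_residues):
    """Relative size of a cluster's deduced protein versus the standard protease."""
    return cluster_residues / bait_residues

# Hypothetical hits (not SUCEST data).
toy_hits = [
    ("cluster_A", "legumain", 1e-40),
    ("cluster_A", "papain", 1e-12),
    ("cluster_B", "papain", 1e-8),
    ("cluster_B", "actinidain", 1e-20),
    ("cluster_C", "calpain", 1e-3),  # above the cut-off, so discarded
]

if __name__ == "__main__":
    assigned = assign_clusters(toy_hits)
    print(assigned)
    print(group_sizes(assigned))
    print(size_ratio(150, 300))  # a partial cluster covering half of the bait
```

A cluster hit by several baits is counted once, under its best bait, which is why the group totals add up to the number of distinct clusters.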
Legumain (22 clusters) and actinidain (15 clusters) were the two most representative groups. Only one homologous cluster was found for calpain and for bromelain, and three clusters were identified as being common to papaya-protease-B, caricain and ananain. The papain group had four clusters, chymopapain eight clusters and cathepsins B, L and H five, seven and four clusters, respectively.
More than 26 clusters, with e-values lower than e-30, were identified by the T-Blast-n algorithm when the search was performed with cathepsin-S (E.C.3.4.22.27) and cathepsin-K (E.C.3.4.22.38) sequences. These data are not presented in Table III because

The initial search and classification of sugarcane proteinases using T-Blast-n and e-values allowed the identification of 16 clusters corresponding to cathepsin-like proteinases. To verify that these clusters were members of the cathepsin B, H and L group, their sequences were aligned (using the MAP algorithm) with 23 other cathepsin sequences. Because some cathepsin clusters were small and did not overlap with the other sequences, these small C- or N-terminal clusters were omitted from the file used to generate the cathepsin phylogenetic tree. The phylogenetic tree obtained (Figure 2) clearly shows the existence of three major groups, corresponding to each of the cathepsin classes (B, H and L), which are 100% supported by the internal branch test. Moreover, when cathepsin sequences from other plants were present, the sugarcane clusters showed a closer and statistically significant relationship with monocotyledon sequences. The close relationship of clusters SC B1, SC B2 and SC B3 may suggest that they were derived from a single ancestral gene now present in at least three active copies in the polyploid genome of sugarcane. The distinctive nature of these three clusters in relation to clusters SC B4 and SC B5 is also observable in the topology of the tree.

*Clusters not used in the phylogenetic analysis, with acronym derived exclusively from the high e-value grouping. Size corresponds to the SUCEST cluster and the standard EC protease amino acid ratio.

Classification and relationship of SUCEST legumain-like proteinases
The sugarcane legumain clusters (Table III) were aligned with 16 plant legumains and eight animal legumains, as was done for the cathepsins. Six clusters were left out of the phylogenetic analysis because of their small size and lack of overlap with the other sequences. Only regions encompassing the three N-terminal conserved domains were used to construct the legumain phylogenetic tree shown in Figure 3, where five major groups can be seen: group I, containing monocotyledons and 13 of the sugarcane clusters; group II, made up of dicotyledons and three sugarcane clusters and therefore containing a mixture of plant groups; group III, containing only dicotyledons; group IV, consisting only of vertebrates; and group V, comprising only invertebrates. The existence of different legumains in the same species, with different protein targets and biological roles, has been described (Okamoto and Minamikawa, 1999). The presence of legumain clusters in groups I and II, together with the heterogeneous pattern of legumains inside group I, suggests that at least some of these clusters (SC Leg02, SC Leg03 and SC Leg05) may have differentiated cellular functions.

Analysis of sugarcane cathepsin and legumain expression pattern
A preliminary analysis of the sugarcane cysteine proteinase expression pattern was made by the direct correlation of the reading frequency of each cluster in the different SUCEST cDNA libraries. Of the 76 cysteine proteinase clusters identified in the SUCEST bank, we focused the analysis on the cathepsin and legumain groups.
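The read-frequency approach described above amounts to counting, for each cluster, how many of its reads come from each cDNA library. A minimal sketch under stated assumptions: each read is represented as a (cluster, library) pair, the library codes are those listed in the library description, and the cluster names and read counts are hypothetical rather than taken from Tables IV or V.

```python
from collections import Counter, defaultdict

def reads_per_library(reads):
    """Tabulate read counts as {cluster: Counter({library: n_reads})}.

    reads: iterable of (cluster, library) pairs, one pair per EST read.
    """
    table = defaultdict(Counter)
    for cluster, library in reads:
        table[cluster][library] += 1
    return table

def total_reads(table):
    """Total number of reads assembled into each cluster."""
    return {cluster: sum(libs.values()) for cluster, libs in table.items()}

def library_specific(table):
    """Clusters whose reads all come from a single library, mapped to that library."""
    return {
        cluster: next(iter(libs))
        for cluster, libs in table.items()
        if len(libs) == 1
    }

# Hypothetical reads (not SUCEST data).
toy_reads = [
    ("cluster_X", "AM1"), ("cluster_X", "LB1"), ("cluster_X", "AM1"),
    ("cluster_Y", "HR1"), ("cluster_Y", "HR1"),
    ("cluster_Z", "STn"),
]

if __name__ == "__main__":
    table = reads_per_library(toy_reads)
    print(total_reads(table))
    print(library_specific(table))
```

Counts of this kind are only a first approximation of expression, since libraries differ in sequencing depth; a cluster flagged as library-specific from one or two reads would still need experimental confirmation.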
Analysis of cathepsin expression (Table IV) revealed that the best-represented cluster was SC B2, corresponding to a cathepsin B, which was present in almost all the libraries except for the callus and etiolated-leaf libraries. In general, cathepsin B and H reads were represented in libraries produced from vegetative organs and developing seeds but were rare in callus and flower libraries. Two cathepsin L clusters, SC L1 and SC L6, were found only in the seed library. Cathepsin H occurred less frequently, with a total of 39 reads, against 79 for cathepsin B and 70 for cathepsin L.
The most frequent and ubiquitous legumain cluster was SC Leg07, which was present in libraries constructed from apical meristem, flowers, lateral buds, etiolated leaves, stem and Herbaspirillum-infected plants. On the other hand, a total of 11 clusters were derived from reads found in one type of library only, this being the case for cluster SC Leg09, which was present only in the library of Gluconacetobacter-inoculated plants, and cluster SC Leg05, which occurred only in the library of Herbaspirillum-infected plants. Other clusters were derived from stem (SC Leg13, SC Leg15 and SC Leg17), root (SC Leg14 and SC Leg20), leaf-root transition zone (SC Leg19), lateral bud (SC Leg01), and lateral root (SC Leg04) libraries. However, more detailed analysis of the expression pattern of these genes will be necessary in order to confirm the tissue-specificity of these clusters/genes. Apart from these differences, the presence of legumain in the vegetative organs of sugarcane agrees with previously published work describing legumain expression in organs such as roots, leaves, flowers and hypocotyls (Kinoshita et al., 1995; Hara-Nishimura et al., 1998; Okamoto and Minamikawa, 1999). Surprisingly, however, clusters sequenced from libraries constructed from germinating/developing seeds were very rare, and no clusters were derived from seeds only.
Some clusters were found in other plant organs as well as seeds, e.g. cluster SC Leg06 (20 cm flower stems, lateral buds and stems), cluster SC Leg03 (1 cm flowers and stems) and cluster SC Leg22 (leaf-root transition zone). No direct correlation was observed between sugarcane sequence homology and the expression pattern of different clusters. Thus clusters SC Leg07 and SC Leg09, very close in the tree (Figure 3), differ significantly in their expression pattern (Table V), whereas clusters SC Leg01 and SC Leg12, which belong to different groups (Figure 3), present a relatively similar expression pattern.
At this stage, speculation about a role for the different sugarcane proteases may be premature because the analysis of the SUCEST database is still producing new data, but it is hoped that the results presented here will contribute to the understanding of the role of cysteine proteases in sugarcane. Much work remains to be done in order to confirm the observed expression patterns and to assess the real biological diversity of each of the cysteine proteinases revealed in this study.

ACKNOWLEDGMENTS
G.C. Corrêa was supported by a CNPq fellowship. The authors wish to thank Dr. G. Domont and Dr. G. Sachetto-Martins for the critical reading of the manuscript and FAPESP for support and development of the SUCEST project.