In silico survey of resistance (R) genes in Eucalyptus transcriptome

A major goal of plant genome research is to recognize genes responsible for important traits. Resistance genes are among the most important gene classes for plant breeding purposes, being responsible for the specific immune response, including pathogen recognition and activation of plant defence mechanisms. These genes are quite abundant in higher plants, with 210 clusters found in the Eucalyptus FOREST database presenting significant homology to known R-genes. All five classes of R-genes with their respective conserved domains are present and expressed in Eucalyptus. Most clusters identified (93) belong to the LRR-NBS-TIR class (genes with three domains: leucine-rich repeat, nucleotide-binding site and Toll interleukin 1-receptor), followed by the serine-threonine kinase class (49 clusters). Some new combinations of domains and motifs of R-genes may be present in Eucalyptus and could represent novel gene structures. Most alignments occurred with dicots (94.3%), with emphasis on Arabidopsis thaliana (Brassicaceae) sequences. All best alignments with monocots (5.2%) occurred with rice (Oryza sativa) sequences, and a single cluster aligned with the gymnosperm Pinus sylvestris (0.5%). The results are discussed and compared with available data from other crops and may provide useful evidence for understanding defense mechanisms in Eucalyptus and other crop species.


Introduction
Pathogen attack can severely affect crop production, with losses that can reach 80% of production, especially in tropical countries. At the global level, losses have been estimated at around 12% of world crop production (James et al., 1990). The most important group of genes used by breeders for disease control is the plant resistance (R) genes: single determinants of an effective and specific resistance that can often be characterized by localized necrosis at attempted infection sites (Rommens and Kishore, 2000).
It is proposed that pathosystems are usually highly specific, with a matching R-gene in the plant cell that recognizes elicitor proteins (called Avr-effectors) of each infective pathogen. The plant will be resistant and the growth of the pathogen will be arrested only when both genes, R and Avr, are present (Ellis et al., 2000a). So, for each R-gene a corresponding Avr gene co-exists: this is the basis of the gene-for-gene concept, suggested by Flor (1956, 1971). The avirulence gene products described so far do not comprise a defined family of related proteins, since no shared motifs or domains could be found. In contrast, R-gene products are separated into distinct but related protein classes, according to their conserved structural domains. The conserved domain functions identified for R proteins suggest two fundamental mechanisms during pathogenic infection: (I) pathogen recognition, conducted mainly by leucine-rich repeat (LRR) regions, which play a direct role in the specific protein-protein recognition event; and (II) signaling of pathogen presence in order to activate defense-related genes (Richter and Ronald, 2000).
The TIR (Toll interleukin 1-receptor) and CC (coiled coil) regions are involved in signal transduction during many cell processes (Martin et al., 2003), while the NBS (Nucleotide Binding Site) usually signals for programmed cell death in animal cells (van der Biezen and Jones, 1998). Additionally, a kinase catalytic region is present in some R-genes. This domain plays a direct role in both signaling processes and pathogen effectors. The NBS region contains not only the three motifs involved in nucleotide binding but additional motifs as well. This extended region of homology is referred to as the NB-ARC domain (Richter and Ronald, 2000). Sometimes this domain contains a distinct predicted nucleoside triphosphatase (NTPase) domain known as NACHT, common in animal, fungal and bacterial proteins and implicated in apoptosis induction and transcription activation (Koonin and Avarind, 2000).
Resistance genes are members of a very large multigene family, are highly polymorphic and have diverse recognition specificities. They are commonly clustered in the genome, often in tandem direct repeats, which is consistent with the theory that they originated through gene duplication and that they are continuously evolving through unequal exchange (Song et al., 1997).
Most of the resistance genes that have been cloned and characterized resemble components involved in signal transduction. These can be classified into five categories based on their predicted protein structure (Song et al., 1997; Ellis and Jones, 1998).
The first class is represented by the Pto gene of tomato, which encodes a protein with a catalytic serine-threonine kinase (ser-thre-kinase) and a myristoylation motif in its amino-terminal region (Martin et al., 1993).
The third class includes proteins similar to those described for class II, presenting a Toll receptor for interleukin-1 (IL-1R) instead of a CC sequence at the amino-terminal region (Meyers et al., 1999). This class is referred to as TIR-NBS-LRR, including the genes L (Lawrence et al., 1995) and P (Dodds et al., 2001) of flax; RPP1 (Botela et al., 1998), RPP4 (van der Biezen et al., 2002), RPP5 (Parker et al., 1997) and RPS4 (Gassmann et al., 1999) of A. thaliana; and N (Whithan et al., 1996) of tobacco. This class (also present in animals) is supposed to be absent in monocotyledonous plants (Ellis and Jones, 1998), being present in all dicotyledonous taxa studied so far.
The proteins encoded by the three classes of genes previously cited do not present a transmembrane sequence and are therefore classified as intracellular R-proteins (Martin et al., 2003).
The fifth class includes a single gene, Xa21 from rice, which presents an extracellular LRR, a transmembrane region (TM) and a cytoplasmic ser-thre-kinase. Thus, the structure of Xa21 indicates an evolutionary link between different classes of plant disease resistance genes (Song et al., 1997).
There is still a sixth class, comprising genes without the conserved domains described for the previous five classes. This group comprises the gene Hm1 from maize, a reductase that confers resistance to the fungus Cochliobolus carbonum (Johal and Briggs, 1992); Mlo from barley, a putative regulator of defense against Blumeria graminis (Piffanelli et al., 2002) possibly associated with the plasma membrane (Buschges et al., 1997); and RPW8 from A. thaliana, which confers non-specific resistance to the fungus Erysiphe cichoracearum (Xiao et al., 2001).
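As a rough illustration, the domain-based classes can be written as a lookup from a set of conserved domains to a class label. This is only a sketch: the dictionary and function names are hypothetical, and the domain sets for classes II (CC-NBS-LRR) and IV (LRR-TM, Cf genes) are assumed from the class labels used elsewhere in this paper.

```python
# Hypothetical lookup of R-gene classes by conserved-domain combination.
# Domain sets for classes II and IV are assumptions based on the class
# labels (NBS-LRR with CC, LRR-TM) used elsewhere in the paper.
CLASS_BY_DOMAINS = {
    frozenset({"KINASE"}): "I (ser-thr-kinase, e.g. Pto)",
    frozenset({"CC", "NBS", "LRR"}): "II (CC-NBS-LRR)",
    frozenset({"TIR", "NBS", "LRR"}): "III (TIR-NBS-LRR)",
    frozenset({"LRR", "TM"}): "IV (LRR-TM, e.g. Cf)",
    frozenset({"LRR", "TM", "KINASE"}): "V (LRR-TM-kinase, e.g. Xa21)",
}


def classify(domains):
    """Return the class label for an iterable of domain names.

    Domain order and letter case are ignored; any combination not listed
    in CLASS_BY_DOMAINS is reported as 'unclassified'.
    """
    key = frozenset(d.upper() for d in domains)
    return CLASS_BY_DOMAINS.get(key, "unclassified")


if __name__ == "__main__":
    print(classify(["LRR", "TM", "kinase"]))
    print(classify(["NACHT"]))
```

An exact-match lookup such as this cannot place partial sequences (for example an EST covering only the LRR region), which is one reason a cluster can appear to match more than one class.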
Owing to qualities such as high adaptability, fast growth and wood quality, Eucalyptus is planted in tropical areas on several continents. Eucalyptus is the tree most widely used to supply raw material for the paper industry (cellulose production) and to regenerate degraded areas. Over the past 50 years, large-scale planting of fast growing exotic E. grandis, E. urophylla, E. saligna and many hybrids (particularly grandis x urophylla) has occurred in Brazil, aiming to reforest some regions and to create an adequate supply of wood, timber and fuel for different purposes (McNabb, 2002). In the late 2001s growing areas reached 138.132 ha, generating more than 7,398 direct jobs (BRACELPA, 2004).
The advance of plantations into hot and humid areas resulted in favourable conditions for the development of diseases, especially in young individuals, which are often severely attacked by fungal (e.g. Mycosphaerella cryptica, Dichomera versiformis, Cylindrocladium spp. and Phaeophleospora epicoccoides) and bacterial pathogens (Barber et al., 2003; Mafia and Alfenas, 2003).
The Eucalyptus Genome Sequencing Consortium (FOREST) aimed to identify over 15,000 expressed genes from 100,000 sequenced ESTs from 19 libraries from specific tissues and stages.
The present work aimed to perform a data mining-based identification of plant disease R-genes in FOREST database, by using well known R-genes sequences as template, comparing the identified sequences with known R-genes deposited in public DNA and protein databases.

Materials and Methods
Amino-acid sequences of known genes were used as queries in the search for R-gene homologues and analogs in the Eucalyptus transcriptome database. The sequences used are shown in Table 1, together with their features and their accession numbers at NCBI (National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov). They are grouped according to the conserved domains previously described. Members of the sixth class (reductases and other R-genes with no recognizable conserved domains) were not included in the present evaluation.
All Eucalyptus sequences used during this work were obtained from FOREST project and derived from cDNA libraries specific to different tissues, organs or conditions of growth from the species E. grandis, E. globulus, E. saligna and E. urophylla.For detailed information see https://forests.esalq.usp.br/Librariesinfo.html.
Reverse alignments were performed against the 'FOREST EG_Clusters' database using the program TBLASTN (Altschul et al., 1990), with an e-value cutoff of 1e-23. Clusters matching the query sequences were then annotated in a local database called 'non-redundant', built with the Microsoft Access ® program. The cluster name was adopted as primary key in order to prevent data redundancy for clusters aligning with more than one query sequence. In the few cases when this occurred, the names of both queries were also annotated for the respective cluster.
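The de-duplication step can be sketched in a few lines of Python, using a dictionary keyed by cluster name in place of the Access table. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `(cluster, query, e-value)` tuple layout and the inclusive treatment of the cutoff are all assumed.

```python
E_VALUE_CUTOFF = 1e-23  # cutoff stated in the text


def build_non_redundant(hits, cutoff=E_VALUE_CUTOFF):
    """Collapse TBLASTN hits into one record per cluster.

    `hits` is an iterable of (cluster_name, query_name, e_value) tuples
    (a hypothetical layout). The cluster name acts as the primary key, and
    every query that hit the cluster at or below the cutoff is recorded once.
    """
    table = {}
    for cluster, query, e_value in hits:
        if e_value > cutoff:
            continue  # alignment not significant enough
        queries = table.setdefault(cluster, [])
        if query not in queries:
            queries.append(query)
    return table


if __name__ == "__main__":
    # Made-up cluster names and e-values, for illustration only.
    example_hits = [
        ("CLUSTER_A", "Pto", 1e-40),
        ("CLUSTER_A", "Cf", 1e-30),
        ("CLUSTER_B", "Xa21", 1e-10),   # fails the cutoff
        ("CLUSTER_A", "Pto", 1e-25),    # duplicate query for the same cluster
    ]
    print(build_non_redundant(example_hits))
```

Clusters whose record lists queries from more than one R-gene class correspond to the mixed (MIX) groups.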
The reading frame of each cluster in the TBLASTN alignment was used to predict the Open Reading Frame (ORF) of that cluster. For this purpose, the Expasy Translate Tool (bo.expasy.org/tools/dna.html) was used, which translates a DNA sequence in the chosen frame into the corresponding amino acid sequence in FASTA format. The obtained ORFs were subsequently submitted to a Reverse Position Specific BLAST (RPS-BLAST) search against the Conserved Domain Database (Marchler-Bauer et al., 2002).
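Frame-directed translation of this kind can be illustrated with the standard genetic code. The sketch below is not the Expasy tool itself: the function names are hypothetical, frames are assumed to be numbered +1 to +3 (forward) and -1 to -3 (reverse complement) as in BLAST output, and codons containing ambiguous bases are written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, frame):
    """Translate `dna` in one of the six frames (+1..+3, -1..-3)."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1..+3 or -1..-3")
    seq = dna.upper()
    if frame < 0:
        seq = reverse_complement(seq)
    start = abs(frame) - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(protein):
    """Return the longest stretch between stop codons ('*')."""
    return max(protein.split("*"), key=len)


if __name__ == "__main__":
    print(translate_frame("ATGGCCTAA", 1))   # forward strand, frame +1
    print(translate_frame("TTAGGCCAT", -1))  # same ORF on the reverse strand
```

Taking the longest stop-free stretch is only one possible convention for ESTs, which often lack a start codon.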

Results
After the TBLASTN alignments performed against the FOREST EG_Clusters database, a total of 478 clusters aligned with the diverse R-genes (Table 1) used as queries (data not shown). These clusters were, as described in the section 'Materials and Methods', inserted into a local database called 'non-redundant'. This procedure generated a set of 210 non-redundant clusters, which were annotated for one or more R-genes (data summarized in Figure 1 and Tables 2 and 3).
Regarding the sequence identity of the best alignment, 22 clusters showed equally significant similarity to two different classes of R-genes. Of these, 18 included LRR plus LRR-Kinase, here called MIX I (sequence data presented in Table 3); three included NBS-LRR plus TIR-NBS-LRR (called MIX II) and one LRR plus Kinase (called MIX III).
Sizes of Eucalyptus clusters aligned to R-genes varied from 3,316 (cluster EGEQRT3301C03, classified in group MIX-III) to 520 nucleotides. The prediction of the clusters' coding regions revealed that ORFs were coded in both forward and reverse reading frames, with an average length of 304 amino acids (aa). ORF sizes varied from 990 (cluster EGEQRT3301C03 of the LRR-KINASE class) to 134 aa. Regarding the average ORF length in each R-gene class, we observed 417 aa for KINASE, 276 aa for NBS, 238 aa for TIR-NBS-LRR, 247 aa for LRR-TM, 352 aa for LRR-KINASE, 372 aa for MIX I, 343 aa for MIX II and 990 aa for the MIX III class.
The search for conserved domains (CD-Search) revealed conserved regions (Figure 1, Table 1) in 166 of the 210 clusters analyzed here. A total of 40 clusters presented the kinase domain; 37 of them matched the Pto gene (class I) after the TBLASTN alignment, with only three grouping into the KINASE-LRR (two of them) and MIX III (one of them) classes. These two classes also showed associated LRR segments. LRR domains could be identified in 67 different clusters in all classes (except KINASE class I, represented by Pto), with a total of 442 occurrences. This number is higher than the number of clusters because LRRs occur in tandem repetitions. Sometimes these sequences are imperfect and may be difficult to recognize with available in silico tools, so it is possible that a larger number may be identified manually.
Twenty clusters showed the NB-ARC domain. In a specific case, this domain occurred in association with a different TIR domain, as cited above. Additionally, a NACHT domain (closely related to NB-ARC) was identified exclu-Barbosa-da-Silva et al.
Most of the 44 clusters with no conserved domains presented shorter ORFs (262 aa on average), with four of them presenting a putative transmembrane region.
A graphic representation of the distribution of conserved domains as compared with class-grouped clusters is presented in Figure 2.
Considering the best matches to the 210 clusters identified, 198 were from plants of dicotyledonous families, with emphasis on A. thaliana. From monocots, only rice (O. sativa) sequences appeared as best matches (11 clusters). One of the sequences from the MIX III group aligned with Pinus sylvestris (Gymnosperm), the only non-Angiosperm included in the present study. A comprehensive inventory of all species that aligned with Eucalyptus with their taxonomic affiliation and habit (herbaceous or woody) is presented in predictions (Figure 3). The reliability class (RC), which is a confidence measure for the prediction, showed that only 11 sequences were assigned to RC1 (higher than 80%), and 53 to RC2 (higher than 60%). Most of the sequences are predicted to have an unspecific subcellular localization (133 sequences), while 35, 20 and 19 were predicted to contain mitochondrial targeting, signal and chloroplast transit peptides, respectively (Figure 3).
After evaluation with the TargetP program, sequences with motifs specific for transmembrane anchoring could be identified in 44 of all analyzed sequences. Of these, 19 belonged to LRR or LRR-KINASE-related sequences and, unexpectedly, five proved to be TIR-NBS-LRR and 20 to be KINASE-related sequences.
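The percentages quoted in the abstract follow directly from these best-match counts. A short check (the function and variable names are illustrative only):

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


# Best-match counts reported in the text.
BEST_MATCHES = {"dicots": 198, "monocots (rice)": 11, "gymnosperm (Pinus)": 1}
TOTAL_CLUSTERS = sum(BEST_MATCHES.values())  # 210 non-redundant clusters

if __name__ == "__main__":
    for group, count in BEST_MATCHES.items():
        print(f"{group}: {share(count, TOTAL_CLUSTERS)}%")
```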

Discussion
The reverse alignment (TBLASTN) strategy (Altschul et al., 1997) adopted by our group identified a set of 210 clusters similar to the major classes of disease R-genes in the current version of the FOREST database, which comprises 0.63% of the clusters generated so far. This approach allowed the identification of a large set of candidate sequences by using various representative genes per class, while some recent works employed few genes (Koczyk and Chelkowski, 2003). Using several previously described and sequenced R-genes as templates was a useful and time-saving strategy in the search for R-gene candidates in plants. In this approach it was expected that similar genes grouped in the same class would cause some level of redundancy (Meyers et al., 1999). The strategy of generating a local database (called non-redundant) by adopting the cluster number as primary key was very effective in solving this problem. Additionally, this approach was useful in identifying the respective R-gene class for each Eucalyptus cluster. The number of R-genes identified here is quite high, especially considering that none of the 19 libraries was obtained under pathogen stress. On the other hand, when additional ESTs are generated, especially under pathogen infection, many of the identified clusters may be united into larger clusters of R-genes that may include more domains.
Evidence has shown that R-genes are quite abundant in higher plants, but most functionally defined R-genes belong to the LRR-NBS supergene family. After completion of the whole genome sequence of the model plant A. thaliana, a total of 85 TIR-NBS-LRR genes were identified (The Arabidopsis Genome Initiative, 2000), fewer than the number of clusters (93) identified so far in Eucalyptus. Genes containing NBS-LRR domains in particular were estimated to number ca. 166 in A. thaliana and ca. 600 in rice (O. sativa) by Richly et al. (2002), but the latter number is still not confirmed. A recent work reevaluated and reannotated all NBS-LRR encoding genes in the A. thaliana genome database, revealing 149 genes of this class (including 94 TIR and 55 non-TIR sequences) (Meyers et al., 2003). In our evaluation of the FOREST database we found 114 clusters (93 and 21, respectively) of this class. It is interesting to note that in the evaluation of Meyers et al. (2003) the presence of the TIR or CC motif was not the only determinant for the grouping of the two distinct classes. The NBS-LRR domains also co-evolved and were determinant in the divergent evolution of the two groups, with the CC-bearing sequences forming four subgroups and the TIR-bearing sequences forming eight subgroups, regarding the size, composition and order of introns and exons. Pan et al. (2000) compared tomato and Arabidopsis sequences of this class by systematically amplifying the tomato genome using a variety of primer pairs based on ubiquitous NBS motifs, generating 70 sequences, of which 10% were putative pseudogenes. The sequences were also used in mapping approaches, revealing clustering of R-gene homologues between tomato and potato (Solanum tuberosum, also from the Solanaceae family). Clustering of R-genes was also detected in A. thaliana, with most of the genes located on chromosomes 1 (49) and 5 (55), confirming the initial hypothesis that these genes are clustered on few chromosomes (The Arabidopsis Genome Initiative, 2000). This was also observed in other crops, such as chickpea (Cicer arietinum; Benko-Iseppon et al., 2003), in this case with some synteny and colinearity between this species and Arabidopsis. The clustering of R-genes on specific chromosomes and the existence of conserved domains have allowed the establishment of interesting strategies for identification, mapping and breeding directed at the incorporation of such genes from wild relatives. Considering the number of genes from this group in the latter species, it is to be expected that they are also clustered in Eucalyptus, which can also be valuable for the establishment of Eucalyptus breeding strategies in the future, especially considering the existence of mapping populations for this crop.
Overall annotation revealed that Arabidopsis also carries homologues of other R-gene classes, including 174 genes encoding LRR-kinases (Xa21 group), many of which are likely to play a role in development rather than defense (Jones, 2001). The present work revealed only eight clusters with significant homology to Xa21, but this number may increase if only the kinase sequence is used as template, since the LRR may be quite variable between rice and Eucalyptus. Only exceptional R-genes have proven to provide durable disease control, because fast evolving pathogen genomes break resistance. The Xa21 gene is an important exception to this rule that reveals the full potential of R-genes for breeding purposes (Rommens and Kishore, 2000). This may be very valuable, especially considering the possibility of pyramiding such genes in important crops, increasing the potential for an effective specific R-Avr interaction.
Another abundant family of R-genes in plants is the ser-thr-kinase with about 50 genes in Arabidopsis encoding protein kinases that are strongly homologous to tomato's Pto gene (Jones, 2001).In Eucalyptus we found almost the same number (49) of clusters also with high homology to the Pto sequence.
Regarding R-gene classes identified in Eucalyptus, an interesting phenomenon was observed in the present work: R-genes pertaining to different classes were able to align significantly to the same cluster in the Eucalyptus database. This can be explained by the evidence that known R-genes combine a limited number of related functional domains (Ellis et al., 1999, 2000a). Similar motifs would thus be present in different R-genes, and it is possible that a gene resembling one class may retrieve another belonging to a different class through local similarity at the site of the conserved motif. In practice, however, previous works do not consider this possibility, since the genes identified for specific R-genes are directly assigned to their own class, as shown by previously reported works (Ronald, 1997; Jones, 2001; Romeis, 2001).
The MIX class one (MIX I) included 18 clusters resembling genes that belong to both the LRR and LRR-KINASE classes. These clusters were retrieved basically by using Cf (Jones et al., 1998) and Xa21 (Song et al., 1995) amino acid sequences as queries. In this case, the most plausible explanation would be that the LRR domain, common to both classes, is responsible for the alignment and grouping of some clusters in both classes. On the other hand, LRRs are referred to as fast evolving sequences and are in some cases quite imperfect, making manual annotation necessary. Often their amino-acid sequences are quite specific to their gene group (Dixon et al., 1998; Ellis et al., 1999). For example, using the LRR of Xa21 against the GenBank database will reveal significant alignments only to Xa21 genes of rice (and some other Poaceae) and less significantly to Arabidopsis, but no sequence from other gene classes aligns significantly. A similar approach to the present work was used for the analysis of the SUCEST (Sugarcane EST project, also running in Brazil) database (Morais, 2003) with no similar results. Song et al. (1997) suggested that the structure of Xa21 (here referred to as class V) itself indicates an evolutionary link between different classes (I and IV) of plant disease resistance genes. Could this be the case for this group of clusters, which presents a new link between two classes and may represent a new gene for Angiosperms?
Another surprising result was obtained by analyzing the single cluster with both LRR and KINASE domains. It would be expected to find both domains in genes resembling Xa21, but this cluster (EGEQRT3301C03.g) proved similar to both Pto (class I, described by Martin et al., 1993) and Cf (class IV, described by Jones et al., 1994) genes. This double similarity occurred on different motifs. The Pto gene is known to encode a ser-thre-kinase protein (Martin et al., 1993) and it was at this motif that the cluster showed similarity to this gene. On the other hand, Cf genes encode extracellular LRRs and it was at the LRR motif that the similarity was found. This cluster could be grouped in the LRR-KINASE class. So, why did it not align with Xa21, the single known gene with both LRR and KINASE domains? The answer should come from analyzing the KINASE-related clusters. Despite the conservation of this region (Romeis, 2001), none of the Pto (KINASE) or Xa21 (LRR and a receptor-KINASE) related clusters were mixed (aligned together) during the annotation process. This shows that the kinase segment is less redundant than LRR, at least during our in silico gene prediction: although the kinase CD is present in both Pto and Xa21 genes, it did not cause their matching clusters to be combined into a mixed class.
The last case of mixture occurred in MIX class II, including the motif TIR-NBS-LRR. Two of the three clusters pertaining to this mixed class (EGEQST6001H02.g and EGJECL1208G03.g) were retrieved from the FOREST database by the genes RPP5 (TIR-NBS-LRR; Parker et al., 1997) and RPS5 (NBS-LRR; Noel et al., 1999). The third cluster (EGEZRT3006B12.g) was obtained through a search using RPP5 and RPS4 (both TIR-NBS-LRR; Gassmann et al., 1999) and I2 (NBS-LRR; Simmons et al., 1998) queries. We initially supposed that the redundancy was due to the presence of the NB-ARC (NBS) conserved motif. However, the first two clusters did not show any motif after in silico CD-search and, again, the region that apparently caused the mixture of the classes was the LRR motif, since it was predicted in cluster EGEZRT3006B12.g.
In view of the results discussed above, could we speculate that Eucalyptus bears some new classes of R-genes? Before drawing further conclusions, and in order to answer the questions raised by the present work, we intend to evaluate these groups of clusters with regard to their domain and interdomain structure and organization, evaluating also the clusterization process.
The conserved domains (CDs) identified during our investigation showed that most of the predicted Eucalyptus sequences possess the same motifs shared by disease R-genes. The CD with the highest level of sampling was LRR, which was present in all classes (except KINASE class I, represented by Pto) with a total of 442 occurrences. The other frequent domain shared by R-genes, NB-ARC, was observed in 27 sequences, notably in TIR-NBS-LRR and NBS-LRR predicted clusters. This motif is commonly found in such sequences, and it is proposed that NB-ARC plays a role in the activation of downstream effectors (van der Biezen and Jones, 1998), given its sequence similarity to the mammalian CED-4 and APAF-1 proteins, which are involved in apoptosis (Chinnaiyan et al., 1997). In plants the TIR motif is found only associated with NBS regions of dicotyledons, being possibly absent in monocotyledons (Meyers et al., 1999). In Eucalyptus (a eudicot genus of the Myrtaceae family) TIR domains were quite abundant, as expected, being found in 39 clusters (all from the TIR-NBS-LRR class).
Another very common motif present in two classes of disease R-genes is the kinase domain. This motif is shared by the Pto (ser-thre-kinase) and Xa21 (receptor-kinase) genes, members of the KINASE and LRR-KINASE classes, respectively. All kinase domains found were associated with the classes KINASE, LRR-KINASE and MIX III. As commented here, despite its conservation, this domain generally does not cause redundancy in database searches.
Transmembrane motifs were found in only 44 of all analyzed sequences. Of these, five were, unexpectedly, found in TIR-NBS-LRR-related sequences (a group of R-genes that acts at the intracellular level) and 20 in KINASE-related sequences, while the remaining 19 were, as expected, in LRR or LRR-KINASE-related sequences.
Information regarding the localization of disease resistance proteins in plant cells is still scarce (Martin, 1999). Spatial organization is usually variable among distinct gene classes and tissues affected, and there is no strong evidence in favor of a conserved correspondence between the spatial occurrence of R and Avr products (Bonas and Lahaye, 2002). However, immunocytochemistry approaches allowed the subcellular localization of some Avr and R components (Boyes et al., 1998). Here, we adopted an in silico approach which uses neural network-based methods to predict the topology (i.e. localization) of protein sequences of the selected clusters. In spite of the large number of predictions obtained, only 11 sequences were assigned to RC1 (reliability class 1, ≥ 80%), and 53 to RC2 (≥ 60%). Considering these significant predictions, the neural network was able to predict the localization of only a small number of proteins (29.62%) compared to the total sample of Eucalyptus R-genes. This percentage is much lower than the 80% obtained for plant test sets by Emanuelsson et al. (2000) with the same approach. It is important to note that these predictions are based on the N-terminal information available for the sequences. Thus, this low number of predictions can be explained by the fact that the FOREST database was obtained from expressed sequence tags, an approach that usually does not include the N-termini of many of the ESTs generated.
Our Eucalyptus transcriptome cDNA sequence analysis revealed that there are 210 clusters with significant alignment to major classes of plant disease R-genes. Differently from other genomic efforts, such as that for O. sativa (Goff et al., 2002), we used a redundant set of well described R-genes to screen for RGAs (Resistance Gene Analogs) in the FOREST database. This proved to be a very sensitive approach, since best matches in NCBI sometimes present annotation mistakes, and we also observed during the present work that some of the best GenBank matches to Eucalyptus R-clusters presented no conclusive description of function. This was also the case for the first annotation of Arabidopsis genome sequences, as pointed out by Meyers et al. (2003). After reannotation of NBS-LRR sequences, a total of 56 of the A. thaliana R-genes had to be corrected from earlier evaluations in GenBank (Meyers et al., 2003). These results show how important procedures such as annotation and detailed evaluation of generated sequences are. This evidence invites reflection on the strategic design of many genome and transcriptome projects, considering that data mining is not expensive (normally only fellowships are needed) but still receives little investment from financing agencies, diminishing the final impact of the results.
The comparison of our results regarding the number (and maybe the organization) of identified Eucalyptus clusters was made mainly with A. thaliana, especially due to the lack of open databases for other plant species with EST projects. Many differences in the R-related sequences analyzed here can be explained by diverse arguments: (i) the larger genome of Eucalyptus (e.g. E. grandis with 640 Mbp; Myburg et al., 2003) in contrast with the small and "compact" genome of A. thaliana (120 Mbp); (ii) the distant taxonomic position: both are dicots, but from distantly related families (Brassicaceae and Myrtaceae); and finally (iii) the different levels of complexity: Eucalyptus is a woody perennial plant species and Arabidopsis is an annual herb. Herbaceous species are often regarded as faster evolving than woody species considering different morphological and genetic aspects (Bennet, 1972; Enrendorfer, 1982; Morawetz, 1984, 1986; Bennet and Leitch, 1995, 2000).
Considering this evidence, we observed that most of the information regarding R-genes available in databases refers to herbaceous (not woody) crop plants (few wild plants), maybe because most identified and sequenced R-genes resulted from mapping approaches, which are very time consuming in woody plants and difficult to carry out in open pollinated species. The larger number of sequences from A. thaliana representing best alignments to Eucalyptus does not indicate a higher similarity to this plant species; rather, it reflects the large number of sequences of this model plant deposited in GenBank. In our evaluation, only 23 woody species appeared as best matches for the clusters studied, including 22 species from different dicotyledonous families and one Gymnosperm species (Pinus sylvestris). This may explain some of the surprising results obtained in the present work and suggests that identification of R-genes in a larger number of taxonomic groups may be a very promising approach to understanding the natural evolution of these sequences when not affected by the influence of man. Regarding the current knowledge of R-gene structure and diversity, some authors have suggested that this gene class evolves faster than other genes (Ellis et al., 2000b), which should be evaluated in a larger number of taxonomic entities, including wild species and also primitive taxa.

Concluding Remarks
Using bioinformatic tools it was possible to identify, classify and verify the R-genes sequenced so far in the Eucalyptus transcriptome. No previous sequences of this type could be found in protein or nucleotide databases for this crop. The identified sequences will be valuable resources for the development of markers for molecular breeding and identification of RGAs (resistance gene analogs) in Eucalyptus and other related species. The identified clusters also constitute excellent probes for physical mapping of genes in this species, giving support to genetic mapping programs and synteny studies. Considering the size of some clusters, they may also be used for fluorescent in situ hybridization (FISH) on Eucalyptus chromosomes, helping also in the comparison of different parental species and their respective hybrids.
The present work on Eucalyptus, based on the FOREST database, sheds some light on the R-gene group in this important crop species and on the resistance response in higher plants, leading to the following conclusions:
• All five classes of R-genes, with their respective conserved domains, are present and expressed in Eucalyptus.
• Some new combinations of R-gene domains and motifs may be present in Eucalyptus and could represent novel R-gene structures, which should be analyzed in detail.
• Despite the lack of libraries from tissues elicited by pathogens, a high number of R-genes was found in different libraries of the FOREST project. This may suggest that the identified clusters are expressed constitutively, but it also leads to the supposition that a higher number of R-genes may be expressed in Eucalyptus under other experimental conditions.
Besides the detailed analysis of different groups of genes and domains, we intend to evaluate the expression of the selected clusters in the different libraries of the project. Furthermore, additional efforts may be necessary to complete some R-gene sequences, especially considering that their size varies between 321 amino acids (in the case of Pto) and 1802 amino acids (in the case of the Xa1 gene) and that many identified sequences possibly present incomplete domains.
Further in silico, in vitro and in vivo evaluations of the Eucalyptus genome may be a very promising approach. Manipulating the expression of these genes in economically important woody plant species to improve disease resistance is necessary. Despite the challenge that this task may represent, some reports indicate that this strategy is feasible.
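The size comparison mentioned above (321 amino acids for Pto, 1802 for Xa1) suggests a simple way to flag cluster ORFs that may be incomplete. The following is only a minimal sketch of that idea, not the procedure used in this work: the 0.8 coverage cutoff, the function names and the example ORF lengths are hypothetical, and only the two reference lengths come from the text.

```python
# Reference protein lengths (amino acids) quoted in the text.
REFERENCE_LENGTH_AA = {"Pto": 321, "Xa1": 1802}

# Hypothetical cutoff: ORFs covering less than this fraction of the
# reference protein are flagged as possibly incomplete.
MIN_COVERAGE = 0.8


def orf_coverage(orf_length_aa: int, reference_gene: str) -> float:
    """Return the cluster ORF length as a fraction of the reference length."""
    return orf_length_aa / REFERENCE_LENGTH_AA[reference_gene]


def is_possibly_incomplete(orf_length_aa: int, reference_gene: str) -> bool:
    """Flag an ORF whose length falls below the assumed coverage cutoff."""
    return orf_coverage(orf_length_aa, reference_gene) < MIN_COVERAGE


if __name__ == "__main__":
    # Hypothetical cluster names and ORF lengths, for illustration only.
    examples = [("cluster_A", "Pto", 300), ("cluster_B", "Xa1", 450)]
    for name, gene, length in examples:
        coverage = orf_coverage(length, gene)
        flag = "possibly incomplete" if is_possibly_incomplete(length, gene) else "near full length"
        print(f"{name} vs {gene}: coverage {coverage:.2f} ({flag})")
```

Length alone is a coarse criterion; a cluster of adequate length may still lack a conserved domain, so a check of this kind would only complement the domain searches.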

Figure 1 - Representation of the main R-gene classes, considering the presence and position of conserved domains from literature data, as compared with Eucalyptus clusters from the FOREST database. For each class, data on significant alignments to R-genes are given, including the following information: number of clusters identified for each class (clusters aligning with more than one class are not included), number and percentage of clusters per class bearing the indicated conserved domains, and size range (maximum and minimum) of the sequence in nucleotides (n) and of the ORF in amino acids (aa). Abbreviation: CD = conserved domains.

Figure 2 - Graphic representation of the distribution of conserved domains across class-grouped clusters. Values at the base after each domain indicate the number of clusters of each class presenting the indicated domain (also represented in the corresponding columns). Abbreviations: LRR = leucine-rich repeats; NB-ARC = nucleotide-binding site and additional motifs; NACHT = NB-ARC-related domain, including an NTPase implicated in apoptosis and MHC transcription activation.

Table 1 - Classification and features of R-genes used as queries against the FOREST database. The genes used are grouped into five R-gene classes (I: LRR+Kinase) with the respective NCBI accession number, source species, gene name and domain range (in amino acids).

Table 4 .
The post-translational inferences carried out for cluster products (TargetP program) revealed a large number of

Table 2 - BLAST results and sequence evaluation of Eucalyptus R-genes, including the best matches of each R-gene and MIX class: (I) data about the query: gene class and name, NCBI GI number, species and family; (II) features and evaluation results of Eucalyptus clusters related to R-genes: cluster number, cluster size in nucleotides (n), ORF (open reading frame) size in amino acids (aa), e-value, score and frame.

Table 3 - FOREST clusters classified in the MIX I group, resembling genes that belong to the LRR and LRR-KINASE classes, including: respective templates (query sequences), cluster number and size in nucleotides (n), ORF size in amino acids (aa), range of the LRR domain after CD-search, identity and results of the best alignment (BLASTp) at NCBI (GI number, species, score and e-value).

Table 4 - Inventory of the organisms that appeared as the best alignment to each of the 210 Eucalyptus clusters identified here as related to known resistance genes. The organisms are grouped by gene class (I to V and MIX I to III), taxonomic affiliation (class, subclass, family and species) and habit (herbaceous or woody). Numbers in parentheses indicate the number of gene members in each taxonomic group or species.