Patterns of molecular evolution in pathogenesis-related proteins

The genes encoding 13 classes of pathogenesis-related (PR) proteins were examined for positive selection using maximum-likelihood (ML) models of codon substitution. The study involved 194 sequences from 54 species belonging to 37 genera. Although the sizes of the sequences examined varied from 237 bp for PR12 to 1,110 bp for PR7, most classes (9 out of 13) contained sequences made up of more than 400 nucleotides. Signs of positive selection were obtained for sites in PR proteins 4, 6, 8, 9 and 15 using an ML-based Bayesian method and likelihood ratio tests. These results confirm the importance of positive selection in proteins related to defense mechanisms already observed in a wide array of organisms.


Introduction
Pathogenesis-related (PR) proteins are coded by host plants as a response to pathological or related situations, and normally accumulate not only locally in the place of infection, but are formed systemically following infection by bacteria, fungi or viruses, or after induction by abiotic stress factors.Such proteins have a wide array of functions in that they can be hydrolases, transcription factors, protease inhibitors, enzymes associated with various metabolic pathways and allergenic products.The functional motifs of PR proteins are related to a number of eukaryotic intra and intercellular proteins involved in very distinct functions, e.g.sperm-maturation glycoproteins in rodents, store proteins in seeds or flower development and differentiation.It is possible, therefore, that the defensive functions of PR proteins evolved after their emergence as gene families (see review in van Loon and van Strien, 1999).
It is clear that PR proteins are important not only to the plants themselves but also to any attempt to improve plants through artificial selection, and it is, therefore, of interest to establish how such proteins evolved naturally.The neutralist theory of molecular evolution (Kimura, 1983) states that the majority of nucleotide and amino acid substitutions have no adaptive consequences, although tests developed to verify this assumption (Wayne and Simonsen, 1998) in some cases reject a strictly neutral model but are unable to distinguish between different forms of natural selection.A powerful method to detect positive selection at the molecular level is the comparison of synonymous (d S ) and non-synonymous (d N ) substitution rates in genes which code for proteins using the expression w = d N /d S .If amino acid changes are advantageous w will be greater than one, while if they are deleterious w will be lower than one, with neutral mutations yielding w = 1.However, this simple ratio does not account for variable rates of selection between sites.Appropriate codon-based models were developed by Nielsen and Yang (1998) and Yang et al. (2000) who developed a series of 14 different models (M0-M13) to thoroughly investigate the variability of w ratios between sites, each model having a different assumption about the nature of the distribution that could be found.However, some of those models are hard to use and only models 0 (one-ratio), 1 (neutral), 2 (selection), 3 (discrete), 7 (beta) and 8 (beta & w) are recommended by Yang et al. (2000).
The subject of the present paper is a survey of the primary structure of representatives of 13 of the 15 PR protein families, searching for evidences of positive selection.It will be seen that the search resulted in the identification of several sites in which such a process is probably occurring.

Data retrieval
The databanks used as sources for the protein and DNA sequences were the SWISS-PROT and TrEMBL (Bairoch and Apweiler, 2000) developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute respectively.In a few cases the GenBank (National Center for Biotechnological Information, USA) was also utilized.These databanks and several other tools can be found in the Expert Protein Analysis System (ExPASy) site at http://www.expasy.org/.
The sequence search was based on the PR protein classification performed by van Loon and van Strien (1999), the descriptions and authors referenced for each PR protein in that article being used as key words for the searches.
Several searching tools were utilized with the objective of covering all the sequences registered up to now.The tool which was found to be the most efficient for the recognition and identification of the available sequences was the Sequence Retrieval System (SRS) of Etzold and Argos (1993) available on the SWISS-PROT page.This procedure was used to obtain most of the sequences and was especially useful for imperfectly described sequences because it permits several searching procedures, i.e. by bibliographic reference, organism, key word, gene name or access code.The PSI BLAST Network Service (Altschul et al., 1997) was also employed and shown to be very comprehensive, avoiding the retrieval of sequences that would not be useful in the analysis.This tool has a direct link with the SWISS-PROT access (http://www.expasy.org/sprot).

Methodology
The DNA sequences were aligned with the multiple sequence alignment CLUSTAL X program version 1.8 (Thompson et al., 1997;Jeanmougin et al., 1998) with manual corrections based on the codon pattern obtained using the DNATagger program, a color program for DNA coding alignments (Monteiro de Basso and N.M.Scherer, unpublished).Considerable care was taken to guarantee a proper analysis of the data, and archives of the alignments are available on request.
The phylogenies were estimated for each family by maximum likelihood and neighbor-joining methods using the p distance option and the HKY substitution model (Hasegawa et al., 1985) implemented in the Tree-Puzzle program (Schmidt et al., 2002).The resultant tree topologies were used to calculate the branch lengths using the M0 model through the CODEML program of the PAML packet (Yang, 1997).A relative measure of sequence divergence was calculated using the average number of nucleotide changes per codon per branch, that is S/(2T-3), where 2T-3 is the number of branches of an unrooted tree of T taxa and S is the number of nucleotide substitutions per codon along the tree.
Afterwards, analyses using the maximum-likelihood models recommended by Yang et al (2000) were implemented in the PAML program.All models were run using the F3x4 option in the PAML program, where expected codon frequencies were based upon nucleotide frequencies occurring at the three codon positions.The one-ratio model (M0) assumes one w ratio for all sites.The neutral model (M1) presupposes a proportion p 0 of conserved sites with w 0 = 0 and p 1 = 1 -p 0 of neutral sites with w 1 = 1, as would occur if almost all non-synonymous substitutions were either deleterious or neutral.The positive selection model (M2) adds an additional class of sites with frequency p 2 = 1-p 0 -p 1 and w 2 is estimated from the data.In the discrete model (M3), the probabilities (p 0 , p 1 and p 2 ) of each site which was submitted to purifying selection, neutral selection and positive selection, respectively, and their corresponding w ratios (w 0 , w 1 , w 2 ) are inferred from the data.The beta model (M7) is a null test for positive selection, assuming a beta distribution with w between 0 and 1.Finally, the beta & w (M8) model adds one extra class with the same ratio w 1 .Sites which yielded posterior probabilities higher than 95% were considered significantly affected by selection.
The likelihood ratio test (LRT) was used to verify whether the difference ratio w was significantly different from 1 for each pairwise comparison: M1 vs. M2, M0 vs. M3, M7 vs. M8 and M8a (beta & w = 1) vs. M8 (beta & w ³ 1).The LRT performs a comparison of the likelihood scores of the two models, with the constraint of w = 1 and without such constraint: LR = 2 (ln1 -ln2).This LRT statistic approximately follows a chi-square distribution and the number of degrees of freedom is equal to the number of additional parameters in the more complex model.The question of sequence divergence and the accuracy and power of the likelihood ratio test in detecting positive selection has been examined by Anisimova et al. (2001Anisimova et al. ( , 2002) ) and we have followed their recommendations in the choice of the tests.

Results
The main characteristics of 13 of the 14 recognized PR protein families, plus one that may be soon included (L.C. van Loon, personal communication) are given in Table 1.As mentioned in the introduction, these proteins have a wide array of forms with different properties but they can be considered together due to their function in plant defense.
The PR proteins studied, whose accession numbers in the data banks are displayed in Table 2, have their main characteristics listed in Table 3.They varied markedly in sizes, from only 237 bp (PR12) up to 1,110 bp (PR2), the latter being 4.7 times larger than the first.The interval of values for the transition/transversion rates was from 1.45 (PR10) to 3.42 (PR15), while the relative sequence divergencies ranged between 0.027 (PR8) and 0.423 (PR3).
Table 4 lists the parameter estimates and loglikelihood values under models of variable w ratios between sites and those obtained with the M0 model (oneratio), which assumes the same ratio for all sites.The results for PR proteins 2, 3, 5, 12, and 14 were negative, while those for PR proteins 1, 10 and 13 were borderline in the sense that significant indications for positive selection are suggested for one of the models with no confirmatory evidence from the other models.In contrast, the data for PR proteins 4, 6, 8, 9, and 15 were the most interesting, since the ML analysis inferred positive selection for those proteins in more than one model.In PR proteins 4, 6 and 8 all models examined for the presence of positively selected sites indicated the presence of such sites but for PR proteins 9 and 15 only two of them allow the same conclusion.However, the statistical significance of the difference between the null models (M0, M1 and M7) and other models required the application of likelihood ratio tests (LRTs).
We calculated four LRTs, which compared M0 with M3, M1 with M2, M7 with M8, and M8 with M8a, the latter being a procedure available in the PAML program, in which a null M8a, with w = 1 fixed is compared with the al-PR proteins molecular evolution 647    ternative M8 with the constraint that w should be equal or greater than one.The significant results of these LRTs are shown in Table 5.
The LRT results (Table 5) indicate that all comparisons for PR proteins 6, 8 and 9 were statistically significant and for PR15 in three comparisons.For PR1, PR4, PR10 and PR13 significance was achieved in the M1 vs. M2 and M0 vs. M3 comparisons only.
Let us briefly summarize the cases in which values of w significantly higher than one were obtained in at least two models.and M8.For PR4 we found that w ~2 in 2% of the sites and site 99 showed significant numbers in the M2, M3 and M8 models.A value of w ~4 was found for PR15 but only for site 48 with models M3 and M8.
It is important to observe that the M0 vs. M3 comparison is rather a test of variable selective pressure between sites than a test of positive selection (Anisimova et al. 2001,Wong et al. 2004).However, in all cases in which this comparison showed significant values, additional indications of positive selection were found in relation to other models.

Discussion
The evidence for positive selection at the molecular level is now overwhelming.Wolfe and Li (2003) list 16 genes or proteins related to defensive systems or immunity, 18 related to evading defensive systems or immunity, 12 related to male reproduction, five related to female reproduction and 22 which they placed in a miscellaneous category in which such type of selection was verified.To these we could add the results of Swanson and Aquadro (2002) on members of the antifreeze protein multigene family, of Rodríguez-Trelles et al. (2003) on the xanthine dehydrogenase gene and of Clark et al. (2003) on humanchimpanzee-mouse orthologous trios.In relation specifically to innate immunity in plants and animals, Nürnberger and Brunner (2002) identified a series of parallels between the recognition of general elicitors and pathogen-associated molecular patterns.
The approach developed by Z. Yang and collaborators has been especially useful for these investigations.A general evaluation of the methods involved in the comparison between synonymous and non-synonymous substitution rates in protein-coding DNA sequences was made by Yang (2001).Details on the accuracy and power of Bayesan approaches to this problem were presented by Anisimova et al. (2001Anisimova et al. ( , 2002) ) and Suzuki andNei (2002, 2004), while Yang and Nielsen (2002) and Yang and Swanson (2002) explored additional questions regarding codon-substitution models.
The physiological significance of the results presented in the present paper needs further investigation.Restricting our attention to the PR proteins in which clear indications of positive selection were found, we verify that PR proteins 4 and 8 are chitinases (although of different types: PR4 = types I, II; PR8 = type III), PR6 is a proteinase inhibitor, PR9 a peroxidase and PR15 an oxalate oxidase.Bishop et al. (2000) extensively discussed the molecular structures and patterns of amino acid replacements in chitinases I and III, concluding that they are basically different, and that the unusual pattern of adaptive replacements in the active site cleft of chitinase I may be due to an arms race between the plant and inhibitors developed by the pathogenic species.We have identified a higher number of sites with indications of positive selection (w ~3 in 5% of the sites) in PR8 (a chitinase III type protein) than in PR4 PR proteins molecular evolution 651 (w ~2 in 2% of the sites) which includes both types I and II chitinases.Additionally, sites 128, 131 and 145 are close to one of the catalytic sites while site 279 is near an active site residue.In contrast, no indication of positive selection was found for PR3, also a chitinase.PR6 is a proteinase inhibitor which directly acts against insects and other herbivores, micro-organisms and nematodes, positive selection at certain of its sites is therefore not surprising.
The PR9 peroxidases probably function in strengthening plant cell walls by catalyzing lignin deposition in reaction to microbial attack.Its sites 45 and 191 are located respectively between helix 2 and beta sheets, and in helix 13, near the active site of the peroxidase chain (Pfam Protein Families Database; Bateman et al., 2002) and should be functionally important.
Oxalate oxidases influence different stages of the plant's metabolism.Site 48 of PR15 may be significant in this regard, but structural data are needed to verify this possibility.
The presence of positive selection is of course determined by the role that a given protein has in the biology of a determined organism.Ongoing unpublished results of our group have detected the absence of positive selection in maturases, oleosins and auxins of several plant species and the presence of positive selection in glycoproteins belonging to four other plant species.

Table 1 -
Information about the pathogenesis-related (PR) proteins investigated in the present study.

Table 2 -
Pathogenesis-related (PR)proteins to which the sequences considered belong, and respective accession numbers in the data banks.

Table 3 -
Information about the material investigated in the present study.

Table 4 -
Parameters estimates and log-likelihood values under models of variable w ratios among sites.
The highest values of w ~6 were obtained for PR6.