Divergent evolution and purifying selection of the H (FUT1) gene in New World monkeys (Primates, Platyrrhini)

In the present study, the coding region of the H gene was sequenced and analyzed in fourteen genera of New World primates (Alouatta, Aotus, Ateles, Brachyteles, Cacajao, Callicebus, Callithrix, Cebus, Chiropotes, Lagothrix, Leontopithecus, Pithecia, Saguinus, and Saimiri), in order to investigate the evolution of the gene. The analyses revealed that this coding region contains 1,101 nucleotides, with the exception of Brachyteles, the callitrichines (Callithrix, Leontopithecus, and Saguinus) and one species of Callicebus (moloch), in which one codon was deleted. In the primates studied, the high GC content (63%), the nonrandom distribution of codons and the low evolution rate of the gene (0.513 substitutions/site/MA in the order Primates) suggest the action of a purifying type of selective pressure, confirmed by the Z-test. Our analyses did not identify mutations equivalent to those responsible for the H-deficient phenotypes found in humans, nor any other alteration that might explain the lack of expression of the gene in the erythrocytes of Neotropical monkeys. The phylogenetic trees obtained for the H gene and the distance matrix data suggest the occurrence of divergent evolution in the primates.


Introduction
The human H antigen is synthesized through the action of two α1,2 fucosyltransferases (α1,2 FT) encoded by two distinct genes, H (FUT1) and Se (FUT2), both tissue- and stage-specific. The H enzyme is responsible for the expression of the H antigen in tissues derived from mesoderm and ectoderm, such as erythrocytes and vascular endothelial cells, and the Se enzyme for the expression of the same antigen in tissues derived from endoderm, such as epithelial cells of the digestive tract and exocrine glands (Watkins, 1980; Oriol et al., 1981, 1986). The H and Se genes and the Sec1 pseudogene constitute a cluster on the long arm of human chromosome 19 (19q13.3), where H is telomeric (Ball et al., 1991; Rouquier et al., 1994, 1995; Reguigne-Arnould et al., 1995, 1996).
The human H gene consists of four exons and three introns. The last exon is the coding one, whereas the others have a regulatory function (Koda et al., 1997, 1998). Its product is a typical type II transmembrane protein with 365 amino acids (Larsen et al., 1990).
All nonhuman primates studied to date present the H antigen in their secretions, although only the great apes present it on the surface of erythrocytes and in the vascular endothelial cells (Ruffié, 1974; Socha et al., 1984; Blancher and Socha, 1997). Few studies of the structure of this gene in New World primates (Platyrrhini) have been conducted. Apoil et al. (2000) isolated a fragment corresponding to the H gene in Saimiri and Callithrix, which potentially encodes an enzyme with 365 amino acids, equivalent to the human enzyme. However, these authors did not find mutations that might explain the lack of expression of the gene in the erythrocytes of platyrrhines and prosimians, and suggested that the insertion of an Alu-Y sequence in the promoter region of the H gene, in intron 1, has led to red cell expression in humans and great apes.
In this study, we describe the complete sequence of the coding region of the H gene in New World primates, the occurrence of selection and the type of evolution affecting it, in order to clarify the evolutionary pathway followed by FUT1.
Samples were obtained from the primate sample bank of the Genetics Department of the Federal University of Pará.

Isolation, amplification and sequencing of genomic DNA
DNA was isolated according to the phenolic extraction protocol (Sambrook et al., 1989). The coding exon was amplified using the external primers F1U-103 (5' TTCGCCTTTCCTCCCCTGCA 3') and F1L-1264 (5' TGAAGCCACGTACTGCTGGC 3'), described by Yu et al. (1997). We constructed the internal primers F1U-714 (5' CAACAGCGCCTACCTCCG 3') and F1L-1005 (5' TGTGTGAGCAGGGCAAAGTC 3'), based on the sequences described by Apoil et al. (2000). PCR was performed using 1x reaction buffer, 100 ng of DNA, 0.4 mM of each primer, 0.03 U/µL of Taq DNA polymerase, 1.4 mM of MgCl2, 0.1 µg/µL of BSA and 10.0 mM of each dNTP. Thirty-five amplification cycles were carried out under the following conditions: 94 °C for 50 s, °C for 50 s and 72 °C for 90 s. Amplified fragments were purified with the Wizard® PCR Preps kit (Promega) and sequenced by the dideoxyterminal method (Sanger et al., 1977) using the Big Dye Cycle Sequencing Standard kit and an ABI 377 (Perkin Elmer) automatic sequencer. Fragments were sequenced in both strands in order to clarify ambiguous sites.
A site was interpreted as polymorphic only when double peaks were exactly overlapping.
Phylogenetic analyses were carried out with PAUP, version 4.0b10 (Swofford, 1998), using maximum parsimony, with heuristic search and random addition of taxa, and neighbor-joining methods. The confidence level of each node was verified using the bootstrap method, with 2,000 replications. The neighbor-joining analysis and the divergence matrix were based on the evolutionary model and parameters selected by Modeltest, version 3.06 (Posada and Crandall, 1998). The evolution rate was calculated as r = K/2T (Li, 1997), where "r" is the rate of nucleotide substitution, "K" is the genetic distance between two or more taxa and "T" is the divergence time between taxa, in years. The divergence times used were those proposed by Goodman et al. (1998). The relative rate test was performed as proposed by Takezaki et al. (1995).
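The rate formula r = K/2T given above can be illustrated with a minimal sketch. The function name and the input values are hypothetical and chosen only for illustration; they are not taken from the data set of this study.

```python
def substitution_rate(k: float, t_years: float) -> float:
    """Return r = K / (2T), in substitutions per site per year.

    k       -- genetic distance between the taxa (substitutions per site)
    t_years -- divergence time between the taxa, in years
    """
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    # The factor 2 accounts for substitutions accumulating along both lineages.
    return k / (2.0 * t_years)


# Hypothetical example: a distance of 0.02 substitutions/site
# between two lineages that diverged 25 million years ago.
r_example = substitution_rate(0.02, 25e6)
print(f"{r_example:.3e}")  # prints 4.000e-10
```

Under these assumed inputs, the result is of the same order of magnitude (10⁻⁹ to 10⁻¹⁰ substitutions/site/year) as rates usually reported for nuclear coding sequences.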

Results and Discussion
The coding region of the H gene was completely sequenced in 18 samples, including at least one representative of each of the fourteen genera studied, except Brachyteles, for which the coding region was partly sequenced (1,002 base pairs). The other samples (17 in total), which included the majority of the genera, were only partly sequenced (Table 1).
In most samples, the alignment of the region comprised 1,101 nucleotides. As compared to the human sequence, the Old World monkeys and all Neotropical monkeys except Callicebus moloch and Brachyteles present an insertion of three base pairs (bp) between positions 67 and 69 (codon 23), which corresponds to the deletion described by Apoil et al. (2000) for Hylobates and humans. The 3-bp deletion at positions 139 to 141 (codon 47), described by Apoil et al. (2000) in Callithrix, was also found in the other callitrichines (Saguinus and Leontopithecus). Thus, in most platyrrhines, H encodes a protein with 366 amino acids, one more than in humans, in which codon 23 of the transmembrane domain was lost. Like humans and Hylobates, the callitrichines, Callicebus moloch and Brachyteles have lost a codon, resulting in a protein with 365 amino acids.
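As a quick arithmetic check on the lengths reported above, the sketch below converts a coding-region length into a protein length. It assumes (this is not stated explicitly in the text) that the 1,101-nucleotide count includes the stop codon; the function name is hypothetical.

```python
def protein_length(coding_nt: int) -> int:
    """Number of amino acids encoded by a coding region of `coding_nt`
    nucleotides, assuming the count includes the terminal stop codon."""
    if coding_nt <= 0 or coding_nt % 3 != 0:
        raise ValueError("coding length must be a positive multiple of 3")
    # Total codons minus one stop codon.
    return coding_nt // 3 - 1


print(protein_length(1101))      # prints 366 (most platyrrhines)
print(protein_length(1101 - 3))  # prints 365 (taxa with one codon deleted)
```

Under this assumption, the loss of a single codon accounts for the difference between the 366- and 365-amino-acid proteins.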
A predominance of pyrimidine bases (55.5%) over purine bases (44.5%) was found. The mean GC content was 63%, and the GC:AT ratio was 1.70. This proportion is similar to that found by Epstein et al. (2000) in the phosphofructokinase gene and in other genes related to the development of tumors, which exhibit great gene expression activity. On the other hand, the GC content was higher than that described by Kitano et al. (1998) for the Rh and Rh50 genes (between 45% and 55%), which also encode surface antigens.
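The composition statistics above (GC content, GC:AT ratio, pyrimidine and purine fractions) can be computed as in the following minimal sketch. The function name is hypothetical, and the input used here is the F1U-103 primer sequence from the Methods, serving only as a short toy input; it is not the coding region itself.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return GC fraction, GC:AT ratio, and pyrimidine/purine fractions.

    Only unambiguous bases (A, C, G, T) are counted; gaps and
    ambiguity codes are ignored.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[b] for b in "ACGT")
    total = a + c + g + t
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {
        "gc": (g + c) / total,
        "gc_at_ratio": (g + c) / (a + t) if (a + t) else float("inf"),
        "pyrimidine": (c + t) / total,
        "purine": (a + g) / total,
    }


# Toy input: the F1U-103 primer (20 nt), not the H coding region.
comp = base_composition("TTCGCCTTTCCTCCCCTGCA")
print(comp)

# Consistency check on the reported figures: a GC content of 63%
# implies a GC:AT ratio of 0.63 / 0.37, i.e. about 1.70.
print(round(0.63 / 0.37, 2))
```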
A highly heterogeneous distribution of amino acids was observed, ranging from a minimum of 1.65% (lysine) to a maximum of 12.03% (leucine). Similarly, within each amino acid family, codons with G or C in the third position were predominant.
As compared to human sequences, the mean similarity of amino acids in platyrrhines was 90.83%, an intermediate value between those found by Apoil et al. (2000) for the Old World monkeys (96.0%) and the prosimians (87.0%). The mean similarity between the nucleotide sequences of the platyrrhine H gene and the human H, Se and Sec1 genes was 91.2%, 67.28%, and 65.71%, respectively. The high level of similarity is certainly due to the conservation of the coding region of the gene along primate evolution. This becomes even more evident when we consider the similarity between the human protein and those of other vertebrates, such as rats, rabbits and pigs (80%), and mice (75%) (Piau et al., 1994; Hitoshi et al., 1995, 1999; Cohney et al., 1996; Domino et al., 1997).
All platyrrhines shared 23 exclusive mutations, in addition to those shared with other nonhuman primates. A number of alterations exclusive to different platyrrhine genera and families were also identified. As compared to the human sequence, the H gene of the platyrrhines presented two amino acid alterations in the transmembrane domain and one in the second conserved motif (Figure 1). Aotus and S. b. ochraceus were the only taxa which presented more than one amino acid substitution in the second protein motif. None of the mutations was equivalent to those responsible for the H-deficient phenotypes found in humans, nor were any other alterations identified that might explain the lack of expression of the gene in the erythrocytes of Neotropical monkeys. These data and the high GC content reflect the functionality of the H gene in the platyrrhines, supported by the absence of mutations at the glycosylation sites and by the conservation of the amino acid serine (position 5 of the cytoplasmic tail) and of the C-terminal region of the gene, where the majority of the nonsense and frameshift mutations occur (Wagner et al., 2001). The glycosylation sites must be intact for full enzyme activity, and the amino acid Ser 5 is crucial for locating the enzyme in the Golgi complex (Christensen et al., 2000; Milland et al., 2001). These findings support the hypothesis of Apoil et al. (2000) that a factor outside the coding region of the gene is responsible for its expression in red blood cells and that this event is recent in the evolution of the gene.
Nucleotide substitution patterns revealed a predominance of transitions over transversions, with a mean ratio of 2.095. In pairwise comparisons, there was a slight predominance of C ↔ T transitions, which were 1.5 times more frequent than A ↔ G. Among transversions, there was a clear predominance (2.5 times) of C ↔ G.
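The transition:transversion ratio mentioned above rests on classifying each pairwise difference as a transition (purine ↔ purine or pyrimidine ↔ pyrimidine) or a transversion (purine ↔ pyrimidine). The sketch below shows this simple uncorrected count for one pair of aligned sequences; the function name and the toy sequences are hypothetical, and the study itself may have used model-corrected estimates.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1: str, seq2: str) -> tuple:
    """Count observed transitions and transversions between two
    aligned sequences. Gaps and ambiguous bases are skipped."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        # Same chemical class means a transition, otherwise a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv


# Toy aligned pair (hypothetical): A/G and T/C are transitions,
# G/T and T/A are transversions.
ts, tv = count_ts_tv("ACGTACGT", "GCGCACTA")
print(ts, tv)  # prints 2 2
```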
To calculate dS and dN, the primates were divided into seven groups: the order Primates, apes and humans, Old World monkeys, New World primates, and the families Atelidae, Pitheciidae and Cebidae (according to Goodman et al., 1998). Synonymous substitutions were clearly predominant, even when considering the three distinct parts of the coding region separately: the N-terminal region (NT), the transmembrane domain (TD) and the C-terminal region (CT). In fact, only in the TD region of Cebidae was dN higher than dS (Table 2), although not significantly so according to the t-test (ts < 1.65 and p > 0.05 in the one-tailed test). In the platyrrhines the dN:dS ratio was 0.34, a value similar to that found in primates by Zhang (2000) for 47 genes with moderate evolution rates (mean dN:dS ratio of 0.28), suggesting the occurrence of purifying selection. To test this hypothesis, we applied a Z-test, which resulted in a P value equal to zero, thus rejecting the null hypothesis (dS = dN) and accepting the alternative hypothesis (dN < dS). This result is different from that of Kitano et al. (1998), who found more nonsynonymous substitutions than synonymous substitutions in Rh blood group genes of primates, which is clear evidence of positive selection. This could be due to an interaction between organisms (parasites) and host mammal blood group antigens in the case of the Rh blood groups, which does not occur in the H antigen (Kitano et al., 1998).
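A one-tailed Z-test of the null hypothesis dS = dN against the alternative dN < dS can be sketched as follows. This is a minimal illustration assuming the usual form Z = (dS − dN) / sqrt(Var(dS) + Var(dN)), with the variances supplied by the user (for example, from bootstrap resampling); the text does not state which software or variance estimator was used, and all numeric inputs below are hypothetical.

```python
import math


def z_test_purifying(d_s: float, d_n: float,
                     var_d_s: float, var_d_n: float) -> tuple:
    """Return (Z, one-tailed p) for H0: dS = dN versus H1: dN < dS."""
    denom = math.sqrt(var_d_s + var_d_n)
    if denom == 0:
        raise ValueError("combined variance must be positive")
    z = (d_s - d_n) / denom
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


# Hypothetical inputs (not values from Table 2).
z, p = z_test_purifying(d_s=0.10, d_n=0.034, var_d_s=0.0004, var_d_n=0.0001)
print(f"Z = {z:.2f}, p = {p:.4f}")
```

In this form, a large positive Z with a small p favors dN < dS (purifying selection), whereas Z = 0 would correspond to p = 0.5 and no rejection of the null hypothesis.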
The estimated evolution rate of H in the platyrrhines was 0.462 × 10⁻⁹ substitutions/site/year (Table 3). In the families, rates varied between 0.384 × 10⁻⁹ in Cebidae and 0.668 × 10⁻⁹ in Atelidae. These results indicate, according to the relative rate test, that the H gene is evolving at a constant rate, both in the platyrrhines and in the order Primates. These values are in agreement with the suggestion of Barreaud et al. (2000) and Bureau et al. (2001), who observed that the evolution rate of H was intermediate among the α1,2 FT genes. This finding supports the hypothesis of selective pressure, given that selection, structural and functional requirements are the main factors which determine the evolution rate of a protein (Duret and Mouchiroud, 2000; Tourasse and Li, 2000).
Because of its broad expression, the H gene should be under relatively high selective pressure, given that the type of protein it encodes tends to be more conserved than tissue-specific ones (Hastings, 1996). If a protein, or part of a protein, has strict structural or functional requirements, the coding gene must be under strong selective pressure, which limits the alterations in the gene product. As a consequence, this gene will evolve more slowly, which explains why certain functionally critical regions, such as catalytic sites or ligation domains, are better conserved in the molecule (Tourasse and Li, 2000). This characteristic is clearly apparent in the present study, where the number of synonymous substitutions was almost four times greater than that of nonsynonymous ones, especially in the conserved motifs of the α1,2 FT protein. This corroborates the theory of Duret and Mouchiroud (2000), who suggested that a reduction in dN could be related to an increase in selective pressure on the amino acid sequence of the protein. No evidence of saturation was found for the H gene in any of the primates analyzed (Figure 2).
Maximum parsimony analyses resulted in a single most parsimonious tree, which was identical to that obtained by neighbor-joining analysis, with similar bootstrap values. Therefore, we present only the parsimony arrangement (Figure 3). The phylogenetic relationships are similar to those proposed for New World monkeys (Schneider et al., 1996; Goodman et al., 1998; Schneider, 2000).
The genetic distance matrix shows low divergence rates, with the highest intrageneric value found in Callicebus (1.23%), and the highest intergeneric value for Pir x Ss1 (5.56%). The considerable overall similarities reinforce once again the highly conserved nature of the sequences.
The results of the genetic distance matrix, which shows low substitution rates, the agreement between the gene tree and the proposed phylogeny of New World monkeys (Schneider et al., 1996;Goodman et al., 1998;Schneider, 2000), the absence of saturation and the common nucleotide alterations shared by all Neotropical primates

Figure 1 - Amino acid sequence of the FUT1 gene. The lozenges (◊) indicate the glycosylation sites described by Larsen et al. (1990) and Apoil et al. (2000); the box indicates the position of the transmembrane domain (TD) and of the three conserved motifs described by Oriol et al. (1999).

Figure 2 - Plot of the saturation test of the H gene. The graph shows the absence of saturation in the studied samples.

Table 1 - Samples used in the present study, their respective codes, accession numbers, number of base pairs sequenced and origin, where known.

Table 2 - Rates of synonymous (dS) and nonsynonymous (dN) substitutions per site of the H gene for the different primate groups analyzed.