The complete mitochondrial genome of the pirarucu (Arapaima gigas, Arapaimidae, Osteoglossiformes)

We sequenced the complete mitochondrial genome of the pirarucu, Arapaima gigas, the largest fish of the Amazon basin and economically one of the most important species of the region. The total length of the Arapaima gigas mitochondrial genome is 16,433 bp. The mitochondrial genome contains 13 protein-coding genes, two rRNA genes and 22 tRNA genes. Twelve of the thirteen protein-coding genes are coded on the heavy strand, while nad6 is coded on the light strand. The Arapaima gene order and content are identical to the common vertebrate form, as are codon usage and base composition. Its control region is atypical in being short at 767 bp. The control region also contains a conserved ATGTA motif recently identified in the Asian arowana and three conserved sequence blocks (CSB-1, CSB-2 and CSB-3), and its 3' end contains a long series of di- and mono-nucleotide microsatellite repeats. Other osteoglossiform species for which control region sequences have been published show similar control region characteristics.


Introduction
Comparing complete animal mitochondrial genome sequences is becoming an increasingly common method of phylogenetic reconstruction and of modeling genome evolution. Mitochondrial genomes from over 300 vertebrate species, with a large concentration on teleost fishes, have now been sequenced (Boore, 1999; Curole and Kocher, 1999; Inoue et al., 2001; Miya et al., 2001; Miya et al., 2003). Mitochondrial sequences have proven to be of great utility in molecular phylogenetic studies, providing a large number of phylogenetically informative characters, and complete genome sequences have provided valuable insights into a number of deep-level phylogenetic questions (e.g., Boore and Brown, 1998; Mindell et al., 1998; Naylor and Brown, 1998; Miya et al., 2001; Inoue et al., 2003a; Miya et al., 2003; Brinkmann et al., 2004).
The mitochondrial genome is typically compact at ~16 kb with few, if any, intergenic spacers. The two noncoding regions, which usually represent less than 5% of the total genome size, are the control region, which contains the heavy-strand replication origin and is involved in regulating transcription and replication (Clayton, 1982; Shadel and Clayton, 1997), and the light-strand replication origin (Wong and Clayton, 1985). While the structural properties of the control region are important in transcription and replication, the actual sequence of nucleotides is relatively free to vary (Shadel and Clayton, 1997), making the control region a popular candidate for population-level and phylogeographic studies (Avise, 2004).
Complete mitochondrial genomes are available for a number of species of the order Osteoglossiformes, but not for Arapaima gigas. A peculiarity of all of the osteoglossiform genomes deposited in GenBank (AB043025 and AB043068, Inoue et al., 2001), with the exception of Scleropages formosus, is that they are all missing the control region. In their conservation genetic study, Hrbek et al. (2005) were also unable to amplify the control region of the majority of the individuals used in the study in spite of designing specific, highly stringent primers. Those individuals that amplified often produced only a weak product; amplification of different individuals resulted in different sized products, and some individuals also showed multiple bands. These results pointed to the possible presence of repeats and secondary structures that would prevent efficient amplification and sequencing of this region. Multiple PCR bands suggested possible mtDNA heteroplasmy. Consultation with the authors of the Osteoglossum and Pantodon mitochondrial genomes revealed that they also were unable to efficiently amplify and sequence the control region; only the 5' portions of the control regions are deposited, and these are characterized by a large series of tandem repeats. In a recent publication characterizing the complete mitochondrial genome of the Asian arowana Scleropages formosus, Yue et al. (2006) observed tandem repeats in the control region, and also mitochondrial heteroplasmy. Therefore, we sequenced the complete mitochondrial genome of Arapaima gigas, including the control region, in order to characterize the genome and assess its potential phylogenetic utility, and that of its genes and gene regions.

Laboratory protocols
The tissue sample used in this analysis was obtained from a specimen captured in a participatively managed fishery area north of the city of Santarém. A white muscle tissue sample was collected, preserved in 95% ethanol and transported to the laboratory. Total genomic DNA was extracted using Qiagen spin-columns according to the manufacturer's protocol.
To assure fidelity of priming, we used a touch-down PCR method. The temperature profile consisted of 1) preheating at 68 °C for 60 s, 2) denaturation at 93 °C for 10 s, 3) annealing at 55-50 °C for 35 s, 4) extension at 68 °C for 7 min, and 5) a final extension at 68 °C for 10 min. Steps 2-4 were repeated 25 times; in the first 9 cycles, the annealing temperature was decreased by 0.5 °C until an annealing temperature of 50 °C was reached. Using this methodology we amplified three overlapping segments.
PCR products were evaluated on a 1% agarose gel, and then purified with Qiagen spin-columns. The sequencing strategy employed the 'primer walking' methodology. We sequenced each amplified fragment with the two amplification primers, and upon obtaining the sequence information we performed additional sequencing reactions with internal primers available in the laboratory or specifically designed primers until sequence data were obtained for the complete fragment. Many of the additional primers used in this study were derived from primers published in Miya and Nishida (1999); see Table 1 for primers.
Cycle sequencing PCR followed the manufacturer's recommended protocol for the DYEnamic ET Dye Terminator mix (GE Healthcare); the primer annealing temperature was 50 °C and we used ~30 ng of purified PCR product. Cycle sequencing PCR products were precipitated using a mixture of 70% ethanol and 175 mM ammonium acetate. The precipitated DNA product was resuspended in Hi-Di Formamide and resolved on a MegaBACE 1000 automatic DNA analysis system (GE Healthcare) using the manufacturer's recommended settings.
The complete genome was re-sequenced, providing a 2x genomic coverage. It was then aligned and annotated against the mitochondrial genome of Osteoglossum bicirrhosum (GenBank# AB043025).

Data analysis
Orthologous protein-coding regions were aligned in Clustal W (Thompson et al., 1996), and the alignment was confirmed by conceptually translating protein-coding DNA regions into amino-acid sequences in BioEdit (Hall, 1999). Alignments of ribosomal and transfer RNAs were constructed in Clustal W (Thompson et al., 1996) and manually adjusted, if necessary, to conform to secondary structural models (Kumazawa and Nishida, 1993; Ortí et al., 1996; Wang and Lee, 2002; Waters et al., 2002). Codon usage frequencies and the amino acid composition of the genome were inferred in the program MEGA 3.1 (Kumar et al., 2004). Mitochondrial gene regions were tested for an anti-G bias, characteristic of mitochondrial DNA genes but not of the nuclear genome, to support our conclusion that we have collected genuine mitochondrial DNA data (Zhang and Hewitt, 1996). Hairpins in the control regions were inferred using the software mFold (Zuker, 2003) implemented on the website www.idtdna.com.
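
As an illustration of the anti-G test mentioned above, base frequencies can be tallied separately for the first, second and third codon positions of a reading frame. The following is a minimal Python sketch and not the MEGA procedure used in this study; the function name and the toy sequence are hypothetical and serve only as an example.

```python
from collections import Counter

BASES = "ACGT"

def composition_by_codon_position(cds):
    """Return base frequencies (as fractions) at codon positions 1, 2 and 3.

    Assumes `cds` starts in frame; a trailing incomplete codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    result = {}
    for pos in range(3):
        # Every third base, starting at this codon position.
        counts = Counter(cds[pos:usable:3])
        total = sum(counts[b] for b in BASES)
        result[pos + 1] = {b: counts[b] / total for b in BASES}
    return result

# Hypothetical toy reading frame, NOT taken from the Arapaima genome.
toy_cds = "ATGCTAACCCTAGCATTCCTA"
freqs = composition_by_codon_position(toy_cds)
for pos, f in freqs.items():
    print(pos, {b: round(v, 3) for b, v in f.items()})
```

A markedly low G fraction at the third position, compared with the first and second positions, would be the pattern expected of genuine mitochondrial coding sequence.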

Results
The total length of the Arapaima gigas mitochondrial genome is 16,433 bp. The genome sequence is deposited in GenBank under the accession number EF523611. The Arapaima gene order and content (Figure 1 and Table 2) are identical to the ancestral vertebrate state (e.g., Inoue et al., 2001, 2003a; Miya et al., 2003). The genome codes for one subunit of cytochrome b (cob), which forms part of the ubiquinol cytochrome c oxidoreductase complex; three subunits of cytochrome oxidase (cox), which form part of the cytochrome c oxidase complex; seven subunits of NADH dehydrogenase (nad), which form part of the nicotinamide adenine dinucleotide ubiquinone oxidoreductase complex; and two subunits of ATP synthase (atp). It also contains the small (rrnS) and the large (rrnL) subunit ribosomal RNA genes and 22 tRNA genes (trn). A noncoding control region located between the trnP and trnF genes contains the origin of heavy strand replication (O_H), and the light strand replication origin (O_L) is found between the tRNA genes trnN and trnC.

Hrbek and Farias
Table 1 - Primers used in the amplification and sequencing of the complete mitochondrial genome of Arapaima gigas. By convention, the primer designations correspond to their 3' position in the human mitochondrial genome (Anderson et al., 1981). H and L designate the heavy and the light strand, respectively. Many of the primers reported for the first time in this study are used in ongoing studies in our laboratory, or were derived from primers published in Miya and Nishida (1999).
The reference Arapaima gigas individual that we sequenced had a relatively short control region of 787 bp. A schematic representation is shown in Figure 2. As in the typical vertebrate mitochondrial genome, this non-coding region contains the heavy-strand replication origin (O_H) and can be divided into three different domains (Brown et al., 1986). Domain I is only 147 nucleotides long and contains a 23 bp thermo-stable hairpin suggested to be involved in the regulation of replication of the mitochondrial genome (Buroker et al., 1990); however, it does not appear to contain the termination associated sequence (Doda et al., 1981). Domain II, the central conserved block, extends from nucleotide 148 to 386. Domain III contains three conserved sequence blocks (CSB-1 at position 461-486; CSB-2 at position 577-593; and CSB-3 at position 620-637). Similar to the results reported by Broughton (2001), CSB-1 was the least conserved, while CSB-3 was the most highly conserved of the three blocks. In domain III, a 14 unit AT repeat is present at position 678-705, and shortly thereafter it is followed by mono-nucleotide adenine and thymine repeats. These are mono-nucleotide runs of 5 A, 11 T, 6 T and 9 A residues, separated by short non-repeat regions. As expected, the control region is heavily biased against guanine with a composition of 0.342 (A), 0.217 (C), 0.132 (G), and 0.309 (T). The O_L region contains a highly conserved hairpin loop with a perfectly complementary base-pairing stem (CCTCCGCCT/AGGCGGGAGG). The secondary structure of the O_L has been suggested to regulate light-strand replication (Wong and Clayton, 1985).

With the exception of cox1, which starts with GTG, all protein-coding genes begin with the ATG start codon; stop codons include 12 TAA, six of which are incomplete, and one AGA (Table 2). Incomplete stop codons are common in mitochondrial genes, and TAA stop codons are created via posttranscriptional polyadenylation of the 3' end of the mRNA (Ojala et al., 1981). Reading frames of three pairs of genes, atp8-atp6, nad4L-nad4 and nad5-nad6, overlap by several nucleotides, a pattern which is also common in other vertebrate mitochondrial genomes. Some genes are separated by non-coding spacers of up to five nucleotides (Table 2).
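
The completion of an incomplete stop codon (T or TA) into TAA by polyadenylation can be mimicked computationally when annotating reading frames. The sketch below is illustrative only: the function names and the toy sequences are hypothetical, and it simply pads a gene sequence with A residues until its length is a multiple of three.

```python
def complete_stop_codon(cds):
    """Pad a coding sequence with A residues (standing in for the poly-A tail)
    until its length is a multiple of three."""
    cds = cds.upper()
    missing = (-len(cds)) % 3  # 0, 1 or 2 bases needed to finish the last codon
    return cds + "A" * missing

def ends_with_taa(cds):
    """True if the padded sequence terminates in a TAA stop codon."""
    return complete_stop_codon(cds)[-3:] == "TAA"

# Hypothetical toy genes, NOT taken from the Arapaima genome.
print(complete_stop_codon("ATGCCCT"))    # gene ending in an incomplete stop, T--
print(complete_stop_codon("ATGCCCTA"))   # gene ending in an incomplete stop, TA-
print(complete_stop_codon("ATGCCCTAA"))  # gene with a complete TAA stop
```

A gene whose padded sequence does not end in TAA (or in another stop codon of the vertebrate mitochondrial code, such as AGA) would warrant a second look at its annotated boundaries.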
Nucleotide usage in the 13 protein-coding genes consisted of 28.4% A, 26.1% C, 13.8% G and 30.7% T bases. These values were similar to those observed in other osteoglossiform fishes and show a strong anti-G bias. The anti-G bias was especially pronounced in the third position of the 12 heavy-strand encoded genes, which consisted of 41.7% A, 28.1% C, 3.8% G and 26.4% T bases (Figure 3; Table 3). The rank order of nucleotide usage frequency at the third codon position is the same as in Osteoglossum bicirrhosum and Pantodon buchholzi; however, in Scleropages formosus the rank order of A and C is reversed. The most frequently encoded amino acids were leucine (16.65%), followed by threonine (8.25%), isoleucine (8.17%) and alanine (7.72%). The least common amino acid was cysteine (0.79%); see Table 4.
All 22 Arapaima mitochondrial tRNA genes possess anticodons that match the vertebrate mitochondrial genetic code (Kumazawa and Nishida, 1993). Each tRNA sequence may be folded into a cloverleaf structure with 7 bp in the aminoacyl stem, 5 bp in the TΨC and anticodon stems, and 4 bp in the DHU stem. tRNA stem regions include some non-complementary base pairings, a pattern also commonly observed in other vertebrates (e.g., Kumazawa and Nishida, 1993). The 3' CCA nucleotide tail of mature tRNAs is most likely added post-transcriptionally (Roe et al., 1985).
When the rrnS and rrnL genes are transcribed into putative rRNAs, both rRNA sequences may also be folded into secondary structures. Stem regions appear to be conserved, whereas loop regions are somewhat more variable relative to other vertebrate sequences. The functional requirement for specific base pairing appears to constrain the evolution of stems relative to some portions of loops (e.g., Sullivan et al., 1995; Ortí et al., 1996; Wang and Lee, 2002).

Discussion
One of the motivations of our study was to sequence the mitochondrial control region and determine the factors which impeded its efficient use in phylogeographic and population genetic analyses (Hrbek et al., 2005). In spite of designing specific, highly stringent primers, Hrbek et al. (2005) were unable to obtain amplification results in the majority of the individuals used in their conservation genetic study. Not all individuals amplified, and those that did often produced only a weak product, frequently with large size differences among PCR products. PCR amplification of a number of individuals also produced multiple bands, suggesting possible mtDNA heteroplasmy. Inoue et al. (2001) encountered similar difficulties (pers. comm.), and for these reasons did not characterize the control regions of the two osteoglossiform species Osteoglossum bicirrhosum (GenBank# AB043025) and Pantodon buchholzi (GenBank# AB043068). A recent study by Yue et al. (2006) characterized the complete mitochondrial genome of the Asian arowana Scleropages formosus and reported tandem repeats in the 5' and 3' ends of the control region, as well as mitochondrial heteroplasmy. These observations suggest that the control region anomalies observed in Arapaima may be a general property of the control regions of the fishes of the order Osteoglossiformes.
The control region of the reference specimen of Arapaima gigas is relatively short at 787 bp. It contains three domains (Brown et al., 1986). Domain I appears to lack the termination associated sequence (Doda et al., 1981), but it does contain a 23 bp thermo-stable hairpin (Figure 2). The hairpin contains the ATGTA/TACAT motif which Yue et al. (2006) also observed in other osteoglossiform fishes and in fishes of the genus Anguilla. Previously this motif was observed only in mammals (Saccone et al., 1991) and the lungfish (Zardoya and Meyer, 1996). Although not pointed out by Yue et al. (2006), the hairpin is actually inverted (TACAT/ATGTA) in the phylogenetically closely related Asian (Scleropages formosus) and silver (Osteoglossum bicirrhosum) arowanas, but not in other osteoglossiform species. The inverted repeat appears to be a molecular synapomorphy for the Scleropages + Osteoglossum clade. As in Scleropages formosus, domain III of Arapaima gigas contains microsatellite repeats; specifically, domain III of the reference individual contains a 14 unit AT repeat followed by mono-nucleotide repeat sequences of A (5x), T (11x), T (6x) and A (9x). Repeats in both the 5' end (domain I) and the 3' end (domain III) of the control region are rare and currently have only been reported in Scleropages formosus (GenBank# DQ023143, Yue et al., 2006).
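
Di- and mono-nucleotide repeats of the kind described above can be located in a sequence with a simple pattern search. The following Python sketch is an illustration under stated assumptions rather than the procedure used in this study: the function name, the minimum-unit thresholds and the toy sequence are all hypothetical, and the toy sequence merely imitates the layout of an AT repeat followed by A and T runs.

```python
import re

def find_tandem_repeats(seq, unit, min_units):
    """Return (start, end, n_units) for each run of `unit` repeated at least
    `min_units` times. Positions are 1-based and inclusive."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_units))
    hits = []
    for m in pattern.finditer(seq.upper()):
        n_units = (m.end() - m.start()) // len(unit)
        hits.append((m.start() + 1, m.end(), n_units))
    return hits

# Hypothetical toy sequence, NOT the Arapaima control region.
toy = "GGC" + "AT" * 14 + "GC" + "A" * 5 + "CG" + "T" * 11

print(find_tandem_repeats(toy, "AT", 5))  # dinucleotide AT repeat
print(find_tandem_repeats(toy, "A", 5))   # adenine run
print(find_tandem_repeats(toy, "T", 5))   # thymine run
```

Note that a 14 unit dinucleotide repeat spans 28 nucleotides, which is why the reported coordinates of such a repeat cover 28 positions.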
Although only preliminarily characterized, a similar pattern of repeats is observed in the 5' and 3' ends of the control region of Heterotis niloticus, the African sister taxon of Arapaima gigas. The control regions of Osteoglossum bicirrhosum (GenBank# AB043025) and the African Pantodon buchholzi (GenBank# AB043068) also contain large and complex tandem repeats in the 5' end of the control region, which corresponds to the domain I hairpin (Yue et al., 2006). However, confirmation of the exact pattern and structure of the control regions of these species is not possible since the central and 3' end portions of the control regions are not available. The control region of the mormyrid Gnathonemus petersii lacks large blocks of repeats, and appears to contain all three CSBs. The mormyrids together with hiodontids and notopterids are the sister clade to the clade containing the genera Arapaima, Heterotis, Scleropages, Osteoglossum and Pantodon (Nelson, 1994; Sullivan et al., 2000), which suggests that the observed control region peculiarities are phylogenetically restricted.
The mitochondrial control region regulates replication of the heavy strand and transcription (see review in Shadel and Clayton, 1997). Together with the conserved sequence blocks, which are involved in positioning RNA polymerase for transcription and for priming replication (Clayton, 1991; Shadel and Clayton, 1997), an important regulatory element is the termination associated sequence (TAS) normally observed in domain I. TAS appears to act as a signal for termination of D-loop strand synthesis; however, it could not be identified in our reference individual. We speculate that the 23 bp thermo-stable hairpin found in domain I (Figure 2) may take on the role of a signal for termination of D-loop strand synthesis in the absence of TAS. This hypothesis is contrary to that of Yue et al. (2006), who suggest that the domain I hairpin may be a binding site for proteins involved in replication. Elucidating the role of the domain I hairpin and understanding the consequence of the apparent lack of TAS for mitochondrial replication and for the transcription of mitochondrial genes, if any, will, however, require biochemical and cell molecular studies. The second major regulatory region, the light strand replication origin, is found between the genes trnN and trnC. It is represented by a 35-bp non-coding sequence which may be folded into a secondary structure consisting of a perfect 9-bp stem and a 13-bp loop. Secondary structures at the light strand replication origin may act as initiation signals for light strand replication (Wong and Clayton, 1985) and appear to be fully functional.
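
Whether two sequence arms can pair into a perfect hairpin stem reduces to checking that one arm is the reverse complement of the other. The short Python sketch below illustrates the check with the ATGTA/TACAT motif pair named in the text; the function names are hypothetical, and the check ignores thermodynamic stability, which in this study was assessed with mFold.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def can_form_perfect_stem(arm5, arm3):
    """True if the 3' arm is the exact reverse complement of the 5' arm,
    i.e. the two arms can pair without mismatches."""
    return reverse_complement(arm5) == arm3.upper()

print(reverse_complement("ATGTA"))
print(can_form_perfect_stem("ATGTA", "TACAT"))
print(can_form_perfect_stem("ATGTA", "TACAA"))  # one mismatch
```

The same function can be applied to the two arms of any candidate stem, such as those flanking the loop of a replication origin, before a full folding analysis is attempted.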
Protein coding genes are characterized by an anti-G bias which is particularly strong at the third codon position, where G is present at only 3.8% frequency (Figure 3; Table 3). The anti-G bias may be due in part to selection against less stable G nucleotides on the light strand, which is exposed as a single strand for a considerable length of time during the asymmetrical replication of mtDNA (Clayton, 1982). A further implication of the model of Clayton (1982) has been pointed out by Reyes et al. (1998). The deamination of cytosine into uracil and adenine into hypoxanthine on the heavy strand would lead to a decrease in G content in the light strand, and an increase in G on the heavy strand. The low G content observed in mitochondrial genes may, thus, also be a result of the asymmetrical replication of the mitochondrial genome. A still further contribution to the anti-G bias may result from the preference for adenine during mRNA transcription, as ATP is generally the most common ribonucleotide available in the mitochondria and, thus, is most efficiently transcribed (Xia, 1996). In contrast to G, the most commonly used nucleotide is A, which is also the most commonly used nucleotide in Osteoglossum bicirrhosum and Pantodon buchholzi, but not in Scleropages formosus. Amino acid usage is also similar to that observed in other osteoglossiform fishes, and is heavily biased towards the use of leucine.
It is clear that the control region patterns, or their variations, observed in Arapaima gigas are also observed in other osteoglossiform fishes. What is unclear is whether these control region characteristics are due to phylogenetic conservatism or to homoplasy. No matter what the mechanism, the control region is unlikely to be a suitable marker for phylogeographic and population-level studies due to large stretches of repeats and secondary structures which make amplification and sequencing difficult. Further complications arise due to mitochondrial heteroplasmy potentially caused by slip-strand replication (Macey et al., 1997c) of the domain I hairpin and/or of the domain III microsatellite region.
The availability of the complete genome of Arapaima gigas will facilitate molecular population studies of both the pirarucu and other osteoglossiform fishes, such as the two species of arowana Osteoglossum bicirrhosum and Osteoglossum ferreirai, and the aquaculturally important African species Heterotis niloticus. The mitochondrial genome is composed of a mosaic of highly conserved and highly variable sections among the evolutionarily divergent Arapaima gigas and Osteoglossum bicirrhosum. This characteristic greatly facilitates choosing appropriately informative genomic regions for particular questions, as well as primer design for other osteoglossiform species.

Figure 1 - Schematic map of the complete mitochondrial genome of Arapaima gigas.

Figure 2 - Schematic map characterizing the control region of Arapaima gigas. Indicated are the three domains, a potentially regulatory hairpin in domain I, the three conserved sequence blocks in domain III, and a series of repeats in domain III.

Figure 3 - Nucleotide composition of the 12 mitochondrial genes coded on the heavy strand. Nucleotide composition at first, second and third positions for individual genes is presented on the left side. On the right side, averages across all genes are presented.

Table 2 - Gene organization of the Arapaima gigas mitochondrial genome.

Table 3 - Codon usage of the Arapaima gigas mtDNA.

Table 4 - Amino acid usage (%) in the 13 protein coding genes of the Arapaima gigas mtDNA.