Comparative analysis of codon usage patterns in Rift Valley fever virus

Abstract Rift Valley fever virus (RVFV) is a vector-borne pathogen and is the most widely known virus in the genus Phlebovirus. Since it was first reported, RVFV has spread to western Africa, Egypt and Madagascar from its traditional endemic region, and infections continue to occur in new areas. In this study, we analyzed genomic patterns according to the infection properties of RVFV. Among the four segments of RVFV, the nucleotide composition, overall GC content and the difference of GC composition in the third position of the codons (%GC3) between groups were the largest in the S (NP) segment, showing that more diverse codons were used than in other segments. Furthermore, the results of CAI analysis of the S (NP) segment showed that viruses isolated from regions where no previous infections had been reported had the highest values, indicating greater adaptability to human hosts compared with other viruses. This result suggests that mutations in the S (NP) segment co-evolve with the infected hosts and may lead to expansion of the geographic range. The distinctive codon usage patterns observed in specific genomic regions of a group with similar infection properties may be related to the increasing likelihood of RVFV infections in new areas.


Introduction
Recently, infection with Rift Valley fever virus (RVFV) was reported for the first time in China (Liu et al., 2017). The virus was identified in a patient returning to China from Angola, so the infection was not acquired in China; however, no RVFV infections had previously been reported in Angola either (Liu et al., 2017). Since infection and transmission between lambs were first reported in the Rift Valley of Kenya in 1930, RVFV has continued to cause infections (Daubney et al., 1931). Previously, RVFV infections were found mostly in parts of Africa such as Kenya, but infections are increasing outside the traditional endemic region, such as in the Middle East and Europe (Madani et al., 2003; Chevalier et al., 2010; Grobbelaar et al., 2011). This trend indicates that RVFV is highly likely to cause infections in new areas.
RVFV is a vector-borne viral pathogen in the genus Phlebovirus and is known to cause zoonotic infections and change hosts via mosquitoes (Aedes spp., Culex spp., Anopheles spp., etc.) (Elliott, 1997;Bouloy and Weber, 2010). RVFV is an enveloped negative single-stranded RNA virus and ranges in size from 80 to 120 nm (Ellis et al., 1979;Pepin et al., 2010). RVFV has a circular three-segment genome, and these segments form a panhandle secondary structure due to cDNA sequences at the end of each segment (Hewlett et al., 1977;Boshra et al., 2011). Different proteins are encoded in each segment. The L segment encodes RNA polymerase used in the replication and mRNA transcription processes (Gerrard and Nichol, 2007). The M segment encodes two glycoproteins (Gn and Gc) that are required for viral entry and assembly and a nonstructural protein that inhibits cell apoptosis (Gerrard and Nichol, 2007). The S segment, with ambisense characteristics, encodes nucleoproteins that induce a host immune response in the antisense orientation, and nonstructural (NS) proteins that damage the host genome and function as an interferon antagonist in the complementary orientation (Gerrard and Nichol, 2007). Infections with RVFV can lead to serious illness, including retinitis, hepatitis, renal failure, meningoencephalitis, and severe hemorrhagic diseases, and can cause death in humans (Bird et al., 2009). As there are currently no effective vaccines or treatments for RVFV, the emergence of RVFV in new areas may lead to serious public health problems (Faburay et al., 2017). RVFV infection is mainly spread by mosquitoes, and therefore the area infected with RVFV is limited by the habitat distribution of its mosquito vectors (Tantely et al., 2013). However, recent climate change and increasing international trade have resulted in migration and expanded habitat for the vectors, allowing RVFV infection to occur in unexpected areas (Chevalier et al., 2010;Tantely et al., 2013). 
In this study, we analyzed the infection properties of RVFV based on previously reported sequence information.

Data collection
Sequence data were downloaded from the National Center for Biotechnology Information (NCBI) GenBank database (https://www.ncbi.nlm.nih.gov/genbank/) in order to compare the genetic characteristics of RVFVs that infect humans. Only RVFV sequences isolated from infected humans were included in this analysis.

Phylogenetic analysis
Phylogenetic analysis was performed on the L, M, and S (NP and NS) segments using the program MEGA7 (http://www.megasoftware.net) to examine the evolutionary relationships among RVFVs by region and time (year) (Kumar et al., 2016). Sequence alignment was performed with MUSCLE in MEGA7, and the maximum likelihood (ML) method based on the Tamura-Nei model was used to construct phylogenetic trees (Tamura and Nei, 1993; Kumar et al., 2016). The robustness of the tree topology was assessed with 1,000 bootstrap replicates.

Codon usage analysis
Analysis of codon usage bias in viruses provides information on molecular evolution; it can also improve understanding of the regulation of viral gene expression and help to identify the efficient expression process of viral proteins required to evade immune responses (Shackelton and Holmes, 2004; Butt et al., 2014). In this study, genomic patterns were compared by analyzing the nucleotide composition features of each segment, and codon usage bias was evaluated using the effective number of codons (ENC). The ENC value is 20 if only one synonymous codon is used for each amino acid and ranges up to 61 if all synonymous codons are used equally (Wright, 1990). There is an inverse relationship between ENC and gene expression. A lower ENC value indicates strong codon usage bias and elevated gene expression, while a higher ENC value indicates a diversity of codons encoding amino acids and lower gene expression (Wright, 1990). Generally, an ENC value > 35 suggests that there is a relatively conserved genomic composition (Comeron and Aguadé, 1998). Furthermore, differences in the preference of codons for a single amino acid were examined using relative synonymous codon usage (RSCU) values (Sharp and Li, 1986). Each amino acid can be encoded by one to six different synonymous codons, and codons encoding the same amino acid tend to show preferential usage (Plotkin et al., 2006). Generally, codons with RSCU values > 1.0 are more preferred (abundant codons), while those with RSCU values < 1.0 are less preferred (less-abundant codons). An RSCU value of 1.0 indicates that all codons were used randomly or equally (Sharp and Li, 1986). In this study, codon usage patterns were analyzed using the tools of the Gene Infinity website (http://www.geneinfinity.org/sms/sms_codonusage.html), and codon adaptation index (CAI) values were calculated for comparison of general codon usage patterns among the virus and its hosts, human and mosquito, using the CAIcal program (ver. 1.4, http://genomes.urv.cat/CAIcal).
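As an illustration of the RSCU definition used here, the following minimal Python sketch computes RSCU values for a single coding sequence as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is not the Gene Infinity implementation; the function name `rscu`, the use of the standard genetic code, and the choice to skip amino acids that do not occur in the sequence are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence (hypothetical helper)."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the sequence: RSCU undefined
        for c in family:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values
```

For example, `rscu("CTGCTGCTGTTA")` gives 4.5 for CTG and 1.5 for TTA, because leucine has six synonymous codons and is encoded four times in this toy sequence.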

Phylogenetic relationships and classification of RVFV
Phylogenetic trees were constructed for each segment (S [NS, NP], M, and L) of the RVFV genome. RVFVs were grouped according to the infected region (country) and time (year) in the constructed trees (Figure 1). This result indicated that RVFVs do not cause infections with the same genetic composition; rather, the genomic features of this virus vary with region and time due to mutations, which can also lead to changes in viral infection patterns. RVFV infections do not maintain the same level of toxicity every year, and the reported death rate due to the virus varies ac- [...] tion, 2007; Hassan et al., 2011). This result suggests that mutations in RVFV may affect its toxicity. Although the genetic lineages (A-G) of RVFV have been classified by previous studies (Bird et al., 2007; Ikegami, 2012), the groups of RVFV in this study were re-classified based on phylogenetic analysis of the collected sequences, because previous studies did not consider the sequences of RVFV that occurred in the 2000s. Therefore, based on these results, codon usage patterns of the five groups of RVFV (Group 1: Kenya [2006][2007]

Nucleotide composition of the CDS region in RVFV
Four CDS regions were analyzed for each segment to compare the nucleotide compositions of the five groups identified in phylogenetic analysis (Table 2). In the L, M and S (NS) segments, no significant difference in base composition was detected. In contrast, the nucleotide composition features of each group in the S (NP) segment showed a difference in composition of the third base. The third bases A (A3), C (C3), T (T3), and G (G3) of the S (NP) segment had overall frequencies in the range of 17.48-21.09%, 21.22-23.48%, 25.61-27.66%, and 29.73-33.44%, respectively. These results show that among the four CDS regions of RVFV, the S (NP) segment may be a useful indicator for identifying the genetic properties of RVFVs.
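The third-position frequencies (A3, C3, T3, G3) and the derived %GC3 value can be obtained by counting the base at the third position of every codon. The sketch below is a minimal Python illustration under stated assumptions: the function name `third_position_composition` is hypothetical, every in-frame codon is counted (including Met, Trp, and any stop codon), and ambiguous bases are ignored. The tool actually used in this study may apply different conventions.

```python
def third_position_composition(cds):
    """Return percentage of A, C, G, T at codon position 3, plus GC3 (hypothetical helper)."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # drop any trailing partial codon
    third = [seq[i + 2] for i in range(0, usable, 3)]
    third = [b for b in third if b in "ACGT"]  # ignore ambiguous bases such as N
    n = len(third)
    if n == 0:
        raise ValueError("sequence contains no complete codons")
    comp = {f"{b}3": 100.0 * third.count(b) / n for b in "ACGT"}
    comp["GC3"] = comp["G3"] + comp["C3"]
    return comp
```

For the toy sequence `"ATGGCCAAATTT"`, the third positions are G, C, A and T, so each base accounts for 25% and GC3 is 50%.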

Compositional properties of the CDS region of RVFV
The %GC, %GC3, ENC, and CAI values were calculated for each group in order to analyze codon usage patterns in RVFVs. The %GC and %GC3 values showed the most significant differences between groups within the S (NP) segment (Table 3). The %GC3 value indicates the frequency of guanine (G) or cytosine (C) at the wobble site, which is the third position of a codon. The %GC3 values were greater than 50% for all groups in the S (NP) segment, but less than 50% in the other three segments. This result shows that, in the S (NP) segment, codons ending in G or C are more frequent than those ending in adenine (A) or thymine (T). In particular, the %GC3 values of the five groups in the S (NP) segment were 51.50-56.90%, showing a greater difference between groups than in other segments. The CDS region of the S (NP) segment encodes a nucleoprotein, and the nucleoprotein of RVFV is known to induce host immune responses. This finding suggests that differences in host immune responses to the virus and the varied outcome of viral infection for each group may be caused by the properties of the S (NP) segment. ENC analysis showed that RVFV has a high ENC value overall. Although the difference between groups was not great, the ENC value of the S (NP) segment was notably high (> 60), indicating that the CDS region of the S (NP) segment uses a greater variety of codons than other CDS regions. The CAI value is a measure of the similarity between the codon usage pattern of a given gene and that of a reference, here the host species. This study used the mosquito (Aedes aegypti), a representative vector of RVFV, and the infected host (Homo sapiens) as references for comparison of general codon usage patterns. As the CAI value approaches one, the codon usage pattern becomes more similar to that of the reference species. Overall, the CAI value with humans (Homo sapiens) as a reference was higher than that with mosquitoes (Aedes aegypti).
Remarkably, the CAI value of the S (NP) segment is highest in Group 5. In this study, all viral data for Group 5 were obtained from RVFVs collected in 2016. These viruses were isolated from new regions (Angola and China) where no previous cases of infection were reported, and the viral data used for analysis is the most recent data among the five groups. These results suggest that mutations in the S (NP) segment co-evolve with the hosts (mosquitoes and humans) and may allow the virus to expand its geographic range.
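To make the CAI comparison concrete, the following Python sketch computes a CAI value as the geometric mean of the relative adaptiveness of each codon in a gene, where relative adaptiveness is the reference count of a codon divided by the count of the most used synonymous codon in the reference. This is a simplified illustration and not the CAIcal implementation: the function names, the pseudocount of 0.5 for codons unseen in the reference, and the exclusion of Met, Trp and stop codons are assumptions of this example, and the reference counts in the usage note are invented toy numbers.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_families():
    """Sense-codon families with more than one member (Met and Trp excluded)."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    return [f for f in families.values() if len(f) > 1]


def relative_adaptiveness(ref_counts, pseudocount=0.5):
    """Weight of each codon relative to the most used synonym in the reference."""
    weights = {}
    for family in synonymous_families():
        if sum(ref_counts.get(c, 0) for c in family) == 0:
            continue  # amino acid absent from the reference: codons are not scored
        # Assumption: unseen codons get a small pseudocount so that log() is defined.
        adjusted = {c: max(ref_counts.get(c, 0), pseudocount) for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scorable codons in the sequence."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

With toy reference counts `{"CTG": 40, "TTA": 10, "AAA": 20, "AAG": 20}`, a gene using only CTG and AAA scores 1.0, whereas replacing a CTG with TTA lowers the score. Running the same gene against a human and a mosquito reference table would give the two CAI values compared in this section.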
In addition, ENC plots were generated for each CDS region in order to determine the degree of compositional constraints on codon usage bias in the RVFVs (Figure 2). The ENC plot shows ENC values against %GC3 as a scatter plot, together with the curve expected when codon usage is determined by GC3 composition alone, and is known to be an effective method for examining codon usage variations among genes. In the present study, ENC values plotted against %GC3 of the CDS regions in the L, M and S (NS) segments were distributed below the expected curve, showing that codon usage is biased. In contrast, for the S (NP) segment, the ENC values were distributed above the curve, indicating that codon usage is more variable.
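The reference curve in an ENC plot is usually taken from Wright (1990), who gave the ENC expected under GC3 composition alone as ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3 fraction. The Python sketch below evaluates this curve and reports on which side of it an observed point falls. The function names are hypothetical, and it is an assumption that Figure 2 was drawn with this exact formula.

```python
def expected_enc(gc3):
    """Expected ENC for a GC3 fraction in [0, 1] under Wright's (1990) null curve."""
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def side_of_curve(observed_enc, gc3_percent):
    """Classify an observed (ENC, %GC3) point relative to the expected curve."""
    expected = expected_enc(gc3_percent / 100.0)
    if observed_enc < expected:
        return "below"
    if observed_enc > expected:
        return "above"
    return "on"
```

The curve peaks at 60.5 when GC3 is 50% and falls to about 31 at the extremes, so points well below it indicate codon usage bias beyond what base composition alone would explain.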

Prevalence of preferred codons
RSCU analysis was performed to determine whether group-specific properties could be discriminated from differing codon preferences in each CDS region (Figure 3). In the L segment, the codons AGC (R) and AGG (R) showed relatively large differences in RSCU values compared to other codons and were found to be over-represented. Most other codons showed similar preferences, with the same over-represented codons (≥ 1.6) and under-represented codons (≤ 0.6) and no differences among groups. In the M segment, the codons AGC (R), AGG (R) and UCA (S) showed relatively large differences in RSCU values compared to other codons and were identified as over-represented codons. The RSCU values of the codons CGA (R), CGG (R) and GGG (G) in Groups 1 to 4 were 0.36-0.39, 0.56-0.6 and 1.8-1.91, while those in Group 5 were 0.64, 0.37, and 1.54, respectively, indicating large differences compared to other groups. In the S (NP) segment, UUA (L) was an under-represented codon except in Group 3 (0.72) and had the lowest representation in Group 5 (0.2), while CUG (L) was identified as an over-represented codon in Group 1 (1.88) and Group 5 (2.02). In Group 5, the most highly preferred codon was UCU (S), with an RSCU value ≥ 1.6 (1.62), while the RSCU value of the codon UCG (S) was 0.0, showing a different codon usage pattern from other groups. In Group 1, the RSCU values of the codons CAU (H) and CAC (H) were 1.51 and 0.49, respectively, indicating differences in codon preference from other [...] and 0.57, respectively, indicating a difference in codon preferences compared to other groups. RSCU analysis showed that the difference in codon preference between groups was more variable in the S segment than in the L and M segments.
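The classification applied in this section, with RSCU values of 1.6 or more treated as over-represented and 0.6 or less as under-represented, can be expressed as a small rule, and two groups can then be compared codon by codon. The following Python sketch is illustrative only: the function names are hypothetical, the label "neutral" for intermediate values is an assumption, and the example values are the S (NP) figures quoted above for UUA and CUG.

```python
def classify_rscu(value, over=1.6, under=0.6):
    """Label one RSCU value using the thresholds quoted in the text."""
    if value >= over:
        return "over-represented"
    if value <= under:
        return "under-represented"
    return "neutral"  # assumed label for values between the two thresholds


def differing_codons(group_a, group_b):
    """Codons present in both groups whose classification differs between them."""
    shared = sorted(set(group_a) & set(group_b))
    return {
        codon: (classify_rscu(group_a[codon]), classify_rscu(group_b[codon]))
        for codon in shared
        if classify_rscu(group_a[codon]) != classify_rscu(group_b[codon])
    }


# S (NP) segment values reported in the text for two codons.
group3 = {"UUA": 0.72}
group5 = {"UUA": 0.20, "CUG": 2.02}
```

Here `differing_codons(group3, group5)` flags UUA, which is neutral in Group 3 but under-represented in Group 5. CUG is not compared because no Group 3 value is quoted for it.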

Discussion
Various factors allow viruses to expand their range and rapidly evolve pathogenicity when adapting to new environments and hosts, including natural environmental factors and anthropogenic factors, such as climate change and the development of international trade and transportation.
Surveillance of the emergence of viruses is important, as an unexpected influx of new infectious agents into a new area can cause serious illnesses in unimmunized populations. RVFV infections are continuously being reported in new areas, and constant monitoring for the emergence of the virus is required. This study investigated whether the effects of RVFV on hosts differed among epidemic periods and whether evolutionary changes in viruses are involved in the expansion of the affected area. RVFVs were grouped based on the collection time and region through phylogenetic analysis. Based on the sample clusters, nucleotide composition and codon usage were analyzed. The nucleotide composition, overall GC content, and differences in GC content in the third codon position (%GC3) between groups were greatest in the S (NP) segment, confirming that more diverse codons were used there than in other segments. Remarkably, in CAI analysis of the S (NP) segment, Group 5 had the highest value, indicating that Group 5 viruses have the greatest similarity to the reference data in terms of codon usage patterns and expression levels, and that they are better adapted to human hosts compared with other groups. Group 5 consisted of the most recent viral samples among the five groups, and all Group 5 viruses were isolated from new regions (Angola and China) where no previous cases of infection had been reported. These results suggest that mutations in the S (NP) segment co-evolve with infected hosts, i.e., mosquitoes and humans, and may lead to expansion of the areas where viral infection occurs.
Due to the limitations of the published data, we could not analyze some recently isolated RVFV sequence data, other than data for Group 5 collected in Angola and China. Sufficient genetic data can reduce the bias that can occur during the analysis, so a future analysis using additional public data may provide important information to confirm the relationship between evolutionary variation in the patterns of RVFV and the incidence of infection. RVFVs have a relatively large number of conserved genomic regions, and infection has occurred mainly in limited areas due to the geographically limited habitat of their vectors. However, the results of this study showed distinct codon usage patterns in specific genomic regions and identified a group of RVFVs that might have an increased possibility of causing infections in new areas based on genetic mutations. The codon usage patterns of RVFVs demonstrated in the present study therefore suggest the need for continuous monitoring of RVFV infections to prevent an epidemic, particularly with regard to mechanisms of viral evolution and adaptation to new environmental conditions and to human hosts.