Could human coronavirus OC43 have co-evolved with early humans?

Abstract This paper reports on an investigation of the role of codon usage evolution on the suggested bovine-to-human spillover of Bovine coronavirus (BCoV), an enteric/respiratory virus of cattle, resulting in the emergence of the exclusively respiratory Human coronavirus OC43 (HCoV-OC43). Analyses based on full genomes of BCoV and HCoV-OC43 and on both human and bovine mRNAs sequences of cholecystokinin (CCK) and surfactant protein 1 A (SFTP1-A), representing the enteric and respiratory tract codon usage, respectively, have shown natural selection leading to optimization or deoptimization of viral codon usage to the human enteric and respiratory tracts depending on the virus genes under consideration. A higher correlation was found for the nucleotide distance at the 3rd nucleotide position of codons and codon usage optimization to the human respiratory tract when BCoV and HCoV-OC43 were compared. An MCC tree based on relative synonymous codon usage (RSCU) data integrating data from both viruses and hosts into a same analysis indicated three putative host/virus contact dates ranging from 1.54E8 to 2.44E5 years ago, suggesting that an ancestor coronavirus might have followed human evolution.

Betacoronaviruses have a history of spillover to humans leading to the emergence of pathogens, such as the Middle East Respiratory Syndrome Human Coronavirus (MERS-CoV) and the Severe Acute Respiratory Syndrome Human Coronavirus (HCoV-SARS) (Li et al., 2005;Gossner et al., 2016). Such a pathogen emergence is limited by ecological and genetic factors (Gandon et al., 2013), and codon usage, i.e., the deviation from the random use of different codons for the 2 to 6-fold degenerate codons (Hershberg and Petrov, 2009;Roth et al., 2012), is one genetic factor that might help to explain this process.
Codon usage evolution has a measurable role on the adaptation of viruses to hosts (Chantawannakul and Cutler, 2008) due to natural selection based on translation efficiency and also drift according to the genomic mutation pressure (Nei and Kumar, 2000;Hershberg and Petrov, 2009). Nonetheless, codon usage studies meet limitations on plausible indicators and dating methods to estimate the coevolution patterns after a virus meets a new host species. If the dating of a spillover event based solely on virus nucleotide sequence data would agree with codon usage dating, based on both virus and host data, is hitherto unknown.
The aim of this study was to analyze the BCoV/HCoV-OC43 spillover to humans based on codon usage data for codon selection regime, fitness and virus/host relationship dating estimates.

Sequences
Complete genome sequences were retrieved from GenBank for BCoV (strain BCoV R-AH187, EF424620.1), detected in 2000 in the USA , and HCoV-OC43 (strain 19572, AY903460.1), detected in 2004 in Belgium (Vijgen et al., 2005a). These two sequences were considered as representatives of the diversity of each virus, and the inclusion criteria were based on genome completeness and annotation.
The eight coronavirus genomes were split into each coding region/mRNA for the analyses. Nsps 1-16 sequences were checked based on nsps 3 and 5 cleavage sites (Ziebuhr and Snijder, 2007;Wojdyla et al., 2010).
As representatives of highly expressed, tissuespecific proteins for the respiratory and enteric tracts of H. sapiens sapiens and B. taurus taurus, complete mRNA sequences were retrieved from GenBank for the surfactant protein A1 SFTPA1 (NM_001077838.2 and NG_021189.1) and cholecystokinin CCK (NM_001046603.2 and NM_000729.4), respectively.

Codon adaptation index (CAI) limits for human coronaviruses and DCAI for HCoV-OC43 and BCoV
CAI is an indicator of translational fitness of an mRNA regarding a reference translational system, ranging from 0 (no fitness) to 1 (highest fitness) (Lee et al., 2010). To determine the lower and upper limits for HCoVs in the respiratory and enteric tracts of humans, the eight HCoV sequences had their CAIs calculated for each coding region/mRNA using human SFTPA1 and CCK sequences as references in CAI Calculator 2 (Wu et al., 2005) based on the equation by Sharp and Li (1987).
CAI differences (DCAI) were calculated as HCoV-OC43 CAI -BCoV CAI (calculated as mentioned above) for each coding region/mRNA regarding human respiratory and enteric tracts in order to access the codon optimization (DCAI>1) or deoptimization (DCAI<1) for the bovineto-human spill over.

Codon usage selection regimes
For each HCoV-OC43 and BCoV coding region/mRNA, the observed effective number of codons (Nc) and the frequency of G or C at the 3rd codon positions in synonymous codons (%GC3s) (Wright, 1990) was calcu-lated using ACUA 1.0 software (Vetrivel et al., 2007) and CAI Cal (Puigbo et al., 2008), and both indicators were plotted in the expected number of codons (ENC)/ expected %GC3 graph (Wright, 1990). Dots from observed values outside the expected values curve are an indication of natural selection, while those on the curve indicate drift/ mutation pressure.

Viruses/hosts codon usage co-evolution analysis
For each HCoV-OC43 and BCoV coding region/mRNA and human and bovine CCK and SFTPA1, the values of RSCU (relative synonymous codon usage) were estimated for the 59 nonstop degenerate codons using Mega 7 software (Kumar et al., 2016). Codons with RSCU <1 are considered non-preferred, and those with RSCU >1 are preferred, while an RSCU=1 indicates a neutral codon (Su et al., 2009).
Next, continuous RSCU values were assigned the binary values 0 (RSCU£1) and 1 (RSCU>1), and data from both hosts and both HCoV-OC43 and BCoV assembled into a single alignment were used to build an MCMC MCC tree with the simple model. This included estimated frequencies, burn in=10% states, uncorrelated exponential relaxed clock (which showed a lower standard deviation when compared to lognormal clock) and constant population size (due to the lack of consensus priors for an exponential growth coalescent analysis for H. sapiens sapiens, B. taurus taurus and coronaviruses) and was built using Beast v. 1.8.3 (Drummond and Rambaut, 2007;Drummond et al., 2012).
The RSCU binary distance between human and bovine CCK and SFTPA1 was calculated as the total difference for each of these two datasets and used as a measure of codon usage distance for the enteric and respiratory tracts, respectively, for these two host species.

Results
Codon adaptation index (CAI) limits for human coronaviruses and DCAI for HCoV-OC43 and BCoV CAI upper and lower limits for the seven human coronaviruses included in this study in human respiratory and enteric tracts were 0.244-0.611 (corresponding to HCoV-SARS nsp11 and nsp10, respectively) and 0.244-0.472 (corresponding to HCoV-SARS ORF7b and nsp11, respectively).
A DCAI=0 was found for nsp7 on both respiratory and enteric human tracts and for nsp15 on the respiratory tract. DCAI values for each coding region/mRNA of HCoV-OC43 on the human enteric and respiratory tracts are represented in Figure 1.
For both BCoV and HCoV-OC43 nsp7, the lowest CAI distance (-0.039) was found for both the human and respiratory and enteric tracts regarding the lower CAI limit calculated for all seven human coronaviruses, while the highest CAI distances for the lower human coronaviruses CAI was found for BCoV and HCoV-OC43 N for both the human and respiratory and enteric tracts (-0.282 and -0.302, respectively) and BCoV nsp15 (-0,282) for the human respiratory tract.
Correlation analysis of DCAI and nucleotide identities amongst the 23 BCoV and HCoV-OC43 homologous coding regions/ mRNAs based on 1st, 2nd and 3rd and on the 3rd nucleotide position only showed the highest r 2 (correlation coefficient) value (0.27) for the 3rd nucleotide position regarding the human respiratory tract, while r 2 values for DCAI and 1st, 2nd and 3rd regarding the human enteric and respiratory tracts were both 0.05 and, regarding the 3rd positions only and the human enteric tract, 0.07.

Codon usage selection regimes
All Nc x %GC3s plots were found either above or below the ENC x %GC3 expected curve for all HCoV-OC43 and BCoV coding regions/mRNAs and for human and bovine CCK and SFTPA1 (Figure 2), an indication that codon usage in these cases was ruled by natural selection.
In Figure 2, the two closest dots to bovine and human SFTPA1 dots represent the internal I protein of BCoV (upper) and HCoV-OC43 (lower), while the two dots at the bottom of the graph refer to BCoV and HCoV-OC43 nsp11.

Viruses/hosts codon usage co-evolution analysis
All 95% HPDs (Highest Posterior Densities) are presented in years. In the MCC tree shown in Figure 3, the first split event (node A, 95% HPD 2.44E5-1.54E8) resulted in two major clusters, the largest one containing all HCoV-OC43 and BCoV coding regions/ mRNAs data except for I protein and a minor cluster containing both human and bovine CCK and SFTPA1 and HCoV-OC43 and BCoV I protein. For this minor cluster containing both hosts and coronaviruses codon usage statuses, a second split was found (node B, 95% HPD 2.07E5-1.55E8), resulting in a cluster with SFTPA1 only and another cluster with CCK and HCoV-OC43 and BCoV I, and for this last one a third split event (node C, 95%HPD 2.04E5-3.54E6) led to CCK and HCoV-OC43/ BCoV I exclusive clusters. The RSCU distance of human and bovine CCK and SFTPA1 mRNAs were 0.136 and 0.221, respectively.

694
Human coronavirus evolution

Discussion
Codon usage optimization and deoptimization based on DCAI values for ORF1 nsps, observed for both the human enteric and respiratory tracts, might be a consequence of a balance between synthesis efficiency and fine-tuning codon usage adaptation to the new host codon usage after a bovine-to-human coronavirus spillover. Though these proteins are coded in the same ORF, the distinct roles they play during RNA replication and sgmRNAs transcription might demand not only different synthesis efficiencies but also, in some cases, compensatory or concerted codon usage evolution, as in the case of the proteases PLpro and 3C-like in nsps 3 and 5, respectively, which can process ORF1 poly-Brandão 695 protein and release from it all subunits (Ziebuhr and Snijder, 2007). The analysis of coronaviruses non-structural proteins of the replicase class allows deep phylogenies to be estimated (Snijder et al., 2003) and results, thus, in a more representative range of evolutionary data to assess ancient virus/hosts relationship when associated to structural proteins data as herein.
Regarding the structural proteins, the different degrees of optimization and deoptimization found based on human enteric and respiratory tracts, besides the translation efficiency, might also be due to both an immune escape efficiency, as in the case of HE and S, as a lower CAI might lead to lower protein synthesis and consequently lower exposure to the immune system (Bahir et al., 2009) and a fine-tuning codon adaptation leading to a more efficient receptor binding to the human tissues due to the primary S and accessory HE roles on this function (Popova and Zhang, 2002).
As seen in the MCC tree (Figure 3), the first split (Node A) of hosts (H. sapiens sapiens/B. taurus taurus) and HCoV-OC43/BCoV showed a 95% HPD from 1.54E8 to 2.44E5 years ago, ranging from the Kimmeridgian age of the Late Jurassic to the Middle Pleistocene.
Taking node A as a first split and thus as a consequence of a first contact between the codon usage of an ancestor coronavirus with the codon usage of an ancestor host, the lower limit (1.54E8 ya) brings the ancestor coronavirus codon usage status to an age compatible with the proposed ancient origin of coronaviruses as being 2.93E8 y.a. (Wertheim et al., 2013), while the upper limit (2.44E5) is related to a time compatible with early humans, in agreement with the suggested interspecies transmission of a betacoronavirus prior to the HCoV-OC43/BCoV split (Vijgen et al., 2006). Such a large time span might be due to the lack of data from hosts and coronaviruses in between these upper and lower limits, but it places an ancestor betacoronavirus as coevolving with a diversity of dinosaurs (Langer et al., 2010) in the Late Jurassic and reaching early humans with until unknown intermediate hosts during this large time span. It is worthy of note that this time span overlaps with the one found for node B (95% HPD 2.07E5-1.55E8), meaning that the first ancestor host/ ancestor betacoronavirus contact might have been stable for circa 150 million years before reaching early humans.
As for node C, the 95% HPD 2.04E5-3.54E6 embraces human evolution from Australopithecus spp to H. sapiens sapiens (McHenry, 1994), what could finally represent the first sign of BCoV spillover from an ancestor ruminant host to the human lineage after a first contact with the respiratory tract (represented by SFTPA1 in Figure 3).
The discrepancy of HPDs values, when compared to previous dates on the HCoV-OC43/BCoV split and the emergence of all coronaviruses, might be a consequence of both the use of full genomes data and the selection unit used in this survey, i.e., codon usage, instead of subgenomic data based on nucleotide evolution as proposed by others (Vijgen et al., 2005b;Vijaykrishna et al., 2007;Munir and Cortey, 2015).
All coding regions/mRNAs from an ancestor coronavirus (except for HCoV-OC43 nsp7 in both respiratory and enteric human tracts, and for nsp15 on the human respiratory tract, DCAIs=0) experienced optimization or deoptimization, as suggested in Figure 1, probably after Node A ( Figure 3). This process of codon usage evolution resulted in CAIs approaching the CAI limits for human coronaviruses as calculated herein (0.22-0.611 for the respiratory and 0.244-0.472 for the enteric tract) during codon usage evolution by natural selection, as shown in the Nc x %GC3s analysis ( Figure 2). The association of data on fluctuations in codon usage optimization with analysis of the selection regime and a temporal analysis, both based on codon usage, as used in this investigation, might be of value for a deeper understanding of tempo and modes of viruses and hosts coevolution.
Having crossed the longer codon usage distance from the bovine to human respiratory tract (0.221) when compared to the enteric tract (0.136), HCoV-OC43 became a highly respiratory-specialized virus with high fitness to this new replication site and predating the proposed event around the year 1890 (Vijgen et al., 2005b;Bidokhti et al., 2013).
Though nsp14 is a coronavirus 3'-5' exonuclease (Denison et al., 2011), whose proofreading activity lowers the mutation rate of these viruses when compared to other RNA viruses, the mutant spectrum phenomenon is well documented in HCoV-OC43 and BCoV (Vabret et al., 2006;Borucki et al., 2013), and as a result, a plethora of synonymous mutations that power codon usage diversity is available for the optimization or deoptimization of codon usage in different genes via natural selection or drift as well.
An in important limitation to these arguments is that codon usage studies only allow speculations after virus attachment and entry, two processes intimately related to membrane receptor specificities that cannot be assessed in organisms for which at least gene data are not available. Also, the full set of interspecies jumps for the HCoV-OC43 ancestors has not been assessed here, as the focus was the proposed recent bovine-to-human spillover (Vijgen et al., 2005b), and this might have limited the detection of further nodes of codon usage status split with coronaviruses and different hosts.
As a conclusion, via codon usage through natural selection resulting in immune escape balanced with protein synthesis efficiency, an ancestor coronavirus might have followed human evolution with no codon usage barrier fitness deep in the human lineage. 696 Human coronavirus evolution