Decoding the codon usage patterns in Y-domain region of hepatitis E viruses

Background: Hepatitis E virus (HEV) is a positive-sense RNA virus belonging to the family Hepeviridae. The genome of HEV is organized into three open reading frames (ORFs): ORF1, ORF2, and ORF3. The ORF1 non-structural Y-domain region (YDR) has been demonstrated to play an important role in HEV pathogenesis. The nucleotide composition and synonymous codon usage bias of the YDR genes of HEV, together with the other factors influencing them, have not been studied. Codon usage represents a significant mechanism in establishing the host-pathogen relationship. The present study elucidates, for the first time, the detailed codon usage patterns of the YDR among HEV and HEV-hosts (human, rabbit, mongoose, pig, wild boar, camel, and monkey).
Results: The overall nucleotide composition revealed an abundance of C and U nucleotides in YDR genomes. The relative synonymous codon usage (RSCU) analysis indicated a bias towards C- and U-ended codons over A- and G-ended codons in HEV across all hosts. Comparative codon frequency analyses among HEV-hosts showed both similarities and discrepancies in the usage of preferred codons encoding amino acids, which revealed that HEV codon preference neither completely differed from nor completely matched that of its hosts. Thus, our results indicated that the synonymous codon usage of HEV is a mixture of two types of codon usage: coincidence and antagonism. Mutation pressure from the virus and natural selection from the host seem to be accountable for shaping the codon usage patterns in YDR. The study emphasised that the influence of compositional constraints, codon usage bias, and mutational alongside selective forces was reflected in the YDR codon usage patterns.
Conclusions: Our study is the first of its kind to report an analysis of codon usage patterns across a total of seven different natural HEV hosts. The knowledge of preferred codons obtained from our study will not only augment our understanding of molecular evolution but is also envisaged to provide insight into efficient viral expression, viral adaptation, and host effects on the HEV YDR codon usage.
Supplementary Information: The online version contains supplementary material available at 10.1186/s43141-022-00319-2.


Background
Hepatitis E virus (HEV) is the cause of both epidemic and sporadic hepatitis cases in humans [1,2]. HEV is a positive-sense, single-stranded RNA virus belonging to the family Hepeviridae. The 7.2 kb genome of HEV, with short 5′ and 3′ non-coding regions (NCR), consists of three partially overlapping open reading frames (ORFs) [3]. The 5′-most ORF (ORF1) encodes the non-structural polyprotein, which is organized into seven functional domains including the Y-domain region (YDR) [4,5]; the 3′-most ORF (ORF2) encodes the viral capsid protein [6,7]; and ORF3 encodes the phosphoprotein responsible for viral regulation [8][9][10]. Critical residues of the non-structural ORF1 YDR have been demonstrated to play a critical role in the HEV life cycle [11].
Previous reports on codon usage have identified various factors governing codon usage patterns, including mutational pressure, translational selection, G + C content, secondary structure of the protein, selective transcription, replication, hydrophilicity and hydrophobicity of the protein, and the external environment [28][34][35][36]. Among these, compositional constraints under natural selection and mutational pressure are the two major paradigms in shaping the codon usage patterns in organisms [37][38][39]. However, in viruses, mutational pressure rather than natural selection is found to be the major factor influencing codon usage variation [40][41][42][43].
As the indispensability of the YDR in HEV pathogenesis has been demonstrated [11], it is important to determine the distinctive genetic features that are prevalent in its sequences. Using an interdisciplinary systems biology approach, we attempted to explain the codon usage bias of HEV-hosts in conjunction with the evolutionary forces (compositional, mutational, selection) accountable for shaping the YDR codon usage patterns. The present study is the first of its kind to report a detailed codon usage analysis of the HEV YDR across a total of 7 hosts. The knowledge obtained from the present study will not only augment our understanding of molecular evolution but is also envisaged to provide insight into efficient viral expression, viral adaptation, and host effects on HEV [31,44].

Heat map construction
The heat map was constructed using the online software tool Morpheus (https://software.broadinstitute.org/morpheus/documentation.html). The heat map is one of the most commonly used visualizations in science because it reveals patterns in data, compacts a large amount of information into a small space, and is a natural representation of a matrix.

Sequence data acquisition
The YDR sequences were retrieved from the National Center for Biotechnology Information (NCBI). The retrieved sequences were selected based on the following inclusion criteria: (a) the strain with GenBank accession number NC_001434.1 was used as the reference strain; (b) sequences were included from different hosts encompassing human, rabbit, pig, mongoose, wild boar, camel, and monkey; (c) sequences from the same or different regions at varying time intervals were considered to avoid repetition in the analysis; and (d) the sampling dates of the sequences were clearly stated. Sequences retrieved from NCBI were edited using the BioEdit v.7.2 sequence analysis software (http://bioedit.software.informer.com/7.2/). The sequences were further manually edited to exclude ambiguous portions and to obtain the non-structural ORF1 gene product YDR before proceeding to the final alignment. Multiple alignments of the YDR sequence datasets were carried out using the Clustal X2 algorithm (http://www.clustal.org/clustal2/) [17]. The complete lists of the sequences used for the various host organisms are provided as additional files in the supplementary information (Additional file 1: S1 Table, Additional file 2: S2 Table, Additional file 3: S3 Table, Additional file 4: S4 Table, Additional file 5: S5 Table, Additional file 6: S6 Table, Additional file 7: S7 Table, Additional file 8: S8 Table).

Nucleotide composition analysis
The nucleotide composition of the YDR was calculated using MegaX software. The overall nucleotide frequencies (A%, C%, T/U%, and G%), the nucleotide frequencies at the third codon position (A3%, C3%, U3%, and G3%), and the G+C frequencies at different codon positions were determined. The AUG and UGG codons were not considered for the analysis as they do not exhibit codon usage bias. The termination codons (UAG, UGA, UAA) were also excluded from the analysis since they do not encode any amino acid.
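The quantities described above can be illustrated with a minimal Python sketch. This is not the MegaX implementation; the function name `nucleotide_composition`, the toy sequence, and the choice to apply the AUG/UGG/stop-codon exclusion to both the overall and the third-position counts are assumptions made for illustration only.

```python
from collections import Counter

# Codons excluded from the analysis as described in the text:
# AUG (Met), UGG (Trp), and the three termination codons.
EXCLUDED = {"AUG", "UGG", "UAA", "UAG", "UGA"}


def nucleotide_composition(seq):
    """Return (overall, third-position) base percentages for a coding sequence.

    Assumption: the sequence is in frame, starting at the first codon position.
    """
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    codons = [c for c in codons if c not in EXCLUDED]

    all_bases = Counter("".join(codons))      # every base of retained codons
    third_bases = Counter(c[2] for c in codons)

    def percent(counts):
        total = sum(counts.values())
        return {b: 100.0 * counts[b] / total for b in "ACGU"}

    overall = percent(all_bases)
    third = percent(third_bases)
    overall["GC"] = overall["G"] + overall["C"]
    third["GC3"] = third["G"] + third["C"]
    return overall, third


# Toy (hypothetical) sequence, not taken from the study data.
overall, third = nucleotide_composition("ATGGCCTTTGACCGTTAA")
print(overall)
print(third)
```

In this toy input the retained codons are GCC, UUU, GAC, and CGU, so the third-position counts come only from those four codons.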

Relative synonymous codon usage (RSCU) analysis
The RSCU value of a codon is the ratio between its observed usage frequency and the frequency expected if all synonymous codons for that amino acid were used equally [18]. The RSCU index was determined as follows:
$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \times n_i$$
where RSCU is the relative synonymous codon usage value and $g_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_i$ types of synonymous codons. The RSCU values of the YDR were calculated using MegaX to determine the codon usage characteristics without the effect of amino acid composition and coding sequence length. Codons with RSCU values > 1.6 and < 0.6 were considered "over-represented" and "under-represented" codons, respectively, whereas codons with an RSCU value of 1 were regarded as not biased (average-level codons). Moreover, less-abundant (RSCU < 1) and more-abundant (RSCU > 1) codons were also determined.
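As a rough illustration of the RSCU definition and the thresholds above, the following Python sketch computes RSCU values from a single in-frame coding sequence using the standard genetic code. It is not the MegaX procedure used in the study; the names `rscu` and `classify`, the label "intermediate", and the toy sequence are assumptions introduced here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order ("*" marks stops).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for every synonymous family observed in seq."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        # Skip stop codons and single-codon amino acids (Met: AUG, Trp: UGG).
        if aa == "*" or len(codons) == 1:
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequence
        for c in codons:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(codons) / total
    return values


def classify(value):
    """Apply the 1.6 / 0.6 thresholds stated in the text."""
    if value > 1.6:
        return "over-represented"
    if value < 0.6:
        return "under-represented"
    return "intermediate"


# Toy (hypothetical) sequence: three UUU codons and one UUC codon.
values = rscu("UUUUUUUUUUUC")
print(values["UUU"], classify(values["UUU"]))
print(values["UUC"], classify(values["UUC"]))
```

With two synonymous codons for Phe and four observations in total, the expected count per codon is 2, so UUU (observed 3) and UUC (observed 1) fall on opposite sides of 1.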

Relationship between overall nucleotide composition and nucleotide composition at the 3rd codon position
The correlations between the overall nucleotide contents (A, T, G, C, GC) and their counterparts at the 3rd codon position (A3, T3, G3, C3, GC3) were assessed. This was carried out to analyze whether natural selection and mutation pressure contributed individually, or whether both jointly influenced the evolution of YDR in HEVs.
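The text does not state which correlation statistic was used. As one possible reading, the sketch below computes a Pearson correlation coefficient between per-sequence overall content and third-position content. The function name `pearson` and the numeric values are hypothetical placeholders, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sequence values (percentages), for illustration only:
# overall GC content and GC content at the third codon position.
gc_overall = [52.1, 53.4, 54.0, 54.9]
gc_third = [55.0, 57.2, 58.1, 60.3]

r = pearson(gc_overall, gc_third)
print(f"GC vs GC3 correlation: {r:.3f}")
```

The same function could be applied to each pair (A vs A3, U vs U3, and so on) once per-sequence composition values are available.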

Compositional features of YDR
The nucleotide composition values for YDR were calculated to analyze the effect of compositional constraints on codon usage (Table 1) (Fig. 1).

HEV
The nucleotide composition trend was in order C > U > G > A, with an average of 30.169%, 26.631%, 24.357%, and 18.841%, respectively. Synonymous codons at the third position followed the trend C3S > U3S > G3S > A3S. The overall GC content was higher than that of AU, with 54.526% observed, compared with 45.472%, respectively, which indicates a GC-biased composition (Additional file 1: S1 Table).

Human
The nucleotide composition trend was in order C > U > G > A, with an average of 28.022%, 27.654%, 25.003%, and 19.319%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 53.025% observed, compared with 46.973%, respectively, which indicates a GC-biased composition (Additional file 2: S2 Table).

Rabbit
The nucleotide composition trend was in order C > U > G > A, with an average of 29.816%, 27.777%, 24.277%, and 18.127%, respectively. Synonymous codons at the third position followed the trend C3S > U3S > G3S > A3S. The overall GC content was higher than that of AU, with 54.093% observed, compared with 45.904%, respectively, which indicates a GC-biased composition (Additional file 3: S3 Table).

Mongoose
The nucleotide composition trend was in order C > U > G > A, with an average of 28.287%, 27.777%, 25.229%, and 18.705%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 53.516% observed, compared with 46.482%, respectively, which indicates a GC-biased composition (Additional file 4: S4 Table).

Pig
The nucleotide composition trend was in order C > U > G > A, with an average of 28.048%, 27.485%, 24.933%, and 19.532%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 52.981% observed, compared with 47.017%, respectively, which indicates a GC-biased composition (Additional file 5: S5 Table).

Wild boar
The nucleotide composition trend in HEV was in order C > U > G > A, with an average of 28.391%, 27.014%, 25.485%, and 19.108%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 53.876% observed, compared with 46.122%, respectively, which indicates a GC-biased composition (Additional file 6: S6 Table).

Camel
The nucleotide composition trend in HEV was in order U > C > G > A, with an average of 28.671%, 27.662%, 24.755%, and 18.910%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 52.417% observed, compared with 47.581%, respectively, which indicates a GC-biased composition (Additional file 7: S7 Table).

Monkey
The nucleotide composition trend in HEV was in order U > C > G > A, with an average of 29.510%, 28.287%, 23.241%, and 18.960%, respectively. Synonymous codons at the third position followed the trend U3S > C3S > G3S > A3S. The overall GC content was higher than that of AU, with 51.528% observed, compared with 48.47%, respectively, which indicates a GC-biased composition (Additional file 8: S8 Table).
Thus, the initial compositional findings revealed that the YDR was rich in C and U nucleotides, and that the least used nucleotide in the YDR was A. Moreover, the GC content was higher than the AU content (which was below 50%) in the YDR across all hosts.

Patterns of codon usage in YDR
RSCU analysis was performed to assess the codon usage patterns and preferences for synonymous codons in the YDR. The RSCU values were computed for every codon in each gene sequence to determine the extent to which C/U-ended codons were preferred. The results are presented in Table 2 and Fig. 2.

Monkey
Among the 27 preferred codons, 18 were U/C-ending (U-ending: 10; C-ending: 8) and 9 were G/A-ending (G-ending: 5; A-ending: 4).
In line with the compositional analysis, the RSCU analysis confirmed the codon bias towards U- and C-ended codons. The RSCU pattern indicated that the selection of preferred codons showed common attributes as well as differences among HEV and HEV-hosts (Table 2). Some codons showed similar preference among HEV and HEV-hosts, while for other codons the preference of HEV differed from that of its hosts, or vice versa. Thus, the codon most commonly shared among HEV and HEV-hosts was considered the most preferred codon for a particular amino acid. Because the optimal codon selection in viruses largely depends on their hosts, we next compared the codon usage frequency of HEV with that of its hosts by correlating their RSCU patterns.

Relationship among HEV-hosts by comparing codon usage frequency
Because each amino acid tends to be encoded by a preferred codon, the usage of synonymous codons is not random. Thus, to analyze the relationship among HEV and its hosts, we calculated the frequency of the preferred codons for each amino acid using the RSCU analysis (Additional file 9: S9 Table, Additional file 10: S10 Table, Additional file 11: S11 Table, Additional file 12: S12 Table, Additional file 13: S13 Table, Additional file 14: S14 Table, Additional file 15: S15 Table, and Additional file 16: S16 Table). This was done to understand the influence of selection pressure from hosts on the codon usage patterns of HEV. A list of preferred codons, i.e., those encoding an amino acid with higher frequency than the other synonymous codons, was computed for HEV and all the hosts and compared in Table 3. Four amino acids (Phe, His, Gln, and Glu) showed the same preferred codons (UUU, CAU, CAG, and GAG, respectively) among HEV and its hosts, which provides evidence of mutual codon preference. A few amino acids, however, showed differences in their preferred codons. HEV and most HEV-hosts (human, rabbit, mongoose, pig, wild boar, camel) shared the preferred codons GUC, UGC, and CGU for encoding the amino acids Val, Cys, and Arg, respectively, except for monkey, which used a different set of preferred codons (GUU, UGU, and CGC). This phenomenon was also observed in other hosts, i.e., the preferred codon for an amino acid in a specific host differed from that in the other HEV-hosts and HEV. Firstly, HEV and HEV-hosts (human, mongoose, pig, wild boar, camel, and monkey) shared CCU as the preferred codon for Pro, except for rabbit, which preferred CCC over CCU. Secondly, HEV and HEV-hosts (human, mongoose, rabbit, pig, camel, and monkey) shared AAC as the preferred codon for Asn, except for wild boar, which preferred AAU over AAC. Thirdly, HEV and HEV-hosts (human, mongoose, rabbit, pig, wild boar, and monkey) shared GGC as the preferred codon for Gly, except for camel, which preferred GGU over GGC (Table 3).
In detail, among the 18 preferred codons in HEV, 13 were common between HEV and human; 11 between HEV and rabbit; 15 between HEV and mongoose; 13 between HEV and pig; 13 between HEV and wild boar; 12 between HEV and camel; and 8 between HEV and monkey (Table 3). These codons, shared between the virus and the respective host, represent the coincident portion of codon usage. However, discrepancies were also observed within the preferred codons between HEV and its hosts, i.e., dissimilar usage of preferred codons. Thus, the ratio of coincident/antagonist preferred codons was 13/5 between HEV and human; 11/7 between HEV and rabbit; 15/3 between HEV and mongoose; 13/5 between HEV and pig; 13/5 between HEV and wild boar; 12/6 between HEV and camel; and 8/10 between HEV and monkey. Thus, the codon usage pattern of the HEV YDR is a mix of coincidence and antagonism with respect to its hosts.
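The coincident/antagonist bookkeeping can be sketched in a few lines of Python. The shared-codon counts and the example codons for Phe, His, Val, Cys, and Arg are taken from the text; the function name `compare_preferred` and the dictionary layout are assumptions introduced for illustration, and the five-amino-acid dictionaries are only a subset of Table 3.

```python
# Number of preferred codons shared with HEV, as reported in the text.
TOTAL_PREFERRED = 18
shared_with_hev = {
    "human": 13, "rabbit": 11, "mongoose": 15, "pig": 13,
    "wild boar": 13, "camel": 12, "monkey": 8,
}

# Coincident / antagonist counts per host: antagonist = total - coincident.
ratios = {host: (n, TOTAL_PREFERRED - n) for host, n in shared_with_hev.items()}
for host, (coincident, antagonist) in ratios.items():
    print(f"HEV vs {host}: {coincident}/{antagonist}")


def compare_preferred(virus, host):
    """Split amino acids into those with the same or a different preferred codon."""
    coincident = sorted(aa for aa in virus if host.get(aa) == virus[aa])
    antagonist = sorted(aa for aa in virus if host.get(aa) != virus[aa])
    return coincident, antagonist


# Partial preferred-codon maps built from the examples given in the text.
hev = {"Phe": "UUU", "His": "CAU", "Val": "GUC", "Cys": "UGC", "Arg": "CGU"}
monkey = {"Phe": "UUU", "His": "CAU", "Val": "GUU", "Cys": "UGU", "Arg": "CGC"}

same, different = compare_preferred(hev, monkey)
print("coincident:", same)
print("antagonist:", different)
```

Applied to complete preferred-codon maps for all 18 amino acids, the same comparison would reproduce the per-host ratios listed above.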
Thus, for a particular amino acid, if a preferred codon in HEV showed similarity with its host cell, this phenomenon is termed as "mutual codon preference of host-pathogens". This implies that similar codon usage pattern among HEV and HEV-hosts could help the virus to synthesize the amino acid and corresponding proteins in a more efficient manner, thus helping the pathogen to thrive in its host cells. On the contrary, the difference in preferred codon among HEV and HEV-hosts suggests lack of shared codon preference, causing reduction in the translation efficiency of the corresponding amino acids.
A heat map was constructed using the RSCU values of various HEV strains and their hosts (Fig. 3), which revealed that HEV codon preference neither completely differed from nor completely matched that of its hosts, indicating a mixture of similar and dissimilar codon preferences. Moreover, the five most and five least frequently used codons were also identified, which showed common attributes and differences in the codon usage patterns of HEV isolates (Table 4).
Thus, our results clearly indicated that the synonymous codon usage of HEV is a mixture of the two types of codon usage: "coincidence" and "antagonism."

Effect of natural selection in shaping codon usage patterns
It has been suggested that the frequencies of nucleotides A and U/T should be equal to those of C and G at the third position of the codon if mutational pressure affects the synonymous codon usage bias [28]. However, large variations were noted in the nucleotide base composition for all the hosts, signifying that synonymous codon usage bias could be influenced mainly by natural selection (Table 1). From these findings, it was clear that compositional constraints under mutation pressure combined with natural selection shaped the HEV YDR across all its hosts.

Discussion
Inspection of the factors governing protein evolution is essential for various research fields, including comparative genomics, molecular evolution, and structural biology. In this study, we carried out a systematic survey of the evolutionary pressures (i.e., mutational bias and natural selection) across the YDR to gain insights into its functional implications in HEV regulation as well as adaptive evolution. Jenkins and Holmes (2003) reported that codon usage bias can be influenced by the overall nucleotide composition pattern [37]. Thus, we initially computed the nucleotide frequencies of the YDR from HEV and its hosts. The HEV YDR revealed an over-representation of C, with an overall C/U bias in the nucleotide composition. In HEV, the percentage of C was the highest, followed by U and G, with A having the lowest value (except in the camel and monkey hosts, which followed the trend U > C > G > A). This revealed an unequal distribution of A, U, G, and C nucleotides among the YDR codons. Additionally, in HEV and rabbit, the nucleotide values at the third codon position followed the same trend, i.e., C3 had the highest value, followed by U3, G3, and A3 with the lowest value (while the other hosts followed the trend U3 > C3 > G3 > A3). Therefore, it could be interpreted that the initial nucleotide compositional patterns showed more preference towards C- and U-ended codons, followed by G/A-ended codons. This is consistent with a recent investigation that reported a U/C-rich genome in ORF1 of HEV [45]. However, the overall C/U-rich pattern in the nucleotide content of the YDR is opposite to the pattern observed in other RNA viruses (HIV, hepatitis C, rubella viruses), which showed a prevalence of A/C-rich genomes [46]. Thus, it could be interpreted that this bias in the YDR was due to the adaptation of the common ancestor of modern HEV strains to the nucleotide composition requirements of the host during its evolution [47].
It has been suggested that, particularly in viruses, AU- or GC-rich genomes tend to correlate with the RSCU patterns. For instance, AU- or GC-rich compositions prefer codons ending with either A and U or G and C, respectively. These trends, when observed, support the influence of mutational pressure [37]. The RSCU analysis revealed that HEV had a comparatively higher codon usage bias towards U- and C-ended codons. The overall RSCU patterns can potentially hide host-specific patterns, so we next calculated the RSCU values for specific hosts and performed a comparative analysis among HEV and its hosts by correlating their RSCU patterns. The host-specific codon usage patterns also showed preferred codons ending with U and C. Thus, in line with the nucleotide composition analysis, the RSCU analysis further confirmed the codon bias towards U- and C-ended codons. It could therefore be interpreted that mutational bias was a major force determining the codon usage patterns of the YDR, which suggests that compositional constraints influenced the selection of preferred codons. However, it is interesting to note that although HEV and its hosts had a higher percentage of GC than AU, the RSCU analysis revealed a bias towards U-terminated codons. This suggests that other factors, in combination with mutation pressure, also acted during HEV evolution. Therefore, selection pressure from hosts contributed to shaping the molecular evolution of HEV at the level of codon usage.
Table 3 Preferred codons for each amino acid in the YDR of HEV and its hosts. Comparison of codon usage frequency of preferred codons among HEV and its hosts. All the preferred codons are highlighted, indicating the highest codon frequency. Thus, the codon usage pattern of YDR was a mix of coincidence and antagonism with respect to its hosts.
The agreement between the codon usage of a viral genome and the codon preferences of its host is an important aspect that determines the evolutionary adaptation of the virus to its host cell. The alteration of codon usage in viral genomes in response to the information obtained from host genes regulates the virus-host interactions [48]. As viruses are obligate parasites, their optimal codon selection is largely dependent on the translational machinery of their host cells [49]. A noteworthy variation was observed in the usage of the preferred codons among HEV and HEV-hosts. This implied that the codon usage patterns of HEV, as well as the possible fitness of HEV to adapt within its dynamic host range, were largely influenced by the selection pressures exerted by HEV-hosts.
In this study, it was observed that, unlike other viruses that have evolved codon usage patterns either completely identical or completely opposite to those of their hosts [50,51], HEV showed a mixture of the two codon usage patterns. Our results revealed that none of the hosts showed complete resemblance or complete discrepancy to HEV. The ratios of common/uncommon preferred codons between HEV-Human, HEV-Rabbit, HEV-Mongoose, HEV-Pig, HEV-Wild boar, HEV-Camel, and HEV-Monkey were 13/5, 11/7, 15/3, 13/5, 13/5, 12/6, and 8/10, respectively. Thus, the codon usage pattern of the HEV YDR showed a mixture of coincidence and antagonism with respect to its hosts. The resemblance in synonymous codon patterns among HEV and its hosts implied that HEV could adapt to its host cells, resulting in its multiplication. This phenomenon suggests that the virus can replicate in host cells due to similarity in usage for preferred
Table 4 Most frequent and least used codons among HEV and its natural hosts. Codon frequency is given in parentheses following the relative synonymous codon usage. Top 5 most frequently used codons: HEV: ACC (8.8), GAG (7.9), UAC (6.5), GUC (6.4), GGC (6.3),