Inadvertent nucleotide sequence alterations during mutagenesis: highlighting the vulnerabilities in mouse transgenic technology

Over the last three decades, researchers have used genome engineering to alter the DNA sequence in the living cells of a wide range of organisms, from plants, fish, and mice to humans. This has conventionally been achieved through methods such as single-nucleotide insertion/deletion in coding sequences, deletion of exon(s), mutation of the promoter region, introduction of a stop codon for protein truncation, and addition of foreign DNA for functional elucidation of genes. However, recent years have witnessed the advent of novel techniques based on programmable site-specific nucleases such as CRISPR/Cas9, TALENs, and ZFNs, which complement tools such as the Cre/loxP system and gene trapping. These have revolutionized the field of experimental transgenesis and contributed to the existing knowledge base of classical genetics and gene mapping. Yet there are certain experimental/technological barriers that we have been unable to cross while creating genetically modified organisms. Firstly, while interfering with coding strands, we inadvertently change introns, antisense strands, and other non-coding elements of the gene and genome that play integral roles in determining cellular phenotype. These unintended modifications are critical because introns and other non-coding elements, although traditionally regarded as “junk DNA,” have been found to play a major regulatory role in the genetic pathways of several crucial cellular processes, development, homeostasis, and disease. Secondly, after insertion of a transgene, non-coding RNAs are generated by the host organism against the inserted foreign DNA, or from the inserted transgene/construct against host genes. The potential contribution of these non-coding RNAs to the resulting phenotype has not been considered. We aim to draw attention to these inherent flaws in the transgenic technology being employed to generate mutant mice and other model organisms.
By overlooking these aspects of the whole gene and genetic makeup, perhaps our current understanding of gene function remains incomplete. Thus, it becomes important that, while using genetic engineering techniques to generate a mutant organism for a particular gene, we should carefully consider all the possible elements that may play a potential role in the resulting phenotype. This perspective highlights the commonly used mouse strains and the most probable associated complexities that have not been considered previously, resulting in possible limitations in the currently utilized transgenic technology. This work also warrants the use of already established mouse lines in further research.


Introduction
Traditionally, introducing specific mutations or foreign DNA at the site of a targeted gene, either to inactivate it or to correct a faulty gene, has been one of the most widely used approaches in modern biology for the functional elucidation of genes. Even today, these techniques are routinely used as standard methods to investigate model organisms such as mouse, plant, zebrafish, Drosophila, nematode, and bacteria. In general, to study gene function, dominant-negative, knock-in, and complete, partial, tissue-specific, or conditional knockout approaches are used, depending on the needs of the individual investigation. Moreover, recent advances in techniques involving CRISPR/Cas9 have not only expedited transgenesis but also rejuvenated the field of therapeutics as a potential tool for treating diseases such as lung cancer as well as the ongoing pandemic, COVID-19 [5,6,36,48]. Indeed, these techniques have proven powerful in understanding the minutiae of gene function, such as how a specifically located amino acid residue in a particular peptide, and its corresponding DNA sequence in the gene, plays a crucial role in determining its function. For example, in a knock-in mouse model, the p53 gene is engineered to harbor mutations that are generally found in human sporadic cancer cases carrying either a mutant or a nonfunctional p53 gene [22]. Unsurprisingly, these mutations cause different syndromes and cancers in humans. Additionally, each mutation presents a distinct phenotype in mice, suggesting diversity in the mechanisms of p53 regulation in different microenvironments/tissues/genetic backgrounds. However, the difference in phenotypes produced by the same p53 mutation in the two organisms cannot be completely explained by differences in genes, species, and microenvironment alone.
We now understand that the central dogma alone cannot fully explain the behavior of the cell, and that complexity supersedes quantity. Only a very small percentage (~2%) of our genome codes for functional proteins, and most of the genome is still beyond our limited understanding. The conventional view of the mammalian genome is that ~25,000 protein-coding genes are dispersed within a rather repetitive and largely non-transcribed sequence. Over the past decade, this view has been challenged by the discovery of several different and essential RNA species in mammalian cells, termed non-coding RNAs. This non-coding genome is interspersed with the coding genome in such an intricate manner that it is now an extremely daunting task to discriminate between the two [51]. For instance, coding regions of functional proteins tend to be long, and the presence of an ORF (open reading frame) of at least 300 nucleotides (100 aa) is commonly used to define a transcript as "coding"; yet many long transcripts with known non-coding functions also typically contain multiple ORFs. These ORFs may give rise to proteins, might be translated inefficiently, or may even produce a non-functional protein that is rapidly degraded by proteasomes. These gray areas in defining coding and non-coding elements remain unexplored and may open new avenues of research. Although we have begun to understand the signatures and properties of this tessellated non-coding entity, it is still too early to anticipate or understand its full complexity.
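The 300-nucleotide ORF criterion mentioned above can be sketched in a few lines of Python. This is only an illustration of the heuristic, not a published classifier: the function names, the forward-strand-only scan, and the choice to count ORF length from the ATG up to (but not including) the stop codon, so that 300 nt corresponds to 100 codons, are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Return the length (nt, stop codon excluded) of the longest ORF.

    Assumption: only the three forward reading frames are scanned, and an
    ORF runs from the first ATG in a frame to the next in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i - start)
                start = None
    return best


def looks_coding(seq: str, min_orf_nt: int = 300) -> bool:
    """Apply the 'ORF of at least 300 nt (100 aa)' rule of thumb."""
    return longest_orf_length(seq) >= min_orf_nt


# Synthetic toy transcripts (not real genes): 100 codons vs. 11 codons.
long_toy = "ATG" + "GCT" * 99 + "TAA"
short_toy = "ATG" + "GCT" * 10 + "TAA"
print(looks_coding(long_toy), looks_coding(short_toy))
```

A long non-coding transcript that happens to contain an ORF of this length would be labeled "coding" by such a rule, which is precisely the gray area described in the text.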

The problem
The biology and engineering of "knocking out" genes become more complex due to the presence of important regulatory elements in the form of non-coding RNAs, such as miRNAs, lncRNAs, and natural antisense transcripts (NATs), inside and outside of the traditionally defined coding sequence (Fig. 1). Hence, it would be incorrect to state that knocking out a gene by the available traditional approaches will produce a phenotype that can be attributed precisely to the loss of that gene only. Until the end of the last century, and even currently, scientists have engineered numerous knockouts by deleting or modifying exon(s), e.g., by inserting reporter genes, by trapping promoters and coding sequences, and by truncating a large part of the protein through insertion of a stop signal. However, the effect of unintentionally altering non-coding genes present within or outside the introns, and sometimes within exons, has not been taken into account in the process of knockout mouse generation. Moreover, the unintentional disruption of NATs present on the non-coding strand of DNA during knockout generation further complicates the matter, as they participate in various cellular regulatory processes via cis or trans mechanisms. Consider, for instance, the Cftr knockout mouse (Cftr−/−), which was generated by inserting an in-frame mutation in exon 10 to produce a truncated protein [47]. These Cftr knockout mice displayed a very strong phenotype, limiting their viability to a maximum of 40 days. The mouse Cftr gene has 28 exons, and there are several long intronic regions in the gene. Interestingly, a report by Hill et al. on introns from CFTR demonstrated that introns alone are capable of coordinating the expression of functionally related genes [20]. They overexpressed three long intronic sequences (6a, 14b, and 23) from the CFTR gene in epithelial cells (HeLa), in which CFTR is not normally expressed.
They observed that the expression of the CFTR introns caused extensive, specific, and highly reproducible transcriptional changes, affecting genes linked to CFTR function. The authors posited that, since these transfected cells do not express the CFTR protein-coding transcript, the observed effects must have been caused by the intronic sequences. Because none of the three intronic sequences includes any known miRNAs or predicted stem-loop structures, they seem to act in trans as long ncRNA regulatory elements [20]. Similarly, constructs containing common selection markers/reporter genes such as GFP, EGFP, Neoʳ, LacZ, and DsRed are often left within the target genome post-selection [9,21,29,37,62]. However, these genes can themselves become potential targets of miRNAs of host origin, e.g., Mus musculus, as discussed later. Therefore, it is reasonable to assume that the resulting phenotype can be attributed to the combined effect of "altering the specific coding gene" and of the "other non-coding genes" that are inadvertently affected by the genetic engineering method used to generate the knockout organism. This work attempts to highlight the presence and/or disruption of these non-coding elements.

Analysis
Coding region or mRNA sequences of the transgenes were retrieved from the NCBI nucleotide database and used as target sequences for analysis. The custom miRNA prediction tool available at miRDB, an online database for miRNA target prediction [7], was used to search for Mus musculus miRNAs potentially targeting the mRNAs generated from commonly used reporter genes, Cre recombinase (Table 1), and human genes expressed in transgenic mouse models (Table 2). Where several miRNAs with a wide range of scores were retrieved, an arbitrary minimum cutoff of 60 was applied to the target score for the selection of miRNAs. A search of previously published literature was performed for knockout/mutant mice in which the introduction of specific mutations/foreign DNA at the site of the targeted gene had also inadvertently caused the disruption of lncRNAs or NATs. The affected genes and the co-disrupted non-coding elements were analyzed and compiled together with the publications that have used these mice (Table 3).
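The score cutoff described above amounts to a simple filter over the predicted miRNA-target pairs. The sketch below assumes the predictions have already been collected into records of (miRNA, target, score); the `Prediction` type, the function name, and the placeholder miRNA and target names are hypothetical and do not reproduce any entry of Tables 1 or 2 or the miRDB export format.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    mirna: str    # miRNA identifier (placeholder names below)
    target: str   # transgene / reporter the prediction refers to
    score: int    # target score reported by the prediction tool


def filter_by_score(predictions: List[Prediction],
                    min_score: int = 60) -> List[Prediction]:
    """Keep predictions at or above the cutoff, highest score first."""
    kept = [p for p in predictions if p.score >= min_score]
    return sorted(kept, key=lambda p: p.score, reverse=True)


# Placeholder records for illustration only (not real prediction output).
example = [
    Prediction("mmu-miR-placeholder-A", "reporter-X", 58),
    Prediction("mmu-miR-placeholder-B", "reporter-X", 91),
    Prediction("mmu-miR-placeholder-C", "reporter-X", 60),
]

for p in filter_by_score(example):
    print(p.mirna, p.target, p.score)
```

Treating the cutoff as inclusive (score of 60 or more) is an interpretation of "minimum cutoff value of 60"; a stricter or looser threshold would change how many candidate miRNAs are retained for each transgene.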

Commonly used foreign genes targeted by Mus musculus miRNAs
The neomycin resistance gene (Neoʳ) is one of the most widely used selection markers for correctly targeted cells, and the neomycin cassette itself is normally left within the genome post-selection on the assumption that it has no adverse effect on eukaryotic cell biology [21,50,62]. On closer inspection, however, the Neoʳ gene construct is itself a potential target of several miRNAs of eukaryotic origin or, more specifically, of the miRNAs within the cells of transgenic mice containing the neomycin cassette (Table 1). Similarly, lacZ is another widely used reporter, and its gene is often used in generating transgenic mice. A simple analysis revealed a similar fate for the lacZ gene as another strong target of several murine microRNAs (Table 1). Several other reporter genes that are widely used in mouse transgenic technology, such as GFP, EGFP, TdTomato, and DsRed, were also found to be potential targets of murine microRNAs (Table 1). Hence, it is reasonable to assume that any construct containing the Neoʳ/lacZ/GFP/EGFP/TdTomato/DsRed variants can also be considered a de novo target of microRNAs of murine origin. Interestingly, one of the most widely used recombinase enzymes, Cre, which is used in mouse studies for fate mapping, stem cell homing, and gene deletion, is also a potential target of several murine microRNAs (Table 1). Using the miRDB custom prediction tool [7], we searched for Mus musculus miRNAs that could potentially target the abovementioned foreign genes that are frequently used in the generation of transgenic mouse strains (Table 1). Based on the analyzed data, we propose that the phenotype produced by interfering with the gene of interest may not be due solely to the disruption of that particular gene but to the combined interference with the gene of interest and the associated non-coding elements.
Additionally, these reporter genes or other elements of a targeting vector that are deliberately left in the mouse may very well act as sponges/sinks for the miRNAs or other non-coding RNAs, thus interfering with the normal physiology of the cell.

Co-disruption of natural antisense transcripts (NATs) and long non-coding RNAs (lncRNAs) with the gene of interest in knockout mice
Recent years have seen a rising number of studies investigating the role of natural antisense transcripts (NATs) in eukaryotes. This has shed light on their cis- as well as trans-activity in gene regulation at various levels, and NATs have been shown to play a crucial regulatory role in eukaryotic gene expression [3,55,64]. Generally, these are non-protein-coding, fully processed mRNAs that are transcribed from the opposite strand of protein-coding sense transcripts [4]. In currently used transgenic techniques, while introducing mutations at the target site of the gene of interest, we often disrupt not only the sequence of the target gene but also the partially or completely overlapping sequence of genes for NATs on the antisense strand. Although the disruption of a NAT may be inadvertent, it interferes with its cis-/trans-activity. Hence, the resulting knockout phenotype would have to be attributed to the disruption of both the target gene and the corresponding overlapping NAT sequence. This should make us reconsider the assignment of the "bona fide mutant for the target gene only" status to the transgenic mice generated in such cases. We performed a literature search for such mice with co-disruption of target genes and overlapping NATs and found several such cases (Table 3). For instance, Hoxd-3 knockout mice have been created by insertion of the pD3Neo2TK vector carrying 11.7 kb of Hoxd-3 sequence, with disruption of Hoxd-3 at nucleotide 82 of exon 1 by an MC1neo poly-A cassette [12]. Murine Hoxd-3 has 3 exons and 2 introns and has a 5' end overlap (4137 bp) with its antisense regulatory element "hoxd3os1"; because of this overlap, the disruption of exon 1 (size 324 bp) also disrupts intron 2 of "hoxd3os1". Hence, the resulting phenotype should be attributed to the disruption of both of these elements.
Similarly, double-mutant mice were created with a targeted disruption in hoxa-3 and hoxd-3, in which the resulting phenotype would be similarly affected, given the similar nature of the hoxd-3 disruption [13]. Another example of NAT disruption in genetically engineered mice is "Airn" in the Igf2r mutant mouse. Igf2r has 48 exons and has a 28,395 bp overlap with its natural antisense transcript "Airn," a long non-coding RNA. Airn is responsible for silencing the insulin-like growth factor 2 receptor gene and flanking genes in mice. The overlap spans exon 1, exon 2, intron 1, and a major portion of intron 2. Igf2r knockout mice were created by replacing 0.33 kb of 5' flanking sequence and 38 codons of exon 1 with a neomycin resistance gene (Neoʳ) cassette [33]. This would also replace a portion of intron 1 of Airn and hence contribute to the phenotype originally attributed to the disruption of Igf2r alone. Similarly, in Dlx-1/2 floxed conditional knockout mice, Dlx-1 has a 3343 bp overlap with its natural antisense transcript "Dlx-1as," spanning exons 2 and 3 and intron 2 completely and a portion of intron 1. These mice have been generated by introducing loxP sites between exons 1 and 2 of both the Dlx-1 and Dlx-2 genes (found in opposite orientation on chromosome 2, 9427 bp apart from each other) [45]. Dlx-1/2 floxed mice were crossed with Olig1-Cre knock-in mice, which completely excised exons 2 and 3 and intron 2 of each gene and the intervening ~10 kbp sequence (which contains Dlx-1as on the complementary strand in that region). Therefore, the deletion of the entire Dlx-1as would also contribute to the resulting phenotype, along with the deletion of Dlx-1 and Dlx-2. In Msx-1 conditional KO mice, Msx-1, a 4059 bp long homeobox gene, has a 2187 bp overlap with its natural antisense transcript "Msx1os," spanning portions of exon 2 and the single intron of Msx-1.
Conditional KO mice of Msx-1 and 2 have been generated by

Table 2 Human genes expressed in transgenic mice are targeted by murine miRNAs in corresponding transgenic mice. Human genes expressed in transgenic mice become potential targets of Mus musculus miRNAs due to their foreign nature. This miRNA-target mRNA interaction may often lead to interference with their expression in mice. The potentially targeting miRNAs were retrieved from miRDB using their custom target prediction tool

Columns: Genes and accession numbers; Function; miRNAs targeting the transcript; Example of mouse strains; References

hTNF-α (NM_000594); Function: Regulatory role in inflammatory response and host defense against bacterial infection.
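The co-disruption cases described in the NAT/lncRNA section reduce to an interval-overlap question: does the engineered deletion or insertion site intersect an annotated antisense transcript at the same locus? The following is a minimal sketch of that check. The `Interval` type, the function names, and all coordinates are hypothetical toy values chosen for illustration; they are not genomic coordinates of Hoxd-3, Igf2r, Dlx-1, or Msx-1, and strand is deliberately ignored because a deletion removes both strands.

```python
from typing import Dict, NamedTuple


class Interval(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def co_disrupted(lesion: Interval,
                 features: Dict[str, Interval]) -> Dict[str, int]:
    """Return every annotated feature hit by the lesion, with bp affected."""
    hits = {name: overlap_bp(lesion, iv) for name, iv in features.items()}
    return {name: bp for name, bp in hits.items() if bp > 0}


# Toy locus: a sense gene, an overlapping antisense transcript, and a
# distant feature that the lesion does not reach.
locus = {
    "sense_gene": Interval(1000, 5000),
    "antisense_transcript": Interval(3500, 7000),
    "distant_lncRNA": Interval(9000, 9500),
}
deletion = Interval(3000, 4000)

print(overlap_bp(locus["sense_gene"], locus["antisense_transcript"]))
print(co_disrupted(deletion, locus))
```

Running such a check against current annotation before assigning a phenotype to the target gene alone would flag cases of the kind listed in Table 3, although it says nothing about whether the affected non-coding element is functionally relevant.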

Conclusion
Technically, this is a limitation of the biological system itself that we may never be able to overcome. In most cases, man-made mutations introduced into the mouse genome ultimately affect both strands of DNA and hence the non-coding genes, whereas a natural mutation in the form of a point mutation may not affect the other strand. However, when a natural mutation/deletion affects a large part of a chromosome, we must acknowledge the phenotype as a collective representation of both coding and non-coding gene disruptions. This can also be seen in mice where unknown modifiers from different genetic backgrounds interact with the same targeted gene to contribute to anomalous differences in the phenotype. For example, in the first documented case describing the influence of genetic background on gene expression, the diabetes (db) and obese (ob) mutations on a B6 background were shown to cause only obesity and transient diabetes, whereas on a C57BLKS/J (BKS) background they caused obesity and severe diabetes [10,11]. However, in addition to the modifier genes, we might also be seeing the effects of non-coding genes that play essential roles in cellular processes and are affected by the genetic deletions. Conversely, we often observe that knocking out a gene does not produce the expected results. Commonly, this is explained as the gene not being crucial for either development or maintenance. However, one can argue that altering the coding gene at one locus is compensated by the simultaneous loss of non-coding gene(s) at the same position. The African proverb "When elephants fight, it is the grass that suffers" explains the fate of "non-coding genes" well. Because of our incomplete understanding of the complexity of non-coding entities in the past, there is a strong possibility that these components of the genome were inadvertently affected while engineering knockout mice.
Hence, it becomes critical to revisit the old methods of generating knockouts in light of our current understanding and to examine the transgenic strategy and the affected gene functions more carefully. Nevertheless, developing strategies to single out a particular gene function without affecting other associated non-coding elements will be a highly complex task. It should also be noted that these concerns may not necessarily apply to all the knockouts created to date. Our work warrants the use of already established mouse lines in further research.