In silico analysis of promoter regions and regulatory elements (motifs and CpG islands) of the genes encoding for alcohol production in Saccharomyces cerevisiaea S288C and Schizosaccharomyces pombe 972h-

Background The crucial factor in the production of bio-fuels is the choice of potent microorganisms used in fermentation processes. Despite the evolving trend of using bacteria, yeast is still the primary choice for fermentation. Molecular characterization of many genes from baker’s yeast (Saccharomyces cerevisiaea), and fission yeast (Schizosaccharomyces pombe), have improved our understanding in gene structure and the regulation of its expression. This in silico study was done with the aim of analyzing the promoter regions, transcription start site (TSS), and CpG islands of genes encoding for alcohol production in S. cerevisiaea S288C and S. pombe 972h-. Results The analysis revealed the highest promoter prediction scores (1.0) were obtained in five sequences (AAD4, SFA1, GRE3, YKL071W, and YPR127W) for S. cerevisiaea S288C TSS while the lowest (0.8) were found in three sequences (AAD6, ADH5, and BDH2). Similarly, in S. pombe 972h-, the highest (0.99) and lowest (0.88) prediction scores were obtained in five (Adh1, SPBC8E4.04, SPBC215.11c, SPAP32A8.02, and SPAC19G12.09) and one (erg27) sequences, respectively. Determination of common motifs revealed that S. cerevisiaea S288C had 100% coverage at MSc1 with an E value of 3.7e−007 while S. pombe 972h- had 95.23% at MSp1 with an E value of 2.6e+002. Furthermore, comparison of identified transcription factor proteins indicated that 88.88% of MSp1 were exactly similar to MSc1. It also revealed that only 21.73% in S. cerevisiaea S288C and 28% in S. pombe 972h- of the gene body regions had CpG islands. A combined phylogenetic analysis indicated that all sequences from both S. cerevisiaea S288C and S. pombe 972h- were divided into four subgroups (I, II, III, and IV). The four clades are respectively colored in blue, red, green, and violet. Conclusion This in silico analysis of gene promoter regions and transcription factors through the actions of regulatory structure such as motifs and CpG islands of genes encoding alcohol production could be used to predict gene expression profiles in yeast species.


Background
The scarcity and rising prices of fossil fuels, geo-political instability in countries that hold most of the proven oil reserves together with apprehension about the environmental harm created by them, have resulted in increasing efforts to search for alternative energy sources [1]. Hence, production and use of bio-fuels for transport fuel has recently attracted significant attention worldwide. Likewise, Ethiopia's sustainable development and the national fuel security can only be realized with increased production and utilization of renewable fuels. Substituting the demand for fossil fuel by locally produced fuels such as bio-ethanol and bio-diesel is paramount importance for the country's economic use of scarce energy resources.
Bio-ethanol has been made since ancient times by fermenting sugars, and most bio-ethanol used for fuel and alcoholic drinks, and most industrial ethanol, is made by this process (Licht 2001; as cited in [2]). Great strides in research together with the development of new yeast strains have led to demands to model a new yeast strain which can withstand and produce at higher levels of alcohol, temperatures, and pH. This requires immense knowledge of the fermentation processes to improve its efficiency which is dependent on various factors, namely, process design, molasses quality, yeast strain, contamination, nutrient availability, and raw material purity [2]. Yeast alcohol is one of the most valuable products originating from the biotechnological industry with respect to both value and amount [3]. Yeast selection for fuel ethanol production over the past two decades and most bio-ethanol-related researches in developing tropical countries have focused primarily on the isolation of local Saccharomyces yeasts and their use for industrial ethanol production [4][5][6].
The development of DNA transformation in yeast has made possible the rapid molecular isolation of many genes from baker's yeast (Saccharomyces cerevisiaea) and fission yeast (Schizosaccharomyces pombe). Concomitantly, our understanding of many aspects of gene structure and regulation of gene expression in these organisms has improved. Although the two organisms are similar in that they are both spore-forming yeasts capable of strong alcoholic fermentation, S. pombe actually has diverged significantly from S. cerevisiaea [7]. In recent years, genome mining and in silico analysis of gene sequences and their products have become a key methodology to identify gene expression patterns, sequences responsible for development of new molecules, leading to the discovery of dozens of novel compounds [8,9]. A variety of computational tools have been developed to support scientists in this field. Most of the available tools are dedicated to the in silico analysis of specific gene and gene products [8].
Gene expression varies among tissues and even different cell types but also in response to specific signals (physiological, environmental, etc.). The main mechanism of transcriptional regulation is orchestrated by proteins called transcription factors (TFs), which promote (as activators) or block (as repressors) the recruitment of the RNA polymerase II (Pol II complex) [10]. The promoter is a DNA sequence that the transcription apparatus recognizes and binds. It indicates which of the two DNA strands is to be read as the template and the direction of transcription [11]. It is a functional region containing complex regulatory elements for determining the transcription initiation of genes [9,10,12,13]. DNAbinding sites or motifs refer to short DNA sequences (typically 4 to 30 base pairs long, but up to 200 bp for recombination sites) that are explicitly bound by one or more DNA-binding proteins or protein complexes [14]. It is often associated with specialized proteins known as transcription factors and is thus linked to transcriptional regulation. A structural feature that has proven useful in the detection of promoters is the so called CpG islands, i.e., regions that are rich in CpGs, which are important because of their strong link with gene regulation [15]. CpG islands are playing an important role in gene regulation through epigenetic changes [16].
Therefore, the aim of this study is to predict promoter and regulatory elements of genes encoding alcohol production in yeast species (Saccharomyces cerevisiaea S288C and Schizosaccharomyces pombe 972h-) thereby providing basic information which could support the effort of improving them for a commercial-scale bio-ethanol production.

Methods
Determination of transcription start sites and promoter regions for genes encoding alcohol production Gene sequences of yeast species (Saccharomyces cerevisiaea S288C and Schizosaccharomyces pombe 972h-) for genes encoding alcohol production were retrieved as FASTA file from NCBI Genome Browser (https://www. ncbi.nlm.nih.gov/gene). Gene sequences starting by ATG (starting codon) were identified, and coding sequences were used in this analysis. However, for S. pombe 972h-, all the sequences retrieved from direct NCBI web were not having the functional gene structure (no ATG in the beginning and many stop codons in the middle). Therefore, sequences used for the current study were retrieved via NCBI Reference Sequences (RefSeq). This section includes genomic Reference Sequences (RefSeqs) from all assemblies on which these genes were annotated, such as RefSeqs for chromosomes and scaffolds (contigs) from both reference and alternate assemblies. To this end, gene sequences were taken as identical protein annotated from PomBase annotation Provider for Eukaryotic Annotation Propagation Pipeline.
Generally, 23 and 21 sequences of alcohol dehydrogenase were retrieved for S. cerevisiaea S288C and S. pombe 972h-, respectively. To determine respective transcriptional start sites (TSSs) for all gene sequences, about 1kb sequences upstream of the start codon were excised from all genes except for ADH1 of S. cerevisiaea S288C which was at 2-kb upstream of start codon. Similarly, all gene sequences of S. pombe 972h-, except three genes (erg27 at 2.5 kb and SPBC8E4.04 and SPBC337.11 at 2 kb), had TSSs at 1-kb sequences upstream of their start codons.
The Neural Network Promoter Prediction (NNPP version 2.2) tool set was used with the minimum standard predictive score (between 0 and 1) [17]. For those regions containing more than one TSS, the one with the highest value of prediction score was considered to have trustable and truthful prediction. Promoter regions were defined as 1-kb region upstream of each TSS. For those regions containing more than one TSS, the highest value of prediction score will be considered so as to have a more accurate prediction.
Determination of common motifs and TFs for genes encoding alcohol production in the promoter region Analysis of conserved motifs for genes encoding for alcohol productions for both yeast species was performed by MEME (Multiple Em for Motif Elicitation) software version 3.5.4 (http://meme.sdsc.edu). This online webbased analysis was performed with minimum and maximum motif width of 6 and 50 residues, respectively, for both yeast species whereas a maximum number of motifs for S. cerevisiaea S288C and S. pombe 972hwere 23 and 21, respectively, which were used to identify probable promoter regulatory elements (motifs), keeping the rest of the parameters at default. The MEME output in HTML showed the motifs as local multiple alignments of the input sequences, as well as in several other formats. Buttons on the MEME HTML output were allowed one or all of the motifs to be forwarded for additional investigation. Descriptions the identification of motifs by TOMTOM [18] web server were designated where numerous sequence databases can be searched for sequences matching the identified motif, in which the output of TOMTOM will include LOGOS on behalf of the alignment of two motifs, the p value and q value (a measure of false discovery rate) of the match [18]. TOMTOM showed that the query motif closely resembles the binding motif in the set of genes encoding for alcohol production promoter regions.

Search for CpG islands for genes encoding alcohol production promoter regions
To search CpG islands, first, the stringent search criteria were used in the Takai and Jones algorithm: GC content ≥ 50%, Obs CpG/ExpCpG ≥ 0.60, and length ≥ 200 bp [20]. For this purpose, the CpG island searcher program (CpGi130) available at web link (http://www.bioinformatics.org/sms/cpg_island.html) was used. The CpG island graphs were plotted using EMBOSS Cpgplot (https://www.ebi.ac.uk/Tools/seqstats/emboss_cpgplot/) which identify and plot CpG islands in nucleotide sequence(s). Secondly, the CLC Genomics Workbench ver. 3.6.5 (http://clcbio.com, CLC bio, Aarhus, Denmark) was used for restriction enzyme MSpI cutting sites (fragment sizes between 40 and 220 bp).

Phylogenetic analysis
Phylogenetic analysis of gene sequences of both yeast species (S. cerevisiaea S288C and S. pombe 972h-) was conducted using the Molecular Evolution Genetic Analysis 6 (MEGA6) tool by the neighbor-joining treemaking method. Similarly, Tajima's neutrality test of selection was conducted using the same software to find nucleotide diversity. The p-distance model was applied with transition and trans-version nucleotide substitution. Bootstrap values of the super tree were computed with 2000 repetitions with uniform rate among sites and complete deletion of gaps/missing data were used to analyze the sequences.

Determination of transcription start sites and promoter regions for genes encoding alcohol production
Promoter region analysis of genes encoding for alcohol productions of both yeast species (S. cerevisiaea S288C and S. pombe 972h-) showed a great variation in the number of TSS. The highest promoter prediction scores (1.0) for TSS of S. cerevisiaea S288C alcohol dehydrogenase were obtained for five gene sequences (AAD4, SFA1, GRE3, YKL071W, andYPR127W) while the lowest promoter prediction scores (0.8) were obtained for three gene sequences (AAD6, ADH5, and BDH2) ( Table 1). In addition, the result of promoter predictions for S. cerevisiaea S288C sequences with score cutoff 0.80 showed that out of twenty-three gene sequences used in this analysis only ADH1 and ADH7 (8.70%) had showed a single TSS while the remaining (91.30%) showed multiple TSS. In these scenarios, TSSs with the highest prediction scores were considered for further uses. TSSs of genes encoding for alcohol production in S. cerevisiaea S288C were mostly located in the upstream region of − 31 to − 1545 bp, with the relatively highest incidence of occurrence in the upstream region of − 1 to − 200 bp (10 sequences, 43%) followed by − 201 to − 400 bp and −  Likewise, twenty-one sequences of genes encoding for alcohol production for S. pombe 972h-(fission yeast) were retrieved from NCBI Genome Browser. Accordingly, the result of TSS and promoter analysis had showed a significant variation in the number of TSS and the distance of TSS from the start codons ( Table 2). The numbers of TSSs were varied from 1 to 3 with majority of sequences (71.43%) having more than one TSS. In particular to this fact, out of twenty-one sequences six, seven, and eight sequences had one, two, and three TSSs respectively. The relative locations of all TSS with respect to start codon were given in Table 2. The nearest TSS were recorded for SPCC13B11.04c (− 39) followed by Adh1 (− 77) while the far-flanged TSS were observed for egr27 (− 2308) followed by SPBC337.11 (− 1636) upstream of the start codons of their respective genes. The current analysis also revealed that the relatively highest frequency of occurrence in the upstream region of − 1 to − 200 bp and − 201 to − 400 bp (6 sequences each, 57.1% share) followed by − 401 to − 600 bp (4 sequences, 19.04%) from the transcription start site, while the lowest occurrences were observed at − 601 to − 800 and − 801 to − 1000 bp (only 1 sequence each) and whereas three sequences had their TSS at above − 1000 bp ( Table  1). The predictive score at a cutoff value of 0.8 ranged from 0.99 to 0.88.

Determination of common motifs and TFs for genes encoding alcohol production in the promoter region
The promoter regions shared by majority of the gene sequences used in the current study revealed that S. cerevisiaea S288C had 100% coverage among the gene sequences at MSc1 with an E value of 3.7e−007 and 15 motif widths (Table 3).
To determine motifs which are functionally important, motifs which were shared by majority of promoter regions of S. cerevisiaea S288C genes encoding for alcohol production were chosen. Accordingly, MSc1 was revealed as the common promoter motif for all (100%) genes that serves as binding sites for transcription factors involved in the expression regulation of these genes. Sequence logo for MSc1 generated by MEME is presented in Fig. 1 Likewise, S. pombe 972hpromoter sequences had 95.23% conserved motif at MSp1 with E value of 2.6e+ 002 and 29 motif widths ( Table 4). The common motifs of MSp1 shared by majority of the genes encoding for alcohol production sequences of S. pombe 972has generated by MEME revealed in Fig. 2.
Furthermore, candidate transcription factor proteins were identified for motifs (MSc1 and MSp1) of both yeast species. They were then compared to the registered motifs in publicly available database of Jaspar2018_ Core_Fungi_Non-Redundant DNA so as to see if they are similar to known regulatory motifs using TOMTOM web application [19]. The output from TOMTOM also links back to the parent motif database for more detailed information on biological functions of the matched motif. As a result, 13 motifs out of 176 common promoter motif/transcription factors were identified for MSc1 while only 9 motifs out of 176 in MSp1 were being found matched with known motifs found in JASPAR MSc motif for S. cerevisiaea S288C  Table 2). In addition, the comparison of the identified transcription factor proteins indicated that almost 88.88% (8 out of 9) of the MSp1 were exactly similar to that of MSc1. This could be due to the fact that these two yeast species (S. cerevisiaea S288C and S. pombe 972h-) shared common promoters and their subsequent transcription factor protein were highly conserved regions of the genes encoding for alcohol production. TOMTOM is a motif comparison algorithm that ranks the target motifs in a given database according to the estimated statistical significance of the match between the query and the target. In similar manner, TOMTOM provides LOGOS that represents the alignment of two motifs and a numeric score for the match between two motifs together with a statistical significance [21]. The total numbers of motifs discovered in S. cerevisiaea S288C for genes encoding alcohol production promoter regions were about 60 out of which relatively, higher distributions of motifs were found also in positive (39) than in negative (21) strands (Fig. 3). The location and distribution of these motifs were ranged from − 998 to − 1 while higher concentration of motifs was found between − 850 and − 50 bp of the transcription start sites (TSSs). In the same view, only 48 motifs were discovered in S. pombe 972hout of which relatively, higher distributions of motifs were found also in negative (25) than in positive (23) strands (Fig. 4). The location and distribution of these motifs were ranged from − 996 to − 2 while higher concentration of motifs was found between − 800 and − 100 bp of the transcription start sites (TSSs).

Search for CpG islands for genes encoding alcohol production in the promoter regions
In the current study, revealed CpG islands were examined in the promoter and gene body regions of both yeast species using both Takai and Jones algorithm using parameters as indicated in this section of this study and CLC Genomics Workbench ver. 3.6.5. Accordingly, as per the stringent criteria of Takai and Jones [17] as indicated in this section, there were only five (ADH1 (Fig. 5), ADH2, ADH5, ZWF1, and BDH2) (21.73%) CpG islands observed in the gene body regions in analogous to only six (ADH1, SFA1, ADH3, ZWF1, BDH2, and YPR127W) out of twenty-three (26.08%) gene sequences used for the analysis in promoter regions of S. Cerevisiaea S288C yeast species. Likewise, only one (adh1) had CpG island in the promoter region and six (adh1, SPBC1773, SPCC13B11.04c, SPAC2E1P3.01, Yak3, and SPBC16A3.02c) CpG islands  were observed in the gene body of genes encoding for alcohol production of S. pombe 972h-.
On the other hand, CLC genomics workbench ver 3.6.1 using restriction enzyme MSpI (C/CGG sequence) cutting sites with standard fragment sizes between 40 and 220 bp revealed that in S. cerevisiaea S288C five (ARA1, BDH2, GCY1, SFA1, and GRE3) and only one (GRE3) CpG islands were found in gene body and promoter regions, respectively. In contrarily, the number of CpG islands in the gene body was only one (SPBC16A3.02c) while four (adh1 (Fig. 6), adh4, SPAP32A8.02, and SPBC337.11) in the promoter regions of S. pombe 972hin alcohol dehydrogenase (Table 5). This indicates the poor occurrence of CpG islands in both gene body and promoter regions which may affect the access of promoter region of genes to their transcription factors, hence preventing their expression.

Phylogenetic analysis
A combined analysis of all the data weighted equally resulted in a single most-parsimonious cladogram (Fig. 7). Statistics for this tree revealed that in general, relationships on the tree are very strongly supported. A phylogenetic tree was generated using the neighbor-joining (NJ) as well as minimum-evolution method of MEGA 6.0. As illustrated in Fig. 7, all sequences from both S. cerevisiaea S288C and S. pombe 972hwere divided into four subgroups (I, II, III, and IV). The four clades are respectively colored in blue, red, green, and violet. The phylogenetic tree indicated that some gene sequences irrespective of source organism clustered together within each sub-group suggest a close evolutionary relationship among the genes rather than the whole species.

Discussion
Sequence-specific DNA-binding transcription factors (TFs) are often termed as "master regulators" which bind to DNA and either activate or repress gene transcription. Determination of a gene's transcriptional start site underlies the identification of the proximal promoter region and thus facilitates the subsequent analysis of gene expression. In the current in silico analysis, majority of the sequences in both species had multiple transcription start sites. This could give the alternative transcription potential for the gene sequence under consideration. However, for better prediction, TSS with a higher cutoff value was considered in the current study. Many authors [22][23][24] have reported the presence of multiple transcription sites in genes encoding for alcohol production. The comparison of the two yeast species with this regard showed that S. cerevisiaea S288C had a nearby site than S. pombe 972h-. This may be due to the fact that S. cerevisiaea S288C heavily relies on genes encoding for alcohol production to convert aldehydes and ketones into alcohols and NADH to NAD + that the yeasts can use for energy [25]. This process of yeasts turning aldehydes and ketones into alcohols is called fermentation. In general, promoter regions are located at the immediately upstream of a transcription start site (TSS) and have a variety of sequence motifs that participate in gene regulation [26]. The establishment and maintenance of temporal and spatial patterns of gene expression are achieved primarily by transcription regulation. Functionally important motifs are usually short conserved sequence pattern associated with distinct functions of DNA often serve as transcription factor biding sites. In the current in silico analysis using MEME, the conserved short sequences of S. cerevisiaea S288C had confirmed to have higher occurrencesthan that of S. pombe 972h-. Common motifs, short DNA segments, are binding sites for transcription factors [27]. In both cases, the probability of finding a well-conserved pattern in random sequences as evidenced by an E value is significantly higher than expected value. It is generally believed that genes having similar expression patterns contain common motifs in their promoter regions [28,29]. Common promoter motifs are the key signatures for a family of coregulated genes and are usually present in the regions where complex protein interactions occur [30]. However, in some cases, single motifs can bind various transcription factors thereby bringing the genes under multiple regulatory controls [31,32]. Extensive studies on 500 bp upstream regions of yeast promoters suggest that regulatory elements are commonly present in those regions [33].
It is well reported that CGIs are highly involved in gene regulatory processes [34]. They are present at or near the gene's transcription start site and are often associated with the promoters of most house-keeping genes and many tissue-specific genes, and thus have important regulatory functions and can be used as gene markers [35].
A phylogenetic tree generated in the current study showed that all sequences from both S. cerevisiaea S288C and S. pombe 972hwere divided into four subgroups. This could be due to the fact that some vital genes for alcohol fermentation of yeast strains. Similar studies by other authors [34,35] have revealed that one organism which is equally amenable to genetic manipulation as is S. cerevisiaea is the fission yeast S. pombe. Although the two organisms are similar in that they are ascospore-forming yeasts capable of strong alcoholic fermentation, S. pombe actually has diverged significantly from S. cerevisiaea [36,37]. In fact, the available information suggests that S. pombe may actually be more closely related to filamentous fungi such as Neurospora and Asperglus than it is to budding yeast [9,38].

Conclusion
The enhancement of DNA transformation in yeast has resulted in revolutionized molecular tools and facilitated ease of thought of aspects of gene structure and regulation of gene expression of many genes from baker's yeast (S. cerevisiaea) and fission yeast (S. pombe). Generally, the regulation of alcohol dehydrogenase enzyme which is critically important for the survival and enhanced efficiency of yeast species can better be examined and explained with the use of increasing technological advancement, genome mining, and in silico analysis of gene sequences. Majority of the genes encoding for alcohol production sequences have multiple TSS in both yeast species suggesting alternative gene regulation. The result of this analysis could be critically important to understand the nature of promoter regions, the motif discovered in line with the transcription factor binding proteins, and the strength of the genome to different transcriptions via following the frequency of CpG islands. The phylogenetic analysis revealed that majority of genes is clustered together irrespective of the source organism suggesting a close evolutionary relationship among the gene rather than the whole species.
In general, this in silico analysis of genes encoding for alcohol production of S. cerevisiaea S288C and S. pombe could be helpful to add knowledge about the species molecular data and supportive to identify gene regulatory elements in the promoter regions. It could also help to predict gene expression profiles in various yeast species which in turn could be helpful to improve efficiencies of these organisms for domestic and commercial production of bio-fuel with higher rate of recovery.