In silico analysis of promoter region and regulatory elements of glucan endo-1,3-beta-glucosidase encoding genes in Solanum tuberosum: cultivar DM 1-3 516 R44

Background: Potato (Solanum tuberosum L.) is one of the most important food crops in the world. Pathogens remain one of the major constraints limiting potato productivity. Understanding the regulation of pathogenesis-related genes such as glucan endo-1,3-beta-glucosidase is therefore a foundation for engineering disease resistance in potato and for reducing the use of fungicides. In the present study, 19 genes were selected, and in silico methods were used to identify and characterize the promoter regions, regulatory elements, and CpG islands of glucan endo-1,3-beta-glucosidase genes in Solanum tuberosum cultivar DM 1-3 516 R44.
Results: The current analysis revealed that single transcription start sites (TSSs) were present in 12/19 (63.2%) of the promoter regions analyzed. At a cutoff value of 0.8, the predictive score for the majority (84.2%) of the promoter regions ranged from 0.90 to 1.00. The locations for 42% of the TSSs were below −500 bp relative to the start codon (ATG). MβGII was identified as the common promoter motif for 94.4% of the genes, with an E value of 3.5e−001. The CpG analysis showed low CpG density in the promoter regions of most of the genes, except for gene IDs 102593331 and 102595860. The number of SSRs per gene ranged from 2 to 9, with repeat lengths of 2 to 6 bp. Evolutionary distances ranged from 0.685 to 0.770 (mean = 0.73), indicating a narrow range of genetic diversity. Phylogeny was inferred using the UPGMA method, and gene sequences from different species were found to cluster together.
Conclusion: The regulatory elements identified in silico in the promoter regions will contribute to our understanding of the regulatory mechanism of glucan endo-1,3-beta-glucosidase genes and provide a promising target for genetic engineering to improve disease resistance in potatoes.
Supplementary Information: The online version contains supplementary material available at 10.1186/s43141-021-00240-0.


Background
Potato (Solanum tuberosum L.) is one of the most widely consumed carbohydrate-rich staple foods in large parts of the world; it is the fourth largest food crop in production [1]. Potato is mainly used as a staple food, but it also has a number of medicinal values. Moderate consumption of the juice from the tubers is used in the treatment of peptic ulcers, bringing relief from pain and acidity [2].
Pathogenesis-related proteins, often called PR proteins, are a structurally diverse group of plant proteins that are toxic to invading fungal pathogens. They are widely distributed in plants in trace amounts, but are produced in much greater concentrations following pathogen attack or stress. PR proteins exist in plant cells intracellularly and also in the intercellular spaces, particularly in the cell walls of different tissues. Varying types of PR proteins have been isolated from each of several crop plants. Different plant organs, e.g., leaves, seeds, and roots, may produce different sets of PR proteins. Different PR proteins appear to be expressed differentially in their hosts in the field when temperatures become stressful, low or high, for extended periods [3].
The several groups of PR proteins have been classified according to their function, serological relationship, amino acid sequence, molecular weight, and certain other properties. PR proteins are either extremely acidic or extremely basic and therefore are highly soluble and reactive. At least 14 families of PR proteins are recognized. Among these, glucan endo-1,3-beta-glucosidases (β-1,3-glucanases) are important hydrolytic enzymes that are abundant in many plant species after infection by different types of pathogens. Their amount increases significantly after infection, and they play a major role in the defense reaction against fungal pathogens by degrading the fungal cell wall, because β-1,3-glucan is a structural component of the cell walls of many pathogenic fungi. Glucan endo-1,3-beta-glucosidase appears to be coordinately expressed along with chitinases after fungal infection. This co-induction of the two hydrolytic enzymes has been described in many plant species, including pea, bean, tomato, tobacco, maize, soybean, potato, and wheat [4][5][6][7][8][9][10][11]. In addition to their roles in pathogen defense, glucan endo-1,3-beta-glucosidases have been implicated in cell division, pollen development, pollen tube growth, regulation of plasmodesmata signaling, cold response, seed germination, and maturation [12].
Glucan-1,3-beta-glucosidases form highly complex and diverse gene families in plants, and a single plant species may have several copies of glucan-1,3-beta-glucosidase genes [12]. The glucan-1,3-beta-glucosidases are enzymes that cleave the beta-glycosidic linkages of glucans. They can be divided into two groups, exo and endo. The exo-hydrolases catalyze the hydrolysis of the beta-glucan chain by sequentially cleaving glucose residues from the non-reducing end, releasing glucose as the sole hydrolysis product. The endo-hydrolases cleave β-linkages at apparently random sites along the polysaccharide chain, releasing smaller oligosaccharides [13]. The enzyme glucan-1,3-beta-glucosidase is important for delaying the growth of pathogenic fungi and for decreasing the damage caused by disease in fruits. This application is possible because the cell walls of certain microorganisms contain β-glucans [14].
Many studies have shown that the synthesis of glucan endo-1,3-beta-glucosidase is stimulated, and its concentration increases dramatically, when plants are infected by fungal, bacterial, or viral pathogens. For instance, mRNA for a tomato glucan endo-1,3-beta-glucosidase accumulated to a higher level in leaves infected with the fungal pathogen Cladosporium fulvum [15]; similar accumulation has been reported in barley infected with powdery mildew [16], maize infected with Aspergillus flavus [17], pepper infected with Phytophthora capsici, wheat infected with Fusarium graminearum [11], chickpea infected with Ascochyta rabiei (Pass.) Labr. [18], and peach infected with Monilinia fructicola [19]. Scientists throughout the world have tried to analyze or predict the regulatory elements of pathogenesis-related genes in higher plants whose expression products have an inhibitory effect on microorganisms such as fungi. However, only a small percentage of PR genes have been investigated.
To the best of our knowledge, there is no report that evaluates the regulatory elements of glucan endo-1,3-beta-glucosidase genes in potato (Solanum tuberosum L.). Moreover, owing to the crucial roles of glucan endo-1,3-beta-glucosidase genes in the plant defense system, it is imperative to analyze and understand the promoter region and regulatory elements of these genes in Solanum tuberosum. This knowledge will contribute to our understanding of the expression profiles and regulatory mechanism of glucan endo-1,3-beta-glucosidase genes. It also provides a promising target for genetic engineering to improve glucan endo-1,3-beta-glucosidase expression in potato, raise the level of the defense response against fungal pathogens, and develop disease-resistant transgenic potato, which is an environmentally friendly approach to disease control.

Methods
A total of 27 whole genome shotgun gene sequences of glucan endo-1,3-beta-glucosidase for Solanum tuberosum cultivar DM 1-3 516 R44 were retrieved from the NCBI database available at https://www.nlm.nih.gov/gene. Of these, 19 were selected for analysis, while the remaining eight were excluded because they lacked a functional gene structure (many stop codons appeared in the middle and the reading frame was highly fragmented), as checked with CLC Genomics Workbench ver. 3.6.1 (http://clcbio.com, CLC bio, Aarhus, Denmark) (Table 1).
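As a rough illustration of the kind of check described above, the following Python sketch counts in-frame stop codons that occur before the final codon of a coding sequence. It is not the procedure implemented in CLC Genomics Workbench. The function name `internal_stop_count`, the assumption of an intron-free coding sequence read in frame from its first base, and the example sequences are all hypothetical.

```python
# Standard stop codons of the nuclear genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_count(cds: str) -> int:
    """Count in-frame stop codons that appear before the last codon.

    Assumption: `cds` is an intron-free coding sequence in frame 0.
    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is expected to be a stop, so it is not counted.
    return sum(1 for codon in codons[:-1] if codon in STOP_CODONS)


# Hypothetical toy sequences, not taken from the study.
intact = "ATGAAAGGGTAA"        # ATG AAA GGG TAA
fragmented = "ATGAAATAGGGGTAA"  # ATG AAA TAG GGG TAA
print(internal_stop_count(intact), internal_stop_count(fragmented))
```

A sequence with several internal stops under this check would correspond to the fragmented reading frames that led to exclusion.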

Finding of transcription start sites and determination of promoter sequence
Glucan endo-1,3-beta-glucosidase gene sequences of Solanum tuberosum cultivar DM 1-3 516 R44 were downloaded in FASTA format from the NCBI Genome Browser, and 1-kb DNA sequences upstream of the ATG start codon were used as input for determining the transcription start sites (TSSs) of the retrieved genes. The Neural Network Promoter Prediction (NNPP version 2.2) tool, available at https://www.fruitfly.org/seq_tools/promoter.html, was used with the minimum standard predictive score (between 0 and 1) [20]. For regions containing more than one TSS, the one with the highest prediction score was considered.
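The selection rule for regions with more than one predicted TSS can be sketched in Python as follows. This is only an illustration: the `TssPrediction` record, the `best_tss` function, and the example values are hypothetical, and the cutoff of 0.8 is the value reported with the results of this study rather than a fixed NNPP setting.

```python
from typing import List, NamedTuple, Optional


class TssPrediction(NamedTuple):
    position: int  # bp relative to the ATG start codon (negative = upstream)
    score: float   # prediction score between 0 and 1


def best_tss(predictions: List[TssPrediction],
             cutoff: float = 0.8) -> Optional[TssPrediction]:
    """Return the highest-scoring prediction at or above the cutoff."""
    kept = [p for p in predictions if p.score >= cutoff]
    if not kept:
        return None
    return max(kept, key=lambda p: p.score)


# Hypothetical predictions for one promoter region (not data from the study).
example = [
    TssPrediction(-712, 0.86),
    TssPrediction(-305, 0.97),
    TssPrediction(-90, 0.62),
]
print(best_tss(example))
```

Here the prediction at −90 is discarded by the cutoff, and the prediction at −305 is retained because it has the highest score.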
Motif discovery and comparison of the discovered motif against a database of known motifs
Motif discovery was performed with the MEME (Multiple Em for Motif Elicitation) suite version 3.5.4, available at http://meme-suite.org/tools/meme, using minimum and maximum motif widths of 6 and 50 bp, respectively, and a maximum number of 3 motifs; the rest of the parameters were kept at their defaults. The MEME output was shown in HTML, as well as in several other formats. The motif with the lowest E-value was compared against a database of known motifs using TOMTOM, which ranks the motifs in the database and produces an alignment for each significant match [21]. For each query, TOMTOM reported a list of target motifs ranked by the p-value and q-value of each match [22]. TOMTOM also displayed putative transcription factors (TFs) that resemble the TFs of glucan endo-1,3-beta-glucosidase genes. Finally, after the putative TFs interacting with the DNA motif were identified, their roles were described.

CpG island analysis
Sequences of 2000 bp upstream of the ATG for each glucan endo-1,3-beta-glucosidase gene of Solanum tuberosum cultivar DM 1-3 516 R44 were downloaded in FASTA format from NCBI (https://www.ncbi.nlm.nih.gov/), and CpG islands were predicted using CLC Genomics Workbench ver. 3.6.1 (available at http://clcbio.com, CLC bio, Aarhus, Denmark). Searching for MspI cutting sites (fragment sizes between 40 and 220 bp) is relevant for the detection of CpG islands (CGIs), because studies using whole genome CpG island libraries prepared for different species revealed that CpG islands are not randomly distributed but are concentrated in particular regions, and because CpG-rich regions can be isolated as short fragments after digestion with MspI, which recognizes CCGG sites [23]. The parameter settings were as follows: a guanine and cytosine (GC) content greater than or equal to 55%, an observed to expected CpG ratio (Obs CpG/Exp CpG) greater than or equal to 0.65, and a length ≥500 bp [24].
Table 1 List of the glucan endo-1,3-beta-glucosidase genes of Solanum tuberosum cultivar DM 1-3 516 R44 selected for analysis
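The CpG island criteria and the MspI fragment search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the CLC Genomics Workbench implementation: the function names are hypothetical, the observed to expected ratio uses the common formula (number of CpG × N) / (number of C × number of G), only fragments flanked by two CCGG sites are reported, and the example sequences are artificial.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (CpG count * N) / (C count * G count)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_cgi_criteria(seq: str, min_gc: float = 0.55,
                       min_ratio: float = 0.65, min_len: int = 500) -> bool:
    """Apply the three thresholds stated in the text to one sequence."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_ratio)


def mspi_fragments(seq: str, min_size: int = 40, max_size: int = 220) -> list:
    """Sizes of fragments between consecutive MspI cuts (C^CGG) in range.

    Assumption: terminal fragments with only one MspI end are ignored.
    """
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq.upper())]
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    return [s for s in sizes if min_size <= s <= max_size]


# Artificial example sequence (not data from the study).
demo = "A" * 10 + "CCGG" + "A" * 50 + "CCGG" + "A" * 10
print(mspi_fragments(demo))
```

In practice a sliding window would be moved along the 2000-bp upstream region and the thresholds applied to each window; that step is omitted here for brevity.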

Phylogenetic relationship analysis
The phylogeny was inferred using the UPGMA method [26]. The analysis involved 40 glucan endo-1,3-beta-glucosidase gene sequences selected from Solanum tuberosum, Nicotiana tabacum, Solanum lycopersicum, and Arabidopsis thaliana [26]. The genetic distances were computed using the p-distance method [27]. Codon positions included were 1st+2nd+3rd+Noncoding. All ambiguous positions were removed for each sequence pair (pairwise deletion option). The phylogenetic analysis and the determination of genetic distances, conserved sites, variable sites, and base composition of the gene sequences were carried out using Molecular Evolutionary Genetics Analysis X32 (MEGA X32), available at https://www.megasoftware.net/ [28].
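The p-distance with pairwise deletion can be illustrated with a short Python sketch. It assumes two already aligned sequences of equal length and treats any character other than A, C, G, or T (gaps, N, other ambiguity codes) as a position to be skipped for that pair. The function name `p_distance` and the example sequences are hypothetical, and this is not the MEGA implementation; the UPGMA clustering step is not shown.

```python
VALID_BASES = set("ACGT")


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among sites comparable in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Pairwise deletion: skip the site if either base is a gap or ambiguous.
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Artificial aligned pair: 8 comparable sites, 1 difference.
print(p_distance("ACGTACGT-A", "ACGAACNTTA"))
```

Computing this value for every pair of the 40 sequences would give the distance matrix from which a UPGMA tree is built.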

Results
Finding of transcription start sites and determination of promoter sequence
Transcription start sites (TSSs) predicted for each of the 19 genes studied are presented in Table 2. The prediction showed that the glucan endo-1,3-beta-glucosidase genes of Solanum tuberosum cultivar DM 1-3 516 R44 had 1 to 3 TSSs each. The predictive score for the majority, 16 (84.2%), of the promoter regions was 0.90 or above. The highest promoter prediction score (1.0) was obtained for only two gene sequences (Pro-102604922 and Pro-102581946), while the lowest promoter prediction score (0.8) was obtained for none of them (Table 2). In addition, at a cutoff value of 0.80, the majority, 12 (63.2%), of the gene sequences showed only one TSS, while 7 (36.8%) revealed multiple TSSs.
In general, the TSSs of the gene sequences were located between −79 and −2900 bp relative to the translation start codon (ATG). The highest occurrence was in the region upstream of −1000 bp (5 sequences), followed by the −201 to −400 bp and −601 to −800 bp regions (4 sequences each), −1 to −200 bp (3 sequences), and −401 to −600 bp (2 sequences), while the lowest occurrence was observed at −801 to −1000 bp (1 sequence).

Discovery of common motifs and associated TFs in the promoter regions
In the current study, five candidate motifs shared by the glucan endo-1,3-beta-glucosidase gene promoter sequences of Solanum tuberosum cultivar DM 1-3 516 R44 were discovered (Table 3). The majority of the discovered common motifs were concentrated between +1 and −500 bp of the TSSs. MEME generated common candidate motifs for 18/19 of the gene promoter sequences. The discovered motifs were distributed on both strands, with 30 on the positive strand and 25 on the negative strand, as shown in Fig. 1.
To determine a candidate common promoter motif that is functionally important, the motif shared by the majority of promoter regions of Solanum tuberosum glucan endo-1,3-beta-glucosidase genes was selected. Among the five motifs, MβGII was identified as the common promoter motif, shared by 94.4% of Solanum tuberosum glucan endo-1,3-beta-glucosidase promoters. A common promoter motif serves as a binding site for transcription factors involved in the expression and regulation of these genes. A sequence logo for MβGII generated by MEME is presented in Fig. 2. Further analysis was carried out to obtain more information on the MβGII motif of the potato (Solanum tuberosum DM 1-3 516 R44) glucan endo-1,3-beta-glucosidase genes. To this end, MβGII was compared with registered motifs in publicly available databases to see whether it is similar to known regulatory motifs.

Discovery of matches to the query motif
Among the five common candidate motifs discovered, MβGII, with an E value of 3.5e−001, was used as the query motif for comparison against the JASPAR2018_CORE_vertebrates non-redundant and uniprobe_mouse databases of known motifs using the TOMTOM web application [21]. The analysis showed that the query motif matched the binding motifs of 8 transcription factors (Table 4).

CpG island analysis
In the present study, CpG islands in the promoter regions were investigated by in silico digestion with the restriction enzyme MspI, and the result showed low CpG density in the investigated regions. Fragments were observed only in gene IDs 102593331 and 102595860 (Table 5). The low density of CpG islands might be associated with selective gene expression in specific tissues.

SSR motif occurrence in sequences
In the present study, 265 different SSR motifs, ranging in size from 2 to 6 bp (dimer to hexamer) and in number from 2 to 9 per gene, were detected in the gene sequences of Solanum tuberosum cultivar DM 1-3 516 R44 examined (supplementary table 1). Dimer motifs such as ac, at, ag, ca, ct, ga, gt, ta, and tc were found in the majority (95%) of the gene sequences. Given the large number of tandem repeats present, they are likely to affect the glucan endo-1,3-beta-glucosidase genes of Solanum tuberosum cultivar DM 1-3 516 R44. Gene sequences with the highest number of dimer repeats are shown in Table 6.
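A search for SSR motifs of 2 to 6 bp can be sketched with a regular expression, as shown below. The text does not state which tool or minimum repeat number was used, so the threshold of three tandem copies (`min_repeats=3`), the exclusion of single-base runs, the function name `find_ssrs`, and the example sequence are all assumptions made for illustration.

```python
import re


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 3) -> list:
    """Return (unit, repeat_count, start_index) for each tandem repeat found.

    The lazy quantifier makes the regex try the shortest unit first, so
    'atatatat' is reported as 'at' x 4 rather than 'atat' x 2.
    """
    seq = seq.lower()
    pattern = re.compile(
        r"([acgt]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        # Skip runs of a single base such as 'aaaaaa' (assumed not of interest).
        if len(set(unit)) == 1:
            continue
        repeats = len(match.group(0)) // len(unit)
        hits.append((unit, repeats, match.start()))
    return hits


# Artificial example: the dimer 'ac' repeated four times, starting at index 3.
print(find_ssrs("ttg" + "ac" * 4 + "gg"))
```

Counting the hits per gene and grouping them by unit length would give summaries of the kind reported in this section.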
Genetic divergence among gene sequences from different plant species
The genetic distance was assessed using 40 gene sequences (supplementary table 2).
Phylogenetic relationships of glucan endo-1,3-beta-glucosidase gene sequences
The phylogenetic tree resulted in seven clusters: cluster I comprised 9 gene sequences, 3 from Nicotiana tabacum, 2 from Arabidopsis thaliana, 3 from Solanum tuberosum, and 1 from Solanum lycopersicum; cluster II comprised 8 gene sequences, 5 from Nicotiana tabacum, 2 from Solanum tuberosum, and 1 from Solanum lycopersicum; cluster III comprised 7 gene sequences, 5 from Solanum tuberosum, 1 from Nicotiana tabacum, and 1 from Arabidopsis thaliana; cluster IV comprised 4 gene sequences, 2 from Arabidopsis thaliana, 1 from Nicotiana tabacum, and 1 from Solanum tuberosum; cluster V consisted of 3 gene sequences entirely from Solanum tuberosum; cluster VI comprised 4 gene sequences, 2 from Nicotiana tabacum, 1 from Solanum lycopersicum, and 1 from Solanum tuberosum; and cluster VII comprised 2 gene sequences mainly from Solanum tuberosum. Meanwhile, two gene sequences from Solanum tuberosum and one from Arabidopsis thaliana were individually isolated from the clusters (Fig. 3).

Multiple sequence alignment of the gene sequences
The multiple sequence alignment was conducted using the Clustal Omega algorithm available online at https://www.ebi.ac.uk/Tools/msa/. The resulting pairwise scores ranged from 24.4% (between ID107820469 and ID102605428) to 95.2% (between ID107803828 and ID107824944), as shown in supplementary table 4. The numbers of conserved and variable sites and the frequencies of nucleotide bases are given in Table 7. Gene ID102601178 in Solanum tuberosum had the lowest proportions of both conserved sites and variable sites (7.5% and 20.7%, respectively), whereas gene ID102589208 in Solanum tuberosum had the highest proportion of conserved sites (28.8%) and gene ID832156 in Arabidopsis thaliana had the highest proportion of variable sites (76.1%).

Discussion
Locating the transcription start site (TSS) enables prediction of the promoter region and thus simplifies the subsequent analysis of gene expression. In the present in silico analysis, the number of TSSs per gene sequence was 1 to 3, and the majority, 12 (63.2%), of the gene sequences had a single transcription start site, consistent with a previous report [29] that 62.1% of the gene sequences contained a single TSS. However, most in silico studies have reported that most genes have more than one TSS [30][31][32][33][34]. In the present study, it was also revealed that the locations for 42% of the TSSs were below −500 bp relative to the ATG. However, several authors reported that the location of the TSSs of the majority (>50%) of the gene sequences studied was below −500 bp relative to the ATG [35][36][37][38]. Patterns of gene expression (conditional or temporal) have been linked to transcription regulation [39]. Common promoter motifs are short DNA segments that serve as binding sites for TFs involved in the regulation of gene expression [31]. In the present study, the common promoter motif was found in 18 (94.4%) of the promoter sequences investigated. Some studies reported that a common promoter motif was shared by all the promoter sequences (100%) [29,32]. The search for matches to the query motif showed that it serves as a binding site for 8 transcription factors, which are involved in the regulation of gene expression as receptors, transcription factors, or repressors in various biological processes (Table 4).
Several studies reported that CpG islands (CGIs) play an important role in the regulation of gene expression [40]. DNA of plant species has been shown to contain more CpG dinucleotides than human DNA [41]. Methylation of cytosine at CpG islands has been shown to restrict the access of transcription factors to the promoter regions of genes, hence preventing their expression [42]. Consistent with the present analysis, low CpG content was reported in the promoter region of rice PR2 (beta-1,3-glucanase) genes, whereas no CpG islands were identified in the promoter regions of any of the Arabidopsis thaliana PR gene families [43]. The absence of CpG islands in the glucan endo-1,3-beta-glucosidase gene (PR2) might be indicative of tissue-specific gene expression. Ferguson and Jiang [44] also showed that dicot genomes such as that of potato have a lower CpG density than monocot genomes. Conversely, Gardiner-Garden and Frommer [45] reported that, in plants, high-density CpG islands tended to lie near the 5′-ends (towards the promoter region) of housekeeping genes, which is associated with broad expression of these genes.
In the current study, the cluster analysis showed that the gene sequences from different plant species clustered together. In our results, the range of conserved sites was between 7.5 and 28.8% while the range of variable sites was between 20.7 and 76.1%.
Though the percentage range of variable sites was wider than that of the conserved sites, the phylogeny showed the opposite relationship. In the present study, the SSR motifs ranged in size from 2 to 6 bp (dimer to hexamer), and the number of SSR motifs per gene ranged from 2 to 9. The SSR motif analysis also revealed a lack of significant variation in the repeat number of the SSR motifs between gene sequences of the different plant species and a lack of differences within the repetitive SSR motifs between gene sequences within species. It is known that SSRs within genes can (i) lead to a gain or loss of gene function, (ii) affect transcription and translation, (iii) affect mRNA splicing, or (iv) affect export to the cytoplasm. All these effects eventually lead to phenotypic changes [42]. Most often, the repeat motif does not exceed nine nucleotides in length; such repeats are referred to as short tandem repeats (STRs), simple sequence repeats (SSRs), or microsatellites. Short tandem repeats are associated with a higher frequency of mutation, affecting DNA sequence composition and length [46]. CGIs are known to concentrate near the transcription start sites (TSSs) of genes. Genes that possess CGIs are often highly expressed in multiple tissues. In the current study, CpG island analysis of the promoter region showed a low density of CpG islands. Possibly, low CpG island density could be one reason for the lack of divergence between gene sequences. According to Prendergast et al. [47], CpG island poor regions are not subjected to evolutionary divergence. Moreover, because of the lack of significant differences in the repeat number of SSR motifs between gene sequences of the different plant species and the lack of differences within the repetitive SSR motifs between gene sequences within species, the phylogenetic analysis did not show a clear and defined phylogenetic relationship.
Therefore, further analysis of CpG islands and their convergence into TSSs of genes and involvement in evolutionary divergence will pave the way for a greater understanding of their roles in gene expression and gene evolution.

Conclusion
The major aim of this work was to explore regulatory elements that can determine the expression of glucan endo-1,3-beta-glucosidase genes of Solanum tuberosum cultivar DM 1-3 516 R44. The study identified transcription factors that serve as receptors, activators, and/or repressors of glucan endo-1,3-beta-glucosidase genes. In addition, transcription start sites, promoter regions, SSR motifs, and CpG islands in glucan endo-1,3-beta-glucosidase genes that play a role in the regulation of gene expression were identified. The phylogenetic analysis revealed that the clustering patterns of the gene sequences were not entirely based on taxa. In general, this in silico analysis contributes to the understanding of the regulatory mechanisms involved in glucan endo-1,3-beta-glucosidase gene expression and helps to identify gene regulatory elements in the promoter regions.
Abbreviations
TSS: Transcription start site; MβGII: Motif of beta-glucosidase; TFs: Transcription factors; SSR: Simple sequence repeat; MEME: Multiple Em for Motif Elicitation; NCBI: National Center for Biotechnology Information; bp: Base pair; NNPP: Neural Network Promoter Prediction