Genome-wide SNP and InDel analysis of three Philippine mango species inferred from whole-genome sequencing

Background The Philippines is among the top 10 major exporters of mango worldwide. However, genomic studies of Philippine mangoes remain largely unexplored and lacking. Here, we sequenced the whole genome of the three Philippine mango species, namely, Mangifera odorata (Huani), Mangifera altissima (Paho), and Mangifera indica “Carabao” variety using Illumina HiSeq 2500, to identify and analyze their genome-wide variants (SNPs and InDels). Results The high confidence variants were identified by successfully mapping 93–95% of the quality-filtered reads to the Alphonso and Tommy Atkins mango reference genomes. Using these two currently available mango genomes, most variants were observed in M. odorata (4,353,063 and 4,277,287), followed by M. altissima (3,392,763 and 3,449,917), and lastly, M. indica Carabao (2,755,267 and 2,852,480). Approximately 50, 46, and 38% of the variants were unique in the three Philippine mango genomes. The analysis of variant effects and functional annotation across the three mango species revealed 56,982 variants with high-impact effects mapped onto 37,746 genes, of which 25% were found to be novel. The affected mango genes include those with potential economic importance such as 6945 genes for defense/resistance/immune response, 323 genes for fruit development, and 338 genes for anthocyanin production. Conclusions To date, this is the first sequencing effort to comprehensively analyze genome-wide variants essential for the development of genome-wide markers specific to these mango species native to the Philippines. This study provides an important genomic resource that can be used for the genetic improvement of mangoes. Supplementary Information The online version contains supplementary material available at 10.1186/s43141-022-00326-3.


Background
The Philippines is among the top 10 major exporters of fresh and dried mangoes in the world. The country's mango export is valued at USD 91 million and contributes a 4% share of the global market [1,2]. The mango industry also supports about 2.5 M Filipino farmers [3]. In the first quarter of 2021, 97.9 thousand Mt of mangoes were produced in the Philippines and around 83% of which came from the Carabao mango variety (Mangifera indica) [4]. The Carabao mango is the Philippines' export variety which is known as one of the world's finest, superior quality, and sweetest mango varieties. Hence, Carabao is the country's flagship variety in the mango global value chain.
Mango belongs to the kingdom Plantae, order Sapindales, family Anacardiaceae (cashew family), subfamily Anacardioideae, and genus Mangifera. Mangifera indica, the common mango, is a juicy drupe that is usually found in tropical countries. It has varying sweetness and texture across cultivars and has a high incidence of hybridization with other members of its genus. This results in new varieties or species such as Mangifera odorata (M. indica x M. foetida) which is commonly known as Huani in the Philippines [5]. Huani is also known for its characteristic pungent smell and taste of turpentine. Another native species of mango in the country is Mangifera altissima which is locally known as Paho. Its unripe fruits are small and oftentimes used in salads in the Philippines.
Mango has a diploid chromosome (n=20 chromosomes), and its haploid genome size is relatively small (approximately 400 Mb) but complex due to its innate heterozygosity [6,7]. The mango seed exhibits apomixis and can produce one seedling (monoembryony) or multiple seedlings (polyembryony) in one seed. The former is common in varieties originating from India and mainland China [8] while the latter is observed in varieties that evolved in places closer to the equator such as the Philippines [9]. The complex (heterozygous) genome and polyembryonic nature of mango in the Philippines pose a significant challenge in genomics and plant breeding studies. Hence, despite the agricultural and economical importance of mango in the country, the genomic studies of Philippine mangoes remain lacking and largely unexplored.
Recently, the chromosome-level whole-genome sequencing (WGS) of Alphonso [7] and Tommy Atkins [6] was completed, providing high-quality reference genomes for mango. Both varieties are of the same species (M. indica) and are important varieties in the mango international trade. With the availability of WGS data, in-depth genome analysis can be performed to unravel gene networks, reveal intron-exon boundaries, detect transposable elements, discover novel biological processes, develop molecular markers tagging economically important traits for breeding (e.g., insect pest and disease resistance), and identify genome-wide variants such as single-nucleotide polymorphisms (SNPs) and insertionsdeletions (InDels) [10][11][12][13][14]. SNPs and InDels are differences and variations in the genome which can have a huge impact on the biological and physical traits of an organism.
In this study, we sequenced the whole genomes of three Philippine mango species, namely, Mangifera odorata (Huani), Mangifera altissima (Paho), and Mangifera indica 'Carabao' using Illumina HiSeq, to identify and characterize their genome-wide variants (SNPs and InDels). The high confidence variants were identified by successfully mapping the quality-filtered reads to the Alphonso and Tommy Atkins mango reference genomes. This study provides valuable information and resources for mango breeding and genetic studies.

Mango species used and DNA extraction
Three mango species native to the Philippines were used in this study, namely, Mangifera indica Carabao, Mangifera odorata (Huani), and Mangifera altissima (Paho). A high-quality DNA was extracted from three mango trees of the same species using the method of Inglis et al. [15] with modifications. Fresh, young leaves of mango were cut into small pieces (excluding the midrib and leaf veins) and then pulverized using liquid nitrogen for 20 s (2 to 3 cycles). About 150 g of the pulverized tissues was transferred to a microcentrifuge tube and then pre-washed by adding a sorbitol solution pre-added with 2-mercaptoethanol (1% v/v). The tube was centrifuged at 12,000 rpm for 5 min, then the supernatant was discarded. The pulverized tissues were lysed by adding 700 μL of CTAB in the tube, vortexed for 5 s, then heated at 65 °C for 1 h with inversion of the tube every 10 min. The tube was then left at room temperature for 10 min, and 700 μL of 24:1 chloroform to isoamyl alcohol solution (CIA) was added to separate the cellular components. The tube was vortexed for 10 s, then centrifuged at 12,000 rpm for 5 min. Afterwards, the supernatant was transferred to a new tube and 10% of 3M sodium acetic acid and ice-cold isopropanol (2x volume) were added. The tube was incubated for 1 h at −20 °C, then centrifuged at 10,000 rpm for 10 min. The supernatant was discarded, and the pellet (DNA) was washed with 1 mL of ice-cold 70% ethanol, then centrifuged at 10,000 rpm for 10 min. The ethanol was carefully removed, and the pellet was air-dried for 1 h and resuspended by adding 100 μL of Tris-EDTA (pre-added with RNAse). Afterwards, the tube was incubated at 37 °C for 30 min and then stored at −20 °C.
The quality of DNA was checked via gel electrophoresis using 1.5% agarose with SYBR Safe nucleic acid stain (Life Technologies Corporation, USA) and viewed using a gel documentation system (Gel Doc 1000, Bio-Rad Laboratories, USA). DNA samples showing bands were further checked using Epoch Microplate Spectrophotometer and fluorometer (DeNovix QFX Fluorometer), to ensure high-quality DNA that is amenable for the next-generation sequencing.

Whole-genome sequencing
The extracted high-quality DNA from three mango species were submitted for sequencing using the Illumina HiSeq 2500 platform (Macrogen, Korea) with a sequencing coverage of 1X per sample. Three DNA samples were sequenced per mango species. The raw reads of all samples were deposited in the NCBI under the BioProject number PRJNA740276.

Mapping of pre-processed short reads
The pre-processed paired sequences of three samples per mango species were concatenated and then mapped to the recently published mango reference genomes of Alphonso [7] (BioProject PRJNA487154) and Tommy Atkins [6] (BioProject PRJNA450143) using Burrows-Wheeler Aligner tool (BWA) [18]. The bwa index and bwa mem commands were used for indexing of reference genomes and alignment of short reads, respectively. The sequence alignment map (SAM) produced was used to count the mapped reads and determine the alignment rate of short reads to the reference genomes using SAMtools [19] and BamTools [20], respectively.

Variant calling
Using the SAM file from the read mapping step as input, an analysis-ready binary alignment map (BAM) file was generated using the Picard tools [21] following the Sort-SAM, FixMateInformation, MarkDuplicates, and AddOr-ReplaceReadGroups commands. The reference genome was indexed using the SAMtools faidx command and a sequence dictionary was created using the CreateSe-quenceDictionary command of Picard tools. Variants (such as SNPs and InDels) between the three Philippine mango species and reference genomes of Alphonso and Tommy Atkins were detected following the Genome Analysis Toolkit (GATK) Best Practices workflow [22]. The read mapping artifacts were minimized through local realignment around InDels by using the Realigner-TargetCreator and IndelRealigner commands. Variants were called using the HaplotypeCaller command by setting the output mode to EMIT_VARIANTS_ONLY and calling the confidence threshold (stand_call_conf ) to 20. The raw variant call format (VCF) file produced was filtered using the VariantFiltration command following the recommended parameters for SNPs and InDels. Using the SelectVariants -ef command, only the SNPs and InDels that pass the first filtering were printed and considered in the new VCF output. Then, base quality score recalibration was performed using BaseRecalibrator and PrintReads commands, to correct the bias of the per-base estimate of error generated by the sequencing platform. Afterwards, the second round of variant calling and filtering using the HaplotypeCaller and VariantFiltration commands, respectively, was performed to identify highconfidence SNPs and InDels. The final VCFs containing high confidence variants were then used as input to Cir-cosVCF [23] for visualization of variant density in circos plots. The VCFtools [24] was used to create an InDel histogram.

Variant effects, phylogenetic relationship, and kinship analysis
The generated VCFs of the three mango species were analyzed for variant effects on the gene regions using the SnpEff toolbox [25]. The SnpEff functional classes detected in all SNPs and InDels were 3′ and 5′ untranslated region (UTR) variant; downstream and upstream gene variant; intergenic region; intragenic variant; intron variant; splice acceptor, splice donor, and splice region variant; start lost and start retained variant; and stop gained, stop lost, and stop retained variant. The functional classes detected only for SNPs were 5′ UTR premature start codon gain variant, initiator codon variant, missense variant, and synonymous variant. Meanwhile, the functional classes detected only for InDels include 3′ and 5′ UTR truncation, bidirectional gene fusion, conservative inframe insertion and deletion, disruptive inframe insertion and deletion, exon loss variant, frameshift variant, and non-coding transcript variant. Other important information provided by SnpEff are the variant rate details (per chromosome), variant types, base changes for SNPs including transitions (Ts) and transversions (Tv) ratio, allele data, and variant effects by impact which are classified as high, moderate, low, and modifier. Only the SNPs and InDels identified as high impact were considered for further analysis. The generated VCFs were also used to construct a UPGMA phylogenetic tree using VCF2PopTree [26] as well as for kinship analysis using the vcf2kinship command of Rvtests [27] following the identity-by-state (IBS) model.

Gene ontology (GO), GO enrichment, and KEGG analyses of high-impact variants
The protein sequences of gene IDs identified as high impact were retrieved and Gene Ontology (GO) analysis was performed using the BLAST2GO package [28]. The homology of the protein sequences was determined using the UniProtKB/SwissProt protein database via BLASTp analysis (with an e value of 1e-3). The BLAST results were then mapped and annotated to produce the GO annotations from the three domains of molecular function (MF), biological processes (BP), and cellular component (CC) assigned to each protein sequence. GO enrichment analysis of biological processes was performed using agriGO [29,30]. The hypergeometric statistical test method and Yekutieli multi-test adjustment method [with False Discovery Rate (FDR) under dependency] were the parameters used for the analysis. The significance level was set at P < 0.05. KEGG analysis [31] was also performed using the single-directional best hit

Mapping of reads to the reference genomes
Trimming/filtering of the raw sequences produced a total of 22. 8 Table 1).

Identification of SNPs and InDels
By mapping the reads to the Alphonso genome (Table 2)

Distribution of SNPs and InDels
The density and frequency of SNPs and InDels in mango chromosomes (n=20) are presented in Figs. 1 and 2. The Alphonso variety has a decreasing chromosome size; thus, higher SNPs and InDels were observed in chromosome 1 and lowest at chromosome 20 in all mango species used (Fig. 2a, b). On the other hand, for the Tommy Atkins genome, a non-uniform distribution of SNP and InDels across the 20 mango chromosomes was observed in all mango species analyzed (Fig. 2c, d). Chromosome   indica Carabao (Figs. 1 and 2). The detected nucleotide substitutions in the SNPs are classified as transitions (Ts) which involve A/G and C/T substitution, and transversions (Tv) which include A/C, A/T, C/G, and G/T substitutions (Fig. 3). In the Philippine mangoes studied, Ts substitution was the most abundant (70%) compared to Tv substitution (30%) regardless of the reference genome used. With this, the Ts/Tv ratios of the three mango species used ranged from 2.33 to 2.43 upon mapping to the Alphonso and Tommy Atkins genome. In Ts, the number of A/G is almost equal to the C/T type in each mango species, while for Tv, A/T substitution was the highest comprising 35-36% of Tv substitutions (Fig. 3). Similar to SNPs, InDels were also highest in M. odorata and lowest in M. indica Carabao (Figs. 1 and 2). The predominant length of InDels ranged from 1 to 12 bp which accounts for around 92% of the total number of InDels, of which 48% were mononucleotide InDels (Fig. 4).

Shared and unique SNPs and InDels
The three mango species shared 449,112 and 492,271 SNPs relative to the Alphonso and Tommy Atkins reference genomes, respectively (Fig. 5a, b). Likewise, the three species shared 117,998 and 121,266 InDels based on the two reference genomes (Fig. 5c, d) Fig. 5b and d).

Analysis of variant effects
Analysis of the functional classes of identified SNPs are shown in Table 3. Majority of the SNPs observed were in the intergenic (

GO analysis and annotation of high-impact variants
The SNPs and InDels with high-impact effects were functionally annotated and used for GO enrichment analysis. A total of 21 GO-enriched terms for biological process (GO:0008150) were detected in the genes with highimpact variants (Supplemental File 2). GO enrichment analysis showed that regulation of biological processes (GO:0050789), biological regulation (GO:0065007), response to stimulus (GO:0050896), and most especially, cellular process (GO:0009987) and metabolic process (GO:0008152) were the highly enriched biological processes in the three mango species (Fig. 6 and Supplemental File 2). In this study, a total of 56,982 highimpact variants were identified and mapped onto 37,746 genes across the three mango species (Supplemental Table 1). Around 75% (28,337) of these genes containing high-impact variants were well-known, while 25% (9409) remain unknown (Supplemental Table 1). Among the high-impact variants found in well-annotated genes include those with potential economic importance and useful for breeding, i.e., 6945 genes for defense/resistance/immune response to insects and pathogens, 323 genes for fruit development, and 338 genes for anthocyanin production found across the Philippine mango species studied ( Table 5). The complete GO enrichment analysis (with FDR values) is provided in Supplemental File 2, and the complete functional annotation of genes with high-impact variants is provided in Supplemental Files 3A (Alphonso) and 3B (Tommy Atkins).

Analysis of shared and unique genes with high impact variant effects
Compared to the Alphonso genome, 772 and 890 genes with high-impact SNPs and InDels, respectively, were  (Fig. 7a). Compared to Tommy Atkins genome, 624 and 576 genes with high-impact SNPs and InDels, respectively, were found unique to M. odorata, 432 and 577 genes for M. altissima, and 328 and 389 genes for M. indica Carabao (Fig. 7b). Meanwhile, 195 and 197 genes with high-impact variant effects were shared among the three Philippine mangoes using the two reference genomes, respectively (Fig. 7, Supplemental File 4).

Phylogenetic and kinship analyses
In terms of alleles observed in the mango species, M. odorata showed the highest number of alleles (1.5 million), followed by M. altissima (1.3 million), and lastly M. indica Carabao (1.1 million) using the two reference genomes. All allele data (i.e., number of alleles, total heterozygous alleles, total missing alleles, and total polymorphic alleles) are presented in Supplemental Table 2.  Table 3).

Discussion
Genome-wide variant analysis revealed that most variants (SNPs and InDels) were observed in M. odorata (4,353,063 and 4,277,287 for Alphonso and Tommy Atkins genomes, respectively) and least in M. indica Carabao (2,755,267 and 2,852,480 for Alphonso and Tommy Atkins genomes, respectively) ( Table 2). This result is expected as M. odorata is a heterozygous variety and believed to be a cross between M. indica and M. foetida [5]. M. indica and M. foetida belong to separate Mangifera subgenus: Mangifera Mangifera and Mangifera Limus, respectively. Hence, M. odorata showed the highest variation as a hybrid of the two subgenera. It is followed by M. altissima, a highly homozygous, self-pollinating, mango species which belong to the subgenus Mangifera. The Carabao variety, although a heterozygous cultivar, showed the least number of variants which could be explained by its conspecificity with the two reference genomes (M. indica). Phylogenetic and kinship analyses also revealed that M. indica Carabao is more related to M. odorata than with M. altissima, as shown in the clustering in the dendrogram and kinship (IBS) values (Supplemental Fig. 1, Supplemental Table 3). A pioneering effort Many overlapping variants were observed in the three mango species (Fig. 5). These could be utilized for further research of common function or phenotype of Mangifera species. On the other hand, approximately 50, 46, and 38% of the variants were unique to M. odorata, M. altissima, and M. indica Carabao, respectively, upon comparison to the two currently available mango reference genomes (Fig. 5). The unique variants could be used for further characterization and genetic research of specific mango species or varieties. The observed Ts/ Tv ratios are comparable to the findings of Bally et al. [6] for mango, thus indicating the correctness of the workflow used in this study. The high occurrence of Ts (Fig. 3) is termed as "transition bias" and has been reported in many crop species such as rice [33,34], foxtail millet [35], maize [36], tea plant [37], and soybean [38]. The high rate of A/G and C/T substitutions (Fig. 3) is likely attributed to the methylation of C when it is adjacent to G (CpG dinucleotides), forming a 5-methylcytosine that can transition into T upon deamination, thus also causing a G to A substitution on the other hand [38,39]. The number of InDels tends to decrease gradually as the length of InDel increases (Fig. 4). In this study, the predominant InDel length for the mango was 1 to 12 bp with almost half consisting of mononucleotide InDels. In tea plants, the predominant InDel length is 1 to 20 bp with mononucleotide InDels as the most abundant type [37]. More high-impact  variants were observed in InDels than SNPs, leading to a greater number of genes with high-impact InDels (Supplemental Table 1). High-impact variants result in protein truncation or triggering loss/gain of function, frameshift variant, or splice donor variant [40].
In the Philippines, the occurrence of insect pests (e.g., oriental fruit fly, cecid fly) and diseases (e.g., anthracnose, scab, stem-end rot) [41][42][43][44] limits the country from maximizing mango export potential. These biotic constraints are often difficult to control and can affect mango at different developmental stages causing a significant reduction in fruit yield and quality [45,46]. Thus, breeding of mango for resistance can provide a long-term solution for the Philippines. The source reference genomes Alphonso and Tommy Atkins are reported for their long shelf life which is also associated to their considerable resistance to diseases [47][48][49]. This highlights the importance of the identified defense/resistance/immune responserelated genes totaling to 6945 genes ( Table 5, Supplemental Files 3A and 3B). The two reference varieties also express red/pink blush on their fruit peel, in contrast to the Philippine mango species studied which only appear green or yellow throughout their fruit stages until ripening. In recent years, the Philippines has been interested in developing a mango export variety with a red/pink blush appearance to target international markets that prefer this type of mango. The red/pink blush coloration of mango peel is mainly attributed to anthocyanin production [50] wherein genes related to this biochemical process have been identified in this study totaling to 338 genes ( Table 5, Supplemental Files 3A and 3B). KEGG analysis revealed that these genes (including other genes with high impact variants) are involved in the flavonoid biosynthesis pathways which provide precursors for the biosynthesis of anthocyanins (Supplemental Fig. 2).
Analysis of variant effects and functional annotation across the three mango species revealed that 25% of genes containing high-impact variants were found to be novel, or their biological functions have not yet been investigated in mangoes (Supplemental Table 1). Meanwhile, approximately 200 genes with high-impact variants were commonly shared among all mango species which imply consistent gene variations to the two reference genomes (Fig. 7, Supplemental File 4). Analysis of this gene set showed that more than 30% encode proteins related to defense/resistance/immune response against pests and diseases (Supplemental File 4). Among these include the disease resistance proteins At4g27190, At4g27220, At5g63020, and At3g14460 which are proteins reported from Arabidopsis thaliana; RPP proteins (RPP13, RPP8, RPP13-like proteins 1, 2, and 3) which provide resistance against downy mildew caused by Peronospora parasitica [51][52][53]; RGA/RGA-blb proteins (RGA1-blb, RGA3-blb, and RGA4-blb) which are known to confer resistance against the devastating late blight disease caused by Phytophthora infestans [54,55]; RPS (RPS2 RPS4, RPS5, and RPS6) and RPM1 proteins which provide resistance against the pathogen Pseudomonas syringae [56][57][58]; and LRK10L-1.2 protein which confers resistance against leaf rust caused by Puccinia triticina [59,60]. Among these proteins, Lantican et al. [12] reported that the mangospecific orthogroup containing disease resistance protein At4g27220 was observed to have the highest number of members among the orthologous RGA (resistance gene analogs) gene sets in mango. Meanwhile, the RPP13-like protein 1 orthogroup is among the largest families of resistance genes in many crops and was also observed to have the highest frequency of gene duplication events in mango [12]. This suggests that these proteins also contributed to the evolutionary adaptation of mango during selective pressure caused by biotic stresses.

Conclusion
The whole genome of three Philippine mango species M. odorata (Huani), M. altissima (Paho), and M. indica Carabao was successfully sequenced and compared to two currently available mango reference genomes. This revealed the genome-wide variants (SNPs and InDels) including those putative genes with high-impact effects on economically important traits. To date, this is the first sequencing effort to comprehensively analyze genome-wide variants essential for the development of genome-wide markers specific to the Philippine mango species. The availability of this information provides novel genomic resources positioned to revolutionize the mango breeding programs in the Philippines.