Chloroplast genome sequencing is becoming a valuable process for developing several DNA barcodes. At present, plastid DNA barcode for systematics and evolution in flowering plant rely heavily on the use of non-coding genes. The present study was performed to verify the novelty and suitability of the two hotspot barcode plastid coding gene ycf1 and ndhF, to estimate the rate of molecular evolution in the Prunus genus at low taxonomic levels.
Here, 25 chloroplast genomes of Prunus genus were selected for sequences annotation to search for the highly variable coding DNA barcode regions. Among them, 5 genera were of our own data, including the ornamental, cultivated, and wild haplotype, while 20 genera have been downloaded from the GenBank database. The results indicated that the two hotspot plastid gene ycf1 and ndhF were the most variable regions within the coding genes in Prunus with an average of 3268 to 3416 bp in length, which have been predicted to have the highest nucleotide diversity, with the overall transition/transversion bias (R = 1.06). The ycf1-ndhF structural domains showed a positive trend evident in structure variation among the 25 specimens tested, due to the variant overlap’s gene annotation and insertion or deletion with a broad trend of the full form of IGS sequence. As a result, the principal component analysis (PCA) and the ML tree data drew an accurate monophyletic annotations cluster in Prunus species, offering unambiguous identification without overlapping groups between peach, almond, and cherry.
To this end, we put forward the domain of the two-locus ycf1-ndhF genes as the most promising coding plastid DNA barcode in P. persica at low taxonomic levels. We believe that the discovering of further variable loci with high evolutionary rates is extremely useful and potential uses as a DNA barcode in P. persica for further phylogeny study and species identification.
Peach [Prunus persica (L.) Batsch], a member of the Rosaceae family, is one of the most genetically important fruit trees in temperate regions . It belongs to five wild relatives which are generally accepted: Prunus mira, Prunus davidiana Franch, Prunus davidiana var. potaninii Rehd., Prunus kansuensis Rehd., and Prunus ferganensis Kost. and Riab . These wild relatives have attracted attention because they allow a natural diversity panel that offered an opportunity to introduce the traits of interest from native species into the peach texture. Owing to the high similarity and monophyletic clades concept within P. persica may cause a lot of complications in the classification of the species. Therefore, with the recent progress toward the whole genome sequence of peach  along with the assembly of Chloroplast DNA (cpDNA) genome data of P. persica cv. Lovell , provided a new era for the development of comparative cpDNA genome studies and the discovery of DNA barcode genes in peach. Fortunately, the decreasing cost of high-throughput sequencing of the cp genome offers opportunities to gain more cp genome sequences and discover useful particular DNA barcode of coding and non-coding plastid genes in Prunus species .
It is well known that the chloroplasts are the photosynthetic organelles in plant cells, provide energy to green plants and plays an important role in sustaining life . The chloroplast genome is the third-largest genome after the nuclear genome and mitochondria, which encodes many key proteins that are involved in photosynthesis and other metabolic processes . The chloroplast genomes have different features, e.g., maternal inheritance in most angiosperms, and high conservation in genome structure and gene contents . The cpDNA genome is extremely conserved with a self-replicating circular molecule and has a typical quadripartite structure, in which two inverted repeats (IRs) are separated by a large single-copy region (LSC) and a small single-copy region (SSC) . Most of the cpDNA contains approximately 110–133 genes, including protein-coding genes (CDS), ribosomal RNA genes, and transfer RNA genes. The non-coding and coding regions of the chloroplast genome had a diverse signature at both high and low taxonomic levels, making them appropriate for systematic and evolution studies . Albeit, the non-coding region has less functional constraint than the coding region, it offers superior levels of evolutionary rate for phylogenetic and barcoding studies at the subspecies level, while the coding region is highly conserved and suitable only for higher taxonomic levels .
In the past decade, the two protein-coding genes matK and rbcL were chosen as core plant DNA barcodes , while other protein-coding genes, like atpF-H, psbK-I, ropC1, and rpoB, are lacking resolution and have been recommended as supplemental barcodes in diversity within flowering plants . Unfortunately, the discrimination power of discovered barcodes coding genes is too weak to drive through all species, especially in higher plants . Hence, there are no universal barcode loci neither for all plants nor for Prunus species. Dong et al.  proposed that ycf1 is the most promising plastid DNA barcode for land plants and plays an important role in genome evolution. Other evidence supposed that among the protein-coding genes, ycf1 and ndhF are appreciated sources of phylogenetic relationship provide effective information and DNA barcodes-based cpDNA genome for phylogeny and species identification in breeding resources [15,16,17]. Later, Jeon and Kim  suggested that the combination of two-locus ycf1 and ndhF genes is beneficial for deciphering phylogenetic relationships between closely related taxa in Rosaceae. Although these two hot spot genes have superior efficiency in discrimination at the low taxonomic level, still little attention for DNA barcoding and molecular evolution purposes are received [12, 18].
To date, with the documented deficiency of the ycf1-ndhF coding genes in Prunus, we reported for the first time a detailed overview of these hotspot regions to investigate evolutionary relationships between 25 Prunus, Malus, and Pyrus species. Through this research, the performance and efficiency of ycf1-ndhF genes were evaluated as hotspot regions for DNA barcoding and biodiversity which may be helpful in future breeding programs of peach. For this purpose, we achieved a comparative structure variation, the overlapping of ycf1-ndhF genes sequences, and phylogeny analysis within the species level of the ornamental, cultivated, and wild haplotype of P. persica.
Plant materials and DNA extraction
Peach specimens (P. persica) used in this study contained three edible cultivars, one ornamental cultivar, and a wild relative P. mira (Table 1). These five specimens were collected from the field gene bank of Chinese Academy of Sciences (CAS), Wuhan, China, in the juvenile stage in the spring season. Total genomic DNA was extracted from 100 mg of fresh leaves using Plant Genomic DNA Extraction Kit (DP305-03, Tiangen Biotech, Beijing, China) according to the manufacturer’s instructions. The DNA quantity was assessed using a spectrophotometer (Nanodrop 2000, Thermo Fischer, USA). Both the stock and diluted portions were stored at – 80 °C until use.
DNA sequencing, genome assembly, and validation
The Illumina HiSeq 2500 platform was used to sequence the total DNA of the five studied specimens. After sequencing, the raw data was initially screened to remove low-quality regions affecting the data quality and subsequent analysis needed to obtain the expected clean data. The cpDNA genome was assembled by mapping onto the public complete chloroplast genome of P. persica cv. Lovell (GenBank accession HQ336405) , and the genome assembly and alignment analyses were performed using Geneious R10 program (http://www.geneious.com; Biomatters Ltd., Auckland, New Zealand).
Genome annotation and analysis
In the present study, 25 cpDNA genomes were used for annotation, including the 5 peach specimens from our materials, in addition to the 20 cpDNA genomes that were downloaded from NCBI GenBank database. However, Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia were used as outgroup. These 25 cpDNA genomes representatives all major indigenous of Prunus species. Gene annotation of the 25 cpDNA genomes was performed with the online program Dual Organellar GenoMe Annotator (DOGMA) . Initial annotation, putative starts, stops, and intron positions were determined, and then the draft annotation was inspected and corrected manually by comparison with a homologous gene with the chloroplast genome of P. persica (NC_014697) from the NCBI database.
Identification of ycf1-ndhF genes, sequence editing and alignment
The ycf1-ndhF genes were obtained from the 25 cpDNA genomes using DOGMA analysis to compare the structure sequence, and the multiple sequence alignment was done using MUSCLE v3.70+ fix1-2 , and manually adjusted, as necessary. Nucleotide diversity (π), estimated values of transition/transversion bias (R), and nucleotide substitutions (r) for each sequence were performed using MEGA X program .
The analysis of the consensus phylogenetic tree was performed using 25 nucleotide sequences of ycf1-ndhF genes, including 22 species of Prunus in addition to the 3 species for Pyrus and Malus as an outgroup (Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia). To gain accurate perspectives on genetic diversity, a graphic demonstration of principal component analysis (PCA) was carried out to display the multi-dimensional genetic relationship and its partition among specimens using the ClustVis web tool for visualizing clustering of multivariate data . The evolutionary history was inferred by using the maximum likelihood method (ML) based on the Tamura-Nei model . The maximum likelihood (ML) tree was computed using MEGA X software. The bootstrap consensus tree inferred from 1000 replicates was taken and searched for the best-scoring ML tree simultaneously to represent the evolutionary history of the 25 specimens tested.
Performance of ycf1 and ndhF genes identifications
At first, to empirically test the regions identified as most appropriate for barcoding in the plastid coding genes of P. persica, automatic genome annotations were performed among the 25 cpDNA genomes of Rosaceae (Fig. 1). According to original gene annotations analysis and our previous data on this material (data not published), several invariable loci in the analyses were ignored due to inadequate identification, e.g., rbcL, matK, ndhA, ycf2, ycf3, ropC1, rpoC2, rpoB, rps16, clpP, psbB, atpF, atpA, trnK-UUU, and trnH-psbA (data not shown). To circumvent the challenges related to a single-locus approach, this study undertook a two-locus analysis with its overlapping or intergenic spacer (IGS) and insertion/deletion as a useful option in delineating closely related peach sequence variations based on the combination of complete two protein-coding genes ycf1-ndhF of the chloroplast genome.
The position of ycf1 in IRA regions varied from 1037 to 1085 bp (Table 1 and Fig. 2). It is worth noting that the ornamental and cultivated species in our study so-called OMT, CMJ, CDH, and CJX gave similar length with the P. persica and P. Kansuensis of 1061 bp, while a slight lower size variation was observed in the two wild types P. davidiana and P. mira with 1046 bp. By comparison, ndhF gene had a much higher position in SSC regions varied from 2098 to 2234 bp (Table 1 and Fig. 2). However, all cultivated, wild type, and ornamental species in our sampling had different ndhF gene length harboring 2231 bp, except for CMJ cultivar which had a slightly lower sequence length with 2228 bp. Overall, the two-loci ycf1-ndhF ranged from 3268 to 3416 bp in length, showed great sequence variation than the single-locus approach due to the variant overlaps of gene annotation and intergenic regions.
Overlapped and intergenic sequences within ycf1 and ndhF genes
In an in-depth look, another remarkable difference identified the overlapped phenomenon and intergenic sequence region in the IRA/SSC border within ycf1-ndhF genes of the plastid genome. The border between four junctions usually differs among plants showed a slight variation in size among the 25 tested specimens (Fig. 3). As a result, the pseudogenes, ycf1 gene present at the IRa/SSC border and is partially located inside the IR region. Among the 25 specimens tested, only 15 showed a variant overlapped region between ycf1 and ndhF genes with the size ranging from 109 to 124 bp (Table 1). By contrast, the intergenic region was identified within only 7 specimens which harbored IGS sequence ranged from 9 to 59 bp (Table 1). Among all, P. mongolica showed the highest IGS sequence variation harbored 59 bp, followed by P serotina, P. humilis, and P. yedoensis with 24, 23, and 16 bp, respectively. While the three specimens P. maximowiczii, P. serrulata, and P. pseudocerasus contained the lowest IGS with 9 bp in length. By contrast, the rest three specimens of Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia showed the opposite trend with no intergenic region or overlapped ones. Taken together, the two-locus ycf1-ndhF structural domains demonstrated divergence evident in structure variation among the 25 tested specimens.
Sequence divergence of ycf1-ndhF genes
To obtain a comprehensive knowledge on the ycf1 and ndhF sequence divergence among taxa, the averages of nucleotide frequencies were A (33.95%), T/U (33.82%), C (16.12%), and G (16.11%) with an average of AT (33.92%) and GC (16.08%) contents (Table 2). In order to determine the transition/transversion bias (R), the nucleotide substitution pattern was estimated to describe the superior substitution pattern using Kimura 2-parameter analysis with five models (T92+G+I, HKY+G+I, GTR+G+I, TN93+G+I, and K2+G+I). The highest rate of substitutions values (r) for each nucleotide pair was detected in r (GA ± 0.19) and r (CT ± 0.018), revealing high levels of substitutions. By contrast, the lower values of substitution were observed within r (AC; GC; CG; TG ± 0.04), respectively (Table 3). Furthermore, the transition/transversion rate ratios, recorded a higher transition/transversion rate for purine (K1 = 2.57) compared to the transition/transversion rate for pyrimidine (K2 = 2.28). While the overall transition/transversion bias is R = 1.06, which gives support for the dominance of the transitions over transversion in peach germplasm.
Principal component analysis and phylogenetic inference
Both PCA as well as a phylogenetic tree take a sequence data matrix as input where multiple dimensions of ycf1-ndhF genes region data are measured in multiple observations. The PCA plot data as presented in Fig. 4 formed three relatively clustered groups, with the total molecular variation of 65.5% and 19.7%, respectively. The cluster I compressed all ornamental, wild types, and cultivated specimens of peach and almond together with a closer relationship in a particular group, while cluster II assembled jointly all eight members of cherry species in the individual group. Meanwhile, the three species of Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia were separated individually as outgroup near to the PC2 axis.
To ensure the exact relationship between the 25 specimens tested, the phylogenetic tree was constructed based on the ML tree (Fig. 5). All the 22 Prunus specimens were classified into 3 major clades with highly bootstrap value within the peach, almond, and cherry groups. The three P. persica cultivars, OMT, CMJ, and CDH, were clustered with P. persica cv. Lovell formed a monophyletic clade and gathered into a common clade with P. kansuensi and P. davidiana, while P. mira and CJX cultivars were located in the basal position of the first clade confirming a close genetic relationship to peach. Furthermore, all almond and plum species were excluded together in the second monophyletic clade with a high proportion of joint relationship to the peach clade. The cherry group was further divided into two monophyletic groups. This suggested that there is great genetic diversity within cherry. A unifying clade, clade three, comprised the roots of six members of cherry species combining P. takesimensis, P. serrulate, P. maximowiczii, P. yedoensis, P. pseudocerasus, and P. subhirtella, while a black cherry (P. serotina) was placed independently in the basal position of the cherry clade. By contrast, Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia were shared individually as an outgroup of the tree. Herein, our results imply that peach underwent a domestication event that separated the cultivated peach from the wild species and cherry.
In recent years, attention has been paid to the advent of high-throughput sequencing. This technology offers opportunities to gain a more suitable plastid DNA barcode for flowering plants through the comparative analysis of full cp DNA genomes . However, at lower taxonomic levels of flowering plants, the problem is that most of the chloroplast coding region genes have insufficient sequence variation rather than the non-coding regions to resolve inter- and intraspecific relationships . We, therefore, turned our attention to two-loci coding plastid genes ycf1 and ndhF coding genes than expected would also meet the criteria needed for maximum utility as a coding hotspot locus in Prunus. Several earlier articles have been proposed that ycf1 and ndhF are useful information for DNA barcode and subject to positive selective pressure due to high variability [15, 18, 24, 25]. According to the interpretation of the recent plastid data in genus Rosa , ycf1 gene has a conversion of the 543th amino acid, while a frameshift mutation was found in the 3′ regions of ndhF genes with a higher substitution rate (R), resulting in numerous conservative and missense mutation. Mainly, there is extremely low-genetic variation due to the decrease of the substitution rate within the genus . As it is well known, this difference occurs because substituting a single-ring structure for another single-ring structure is more likely than substituting a double ring for a single ring . Here, our results showed a higher transition/transversion (R) rate in the DNA sequence variation with transitions occurred more frequently than transversions. Such variance among the rate of transition and transversion is a foundational principle for studies of molecular phylogeny .
Another striking characteristic is the overlapped phenomenon between the ndhF-ycf1genes; this is because of an unequal size variation or absence of overlapping in the expansion and contraction of the IR region , which indicated fast-evolving events. Previous results in Rosaceae  highlight the sizes of overlapped change from 110 bp in P. pyfifolia to 96 bp in P. persica and with 40 bp in P. rupicola. Our data infer a similar feature with obvious differences was observed ranged from 109 to 124 bp, especially between cultivated and wild types. This concept has gained much acceptance and support through recently plastid genome studies [9, 17, 29, 30].
With more direct interest in our results, a positive association in the IGS region was observed within the two genes, owing to sequence divergence in the cpDNA. Since the border of the IR region of cp genomes occasionally harbor insertion or deletion with a broad trend of IGS sequence, this might have led to higher sequence divergence in this region . It has been well known that the non-coding regions are mostly responsible for the cpDNA genome size variation , thus, providing superior levels of variation and may be valuable for developing DNA barcodes to estimate phylogeny at a subspecies level [7, 31].
Peach taxonomy and phylogeny are often the subjects of controversy and the major obstacle in peach breeding. To verify the sensitivity of our phylogenetic tree, we compared our results with the recent genome evolution study . During the evolutionary history of a certain lineage, we believe that one of the controversial issues raised in P. persica species is the relationship between cultivated and wild taxa. At present, the wild peach germplasm can offer many useful genes for peach improvement. Evidence suggested that P. mira is the oldest progenitor of peach , and it is considered ideal wild peach germplasm for improving cultivated peach plants . It is worth noting that the ornamental cultivar is phylogenetically closely related to the edible cultivar, which supports the previous finding that most of the ornamental peach cultivars originated directly from P. persica . As seen in our chloroplast phylogenomic tree, the cultivated peach species were derived from the three wild species presented in this study, P. kansuensis, P. davidiana, and P. mira. However, P. kansuensis shows a closer relationship with peach cultivars than P. mira. This is consistent with the previous genome resequencing analysis [2, 8], which supported the view that P. kansuensis is closer to P. persica than P. mira and P. davidiana. Several scholars have pointed out that P. davidiana is more primitive than P. kansuensis. This trend was supported by earlier evolutionary of genome re-sequencing in peach , which reveals that compared to P. davidiana and P. dulcis, there are increased inbreeding levels in the three-peach species (P. persica, P. kansuensis, and P. mira). Our data assume P. mira as the most closely related to P. mongolica and P. dulcis, and support the hypothesis of the hybrid origin of peach with almond [3, 34]. Furthermore, the current analyses strongly support the monophyly of P. pseudocerasus as the rootstock for Chinese cherry species .
Under the constraint background, ycf1-ndhF genes recover relationships among Prunus including peach, almond, and cherry, which have a taxonomic group with extremely poor sequence divergence. We believe that discovering further variable coding loci with high evolutionary rates is extremely useful and potential to be used as a coding DNA barcode in P. persica at low taxonomic levels. We tentatively put forward this study that might draw the attention of other scientists who have been working on assessing the evolutionary relationships among peach species.
With the rapid progress of NGS technologies, a large number of cpDNA genome sequences have been developed during the last two decades, which is beneficial for genome evolution and developing several DNA barcodes in plants. The present study highlighted to check the resolution and sensitivity of two DNA barcoding hotspot locus ycf1 and ndhF genes, which can offer a new approach to resolve the phylogeny and systematics for closely associated species in Prunus. Noteworthy, our results revealed that the two-locus ycf1-ndhF was varied from 3268 to 3416 bp in length. We obtained a great sequence variation in the two-locus compared to the single-locus approach due to the significant structure variation, overlaps gene annotation, and intergenic regions. Collectively, our results of the PCA and the phylogenetic tree analysis indicate that accurate monophyletic annotations clade offer obvious classification without overlapping clusters between peach, cherry, and almond. The current study, therefore, recommends the usage of the two barcoding hotspot locus ycf1 and ndhF genes approach in delineating the Prunus genus at the varietal level and species identification.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Large single-copy region
Small single-copy region
Chinese Academy of Sciences
Dual Organellar GenoMe Annotator
Principal component analysis
Arus P, Verde I, Sosinski B, Zhebentyayeva T, Abbott AG (2012) The peach genome. Tree Genet Gen 8:531–547
Cao K, Zheng Z, Wang L, Liu X, Zhu G, Fang W, Cheng S, Zeng P, Chen C, Wang X, Xie M (2014) Comparative population genomics reveals the domestication history of the peach, Prunus persica, and human influences on perennial fruit crops. Genome Biol 15:415
Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F et al (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45:487–494
Jansen RK, Saski CA, Lee S, Hansen AK, Daniell H (2010) Complete plastid genome sequences of three rosids (Castanea, Prunus, Theobroma): evidence for at least two independent transfers of rpl22 to the nucleus. Mol Biol Evol 28:835–847
Khan A, Asaf S, Khan AL, Al-Harrasi A, Al-Sudairy O, AbdulKareem NM, Khan A, Shehzad T, Alsaady N, Al-Lawati A, Al-Rawahi A (2019) First complete chloroplast genomics and comparative phylogenetic analysis of Commiphora gileadensis and C. foliacea: Myrrh producing trees. PLoS One 14(1):e0208511
Jeon, J.H., and Kim, S.C., (2019). Comparative Analysis of the complete chloroplast genome sequences of three closely related East-Asian wild roses (Rosa sect. Synstylae; Rosaceae). Genes 10:23
Li Y, Zhang J, Li L, Gao L, Xu J, Yang M (2018) Structural and comparative analysis of the complete chloroplast genome of pyrus hopeiensis—“wild plants with a tiny population”—and three other pyrus species. Int J Mol Sci 19(10):p.3262
Pervaiz T, Sun X, Zhang Y, Tao R, Zhang J, Fang J (2015) Association between Chloroplast and Mitochondrial DNA sequences in Chinese Prunus genotypes (Prunus persica, Prunus domestica, and Prunus avium). BMC Plant Biol 15:1–10
Wang S, Shi C, Gao LZ (2013) Plastid genome sequence of a wild woody oil species, Prinsepia utilis, provides insights into evolutionary and mutational patterns of Rosaceae chloroplast genomes. PLoS One 8:e73946
Song Y, Dong W, Liu B, Xu C, Yao X, Gao J, Corlett RT (2015) Comparative analysis of complete chloroplast genome sequences of two tropical trees Machilus yunnanensis and Machilus balansae in the family Lauraceae. Front Plant Sci 6:662–662
Amar MH, Hassan AH, Biswas MK, Dulloo E, Xie ZZ, Guo WW (2014) Maximum parsimony based resolution of inter-species phylogenetic relationships in Citrus L. (Rutaceae) using ITS of rDNA. Biotechnol Biotechnol Equip 28:61–67
Korotkova N, Nauheimer L, Ter-Voskanyan H, Allgaier M, Borsch T (2014) Variability among the most rapidly evolving plastid genomic regions is lineage-specific: implications of pairwise genome comparisons in Pyrus (Rosaceae) and other angiosperms for marker choice. PLoS One 9(11):e112998
Meng D, Xiaomei Z, Wenzhen K, Xu Z (2019) Detecting useful genetic markers and reconstructing the phylogeny of an important medicinal resource plant, Artemisia selengensis , based on chloroplast genomics. PLoS One 14(2):e0211340
The author declares that he has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Amar, M.H. ycf1-ndhF genes, the most promising plastid genomic barcode, sheds light on phylogeny at low taxonomic levels in Prunus persica.
J Genet Eng Biotechnol18, 42 (2020). https://doi.org/10.1186/s43141-020-00057-3