ycf1-ndhF genes, the most promising plastid genomic barcode, sheds light on phylogeny at low taxonomic levels in Prunus persica

Background Chloroplast genome sequencing is becoming a valuable process for developing several DNA barcodes. At present, plastid DNA barcode for systematics and evolution in flowering plant rely heavily on the use of non-coding genes. The present study was performed to verify the novelty and suitability of the two hotspot barcode plastid coding gene ycf1 and ndhF, to estimate the rate of molecular evolution in the Prunus genus at low taxonomic levels. Results Here, 25 chloroplast genomes of Prunus genus were selected for sequences annotation to search for the highly variable coding DNA barcode regions. Among them, 5 genera were of our own data, including the ornamental, cultivated, and wild haplotype, while 20 genera have been downloaded from the GenBank database. The results indicated that the two hotspot plastid gene ycf1 and ndhF were the most variable regions within the coding genes in Prunus with an average of 3268 to 3416 bp in length, which have been predicted to have the highest nucleotide diversity, with the overall transition/transversion bias (R = 1.06). The ycf1-ndhF structural domains showed a positive trend evident in structure variation among the 25 specimens tested, due to the variant overlap’s gene annotation and insertion or deletion with a broad trend of the full form of IGS sequence. As a result, the principal component analysis (PCA) and the ML tree data drew an accurate monophyletic annotations cluster in Prunus species, offering unambiguous identification without overlapping groups between peach, almond, and cherry. Conclusion To this end, we put forward the domain of the two-locus ycf1-ndhF genes as the most promising coding plastid DNA barcode in P. persica at low taxonomic levels. We believe that the discovering of further variable loci with high evolutionary rates is extremely useful and potential uses as a DNA barcode in P. persica for further phylogeny study and species identification.

of peach [3] along with the assembly of Chloroplast DNA (cpDNA) genome data of P. persica cv. Lovell [4], provided a new era for the development of comparative cpDNA genome studies and the discovery of DNA barcode genes in peach. Fortunately, the decreasing cost of high-throughput sequencing of the cp genome offers opportunities to gain more cp genome sequences and discover useful particular DNA barcode of coding and noncoding plastid genes in Prunus species [5].
It is well known that the chloroplasts are the photosynthetic organelles in plant cells, provide energy to green plants and plays an important role in sustaining life [6]. The chloroplast genome is the third-largest genome after the nuclear genome and mitochondria, which encodes many key proteins that are involved in photosynthesis and other metabolic processes [7]. The chloroplast genomes have different features, e.g., maternal inheritance in most angiosperms, and high conservation in genome structure and gene contents [8]. The cpDNA genome is extremely conserved with a self-replicating circular molecule and has a typical quadripartite structure, in which two inverted repeats (IRs) are separated by a large single-copy region (LSC) and a small singlecopy region (SSC) [9]. Most of the cpDNA contains approximately 110-133 genes, including protein-coding genes (CDS), ribosomal RNA genes, and transfer RNA genes. The non-coding and coding regions of the chloroplast genome had a diverse signature at both high and low taxonomic levels, making them appropriate for systematic and evolution studies [10]. Albeit, the noncoding region has less functional constraint than the coding region, it offers superior levels of evolutionary rate for phylogenetic and barcoding studies at the subspecies level, while the coding region is highly conserved and suitable only for higher taxonomic levels [11].
In the past decade, the two protein-coding genes matK and rbcL were chosen as core plant DNA barcodes [12], while other protein-coding genes, like atpF-H, psbK-I, ropC1, and rpoB, are lacking resolution and have been recommended as supplemental barcodes in diversity within flowering plants [13]. Unfortunately, the discrimination power of discovered barcodes coding genes is too weak to drive through all species, especially in higher plants [14]. Hence, there are no universal barcode loci neither for all plants nor for Prunus species. Dong et al. [12] proposed that ycf1 is the most promising plastid DNA barcode for land plants and plays an important role in genome evolution. Other evidence supposed that among the protein-coding genes, ycf1 and ndhF are appreciated sources of phylogenetic relationship provide effective information and DNA barcodes-based cpDNA genome for phylogeny and species identification in breeding resources [15][16][17]. Later, Jeon and Kim [9] suggested that the combination of two-locus ycf1 and ndhF genes is beneficial for deciphering phylogenetic relationships between closely related taxa in Rosaceae. Although these two hot spot genes have superior efficiency in discrimination at the low taxonomic level, still little attention for DNA barcoding and molecular evolution purposes are received [12,18].
To date, with the documented deficiency of the ycf1-ndhF coding genes in Prunus, we reported for the first time a detailed overview of these hotspot regions to investigate evolutionary relationships between 25 Prunus, Malus, and Pyrus species. Through this research, the performance and efficiency of ycf1-ndhF genes were evaluated as hotspot regions for DNA barcoding and biodiversity which may be helpful in future breeding programs of peach. For this purpose, we achieved a comparative structure variation, the overlapping of ycf1-ndhF genes sequences, and phylogeny analysis within the species level of the ornamental, cultivated, and wild haplotype of P. persica.

Plant materials and DNA extraction
Peach specimens (P. persica) used in this study contained three edible cultivars, one ornamental cultivar, and a wild relative P. mira (Table 1). These five specimens were collected from the field gene bank of Chinese Academy of Sciences (CAS), Wuhan, China, in the juvenile stage in the spring season. Total genomic DNA was extracted from 100 mg of fresh leaves using Plant Genomic DNA Extraction Kit (DP305-03, Tiangen Biotech, Beijing, China) according to the manufacturer's instructions. The DNA quantity was assessed using a spectrophotometer (Nanodrop 2000, Thermo Fischer, USA). Both the stock and diluted portions were stored at -80°C until use.

DNA sequencing, genome assembly, and validation
The Illumina HiSeq 2500 platform was used to sequence the total DNA of the five studied specimens. After sequencing, the raw data was initially screened to remove low-quality regions affecting the data quality and subsequent analysis needed to obtain the expected clean data. The cpDNA genome was assembled by mapping onto the public complete chloroplast genome of P. persica cv. Lovell (GenBank accession HQ336405) [4], and the genome assembly and alignment analyses were performed using Geneious R10 program (http://www.geneious.com; Biomatters Ltd., Auckland, New Zealand).

Genome annotation and analysis
In the present study, 25 cpDNA genomes were used for annotation, including the 5 peach specimens from our materials, in addition to the 20 cpDNA genomes that were downloaded from NCBI GenBank database. However, Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia were used as outgroup. These 25 cpDNA genomes representatives all major indigenous of Prunus species. Gene annotation of the 25 cpDNA genomes was performed with the online program Dual Organellar Gen-oMe Annotator (DOGMA) [19]. Initial annotation, putative starts, stops, and intron positions were determined, and then the draft annotation was inspected and corrected manually by comparison with a homologous gene with the chloroplast genome of P. persica (NC_ 014697) from the NCBI database.
Identification of ycf1-ndhF genes, sequence editing and alignment The ycf1-ndhF genes were obtained from the 25 cpDNA genomes using DOGMA analysis to compare the structure sequence, and the multiple sequence alignment was done using MUSCLE v3.70+ fix1-2 [20], and manually adjusted, as necessary. Nucleotide diversity (π), estimated values of transition/transversion bias (R), and nucleotide substitutions (r) for each sequence were performed using MEGA X program [21].

Phylogenetic inference
The analysis of the consensus phylogenetic tree was performed using 25 nucleotide sequences of ycf1-ndhF genes, including 22 species of Prunus in addition to the 3 species for Pyrus and Malus as an outgroup (Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia). To gain accurate perspectives on genetic diversity, a graphic demonstration of principal component analysis (PCA) was carried out to display the multi-dimensional genetic relationship and its partition among specimens using the ClustVis web tool for visualizing clustering of multivariate data [22]. The evolutionary history was inferred by using the maximum likelihood method (ML) based on the Tamura-Nei model [23]. The maximum likelihood (ML) tree was computed using MEGA X software. The bootstrap consensus tree inferred from 1000 replicates was taken and searched for the best-scoring ML tree simultaneously to represent the evolutionary history of the 25 specimens tested.

Performance of ycf1 and ndhF genes identifications
At first, to empirically test the regions identified as most appropriate for barcoding in the plastid coding genes of P. persica, automatic genome annotations were performed among the 25 cpDNA genomes of Rosaceae (Fig. 1). According to original gene annotations analysis and our previous data on this material (data not published), several invariable loci in the analyses were ignored due to inadequate identification, e.g., rbcL, matK, ndhA, ycf2, ycf3, ropC1, rpoC2, rpoB, rps16, clpP, psbB, atpF, atpA, trnK-UUU, and trnH-psbA (data not shown). To circumvent the challenges related to a single-locus approach, this study undertook a two-locus analysis with its overlapping or intergenic spacer (IGS) and insertion/deletion as a useful option in delineating closely related peach sequence variations based on the combination of complete two proteincoding genes ycf1-ndhF of the chloroplast genome. The position of ycf1 in IRA regions varied from 1037 to 1085 bp (Table 1 and Fig. 2). It is worth noting that the ornamental and cultivated species in our study so-called OMT, CMJ, CDH, and CJX gave similar length with the P. persica and P. Kansuensis of 1061 bp, while a slight lower size variation was observed in the two wild types P. davidiana and P. mira with 1046 bp. By comparison, ndhF gene had a much higher position in SSC regions varied from 2098 to 2234 bp ( Table 1 and Fig. 2). However, all cultivated, wild type, and ornamental species in our sampling had different ndhF gene length harboring 2231 bp, except for CMJ cultivar which had a slightly lower sequence length with 2228 bp. Overall, the two-loci ycf1-ndhF ranged from 3268 to 3416 bp in length, showed great sequence variation than the singlelocus approach due to the variant overlaps of gene annotation and intergenic regions. Overlapped and intergenic sequences within ycf1 and ndhF genes In an in-depth look, another remarkable difference identified the overlapped phenomenon and intergenic sequence region in the IRA/SSC border within ycf1-ndhF genes of the plastid genome. The border between four junctions usually differs among plants showed a slight variation in size among the 25 tested specimens (Fig. 3). As a result, the pseudogenes, ycf1 gene present at the IRa/SSC border and is partially located inside the IR region. Among the 25 specimens tested, only 15 showed a variant overlapped region between ycf1 and ndhF genes with the size ranging from 109 to 124 bp (Table 1). By contrast, the intergenic region was identified within only 7 specimens which harbored IGS sequence ranged from 9 to 59 bp (Table 1). Among all, P. mongolica showed the highest IGS sequence variation harbored 59 bp, followed by P serotina, P. humilis, and P. yedoensis with 24, 23, and 16 bp, respectively. While the three specimens P. maximowiczii, P. serrulata, and P. pseudocerasus contained the lowest IGS with 9 bp in length. By contrast, the rest three specimens of Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia showed the opposite trend with no intergenic region or overlapped ones. Taken together, the two-locus ycf1-ndhF structural domains demonstrated divergence evident in structure variation among the 25 tested specimens.

Sequence divergence of ycf1-ndhF genes
To obtain a comprehensive knowledge on the ycf1 and ndhF sequence divergence among taxa, the averages of nucleotide frequencies were A (33.95%), T/U (33.82%), C (16.12%), and G (16.11%) with an average of AT (33.92%) and GC (16.08%) contents (Table 2). In order to determine the transition/transversion bias (R), the nucleotide substitution pattern was estimated to describe the superior substitution pattern using Kimura 2parameter analysis with five models (T92+G+I, HKY+ G+I, GTR+G+I, TN93+G+I, and K2+G+I). The highest rate of substitutions values (r) for each nucleotide pair was detected in r (GA ± 0.19) and r (CT ± 0.018), revealing high levels of substitutions. By contrast, the lower values of substitution were observed within r (AC; GC; CG; TG ± 0.04), respectively (Table 3). Furthermore, the transition/transversion rate ratios, recorded a higher transition/transversion rate for purine (K1 = 2.57) compared to the transition/transversion rate for pyrimidine (K2 = 2.28). While the overall transition/transversion bias is R = 1.06, which gives support for the dominance of the transitions over transversion in peach germplasm.

Principal component analysis and phylogenetic inference
Both PCA as well as a phylogenetic tree take a sequence data matrix as input where multiple dimensions of ycf1-ndhF genes region data are measured in multiple observations. The PCA plot data as presented in Fig. 4 formed  To ensure the exact relationship between the 25 specimens tested, the phylogenetic tree was constructed based on the ML tree (Fig. 5). All the 22 Prunus specimens were classified into 3 major clades with highly bootstrap value within the peach, almond, and cherry groups. The three P. persica cultivars, OMT, CMJ, and CDH, were clustered with P. persica cv. Lovell formed a monophyletic clade and gathered into a common clade with P. kansuensi and P. davidiana, while P. mira and CJX cultivars were located in the basal position of the first clade confirming a close genetic relationship to peach. Furthermore, all almond and plum species were excluded together in the second monophyletic clade with a high proportion of joint relationship to the peach clade. The cherry group was further divided into two monophyletic groups. This suggested that there is great genetic diversity within cherry. A unifying clade, clade three, comprised the roots of six members of cherry species combining P. takesimensis, P. serrulate, P. maximowiczii, P. yedoensis, P. pseudocerasus, and P. subhirtella, while a black cherry (P. serotina) was placed independently in the basal position of the cherry clade. By contrast, Pyrus pyrifolia, Pyrus spinosa, and Malus prunifolia were shared individually as an outgroup of the tree. Herein, our results imply that peach underwent a domestication event that separated the cultivated peach from the wild species and cherry.

Discussion
In recent years, attention has been paid to the advent of high-throughput sequencing. This technology offers opportunities to gain a more suitable plastid DNA barcode for flowering plants through the comparative analysis of full cp DNA genomes [7]. However, at lower taxonomic levels of flowering plants, the problem is that most of the chloroplast coding region genes have insufficient sequence variation rather than the non-coding regions to resolve inter-and intraspecific relationships [18]. We, therefore, turned our attention to two-loci coding plastid genes ycf1 and ndhF coding genes than expected would also meet the criteria needed for maximum utility as a coding hotspot locus in Prunus. Several earlier articles have been proposed that ycf1 and ndhF are useful information for DNA barcode and subject to positive selective pressure due to high variability [15,18,24,25]. According to the interpretation of the recent plastid data in genus Rosa [9], ycf1 gene has a conversion of the 543th amino acid, while a frameshift mutation was found in the 3′ regions of ndhF genes with a higher substitution rate (R), resulting in numerous conservative and missense mutation. Mainly, there is extremely lowgenetic variation due to the decrease of the substitution rate within the genus [26]. As it is well known, this difference occurs because substituting a single-ring structure for another single-ring structure is more likely than substituting a double ring for a single ring [27]. Here, our results showed a higher transition/transversion (R) rate in the DNA sequence variation with transitions occurred more frequently than transversions. Such variance among the rate of transition and transversion is a foundational principle for studies of molecular phylogeny [28]. Another striking characteristic is the overlapped phenomenon between the ndhF-ycf1genes; this is because of an unequal size variation or absence of overlapping in the expansion and contraction of the IR region [16], which indicated fast-evolving events. Previous results in Rosaceae [15] highlight the sizes of overlapped change from 110 bp in P. pyfifolia to 96 bp in P. persica and with 40 bp in P. rupicola. Our data infer a similar feature with obvious differences was observed ranged from 109 to 124 bp, especially between cultivated and wild types. This concept has  gained much acceptance and support through recently plastid genome studies [9,17,29,30]. With more direct interest in our results, a positive association in the IGS region was observed within the two genes, owing to sequence divergence in the cpDNA. Since the border of the IR region of cp genomes occasionally harbor insertion or deletion with a broad trend of IGS sequence, this might have led to higher sequence divergence in this region [30]. It has been well known that the non-coding regions are mostly responsible for the cpDNA genome size variation [11], thus, providing superior levels of variation and may be valuable for developing DNA barcodes to estimate phylogeny at a subspecies level [7,31].
Peach taxonomy and phylogeny are often the subjects of controversy and the major obstacle in peach breeding. To verify the sensitivity of our phylogenetic tree, we compared our results with the recent genome evolution study [8]. During the evolutionary history of a certain lineage, we believe that one of the controversial issues raised in P. persica species is the relationship between cultivated and wild taxa. At present, the wild peach germplasm can offer many useful genes for peach improvement. Evidence suggested that P. mira is the oldest progenitor of peach [2], and it is considered ideal wild peach germplasm for improving cultivated peach plants [32]. It is worth noting that the ornamental cultivar is phylogenetically closely related to the edible cultivar, which supports the previous finding that most of the ornamental peach cultivars originated directly from P. persica [33]. As seen in our chloroplast phylogenomic tree, the cultivated peach species were derived from the three wild species presented in this study, P. kansuensis, P. davidiana, and P. mira. However, P. kansuensis shows a closer relationship with peach cultivars than P. mira. This is consistent with the previous genome resequencing analysis [2,8], which supported the view that P. kansuensis is closer to P. persica than P. mira and P. davidiana. Several scholars have pointed out that P. davidiana is more primitive than P. kansuensis. This trend was supported by earlier evolutionary of genome re-sequencing in peach [8], which reveals that compared to P. davidiana and P. dulcis, there are increased inbreeding levels in the three-peach species (P. persica, P. kansuensis, and P. mira). Our data assume P. mira as the most closely related to P. mongolica and P. dulcis, and support the hypothesis of the hybrid origin of peach with almond [3,34]. Furthermore, the current analyses strongly support the monophyly of P. pseudocerasus as the rootstock for Chinese cherry species [35].
Under the constraint background, ycf1-ndhF genes recover relationships among Prunus including peach, almond, and cherry, which have a taxonomic group with extremely poor sequence divergence. We believe that discovering further variable coding loci with high evolutionary rates is extremely useful and potential to be used as a coding DNA barcode in P. persica at low taxonomic levels. We tentatively put forward this study that might draw the attention of other scientists who have been working on assessing the evolutionary relationships among peach species.

Conclusion
With the rapid progress of NGS technologies, a large number of cpDNA genome sequences have been developed during the last two decades, which is beneficial for genome evolution and developing several DNA barcodes in plants. The present study highlighted to check the resolution and sensitivity of two DNA barcoding hotspot locus ycf1 and ndhF genes, which can offer a new approach to resolve the phylogeny and systematics for closely associated species in Prunus. Noteworthy, our results revealed that the two-locus ycf1-ndhF was varied from 3268 to 3416 bp in length. We obtained a great sequence variation in the two-locus compared to the single-locus approach due to the significant structure variation, overlaps gene annotation, and intergenic regions. Collectively, our results of the PCA and the phylogenetic tree analysis indicate that accurate monophyletic annotations clade offer obvious classification without overlapping clusters between peach, cherry, and almond. The current study, therefore, recommends the usage of the two barcoding hotspot locus ycf1 and ndhF genes approach in delineating the Prunus genus at the varietal level and species identification.