Chloroplast genomic analyses of Rosa laevigata, R. rugosa and R. canina

Background Many species of the genus Rosa have been used as ornamental plants and traditional medicines. However, industrial development of roses is hampered by their highly divergent characteristics. Methods We analyzed the chloroplast (cp) genomes of Rosa laevigata, R. rugosa and R. canina, including the repeat sequences, inverted-repeat (IR) contractions and expansions, and mutation sites. Results The cp genomes of R. laevigata, R. rugosa and R. canina ranged in size from 156 333 bp to 156 533 bp and contained 113 genes (30 tRNA genes, 4 rRNA genes and 79 protein-coding genes). The regions with a higher degree of variation were screened out (trnH-GUU, trnS-GCU, trnG-GCC, psbA-trnH, trnC-GCA, petN, trnT-GGU, psbD, petA, psbJ, ndhF, rpl32, psaC and ndhE). Such higher-resolution loci lay the foundation for barcode-based identification in the genus Rosa. A phylogenetic tree of the genus Rosa was reconstructed using the full sequences of the cp genomes. The results were largely in accordance with the current taxonomic status of Rosa. Conclusions Our data: (i) reveal that cp genomes can be used for the identification and classification of Rosa species; (ii) can aid studies on molecular identification, genetic transformation, expression of secondary metabolic pathways and resistant proteins; (iii) can lay a theoretical foundation for the discovery of disease-resistance genes and cultivation of Rosa species.


Background
Rosaceae is a large and diverse family with 100 genera and 3000 species. Rosa is a typical genus of the Rosaceae family. Rosa chinensis Jacq., Rosa laevigata Michx. and Rosa rugosa Thunb. are documented in the 2015 edition of the Chinese Pharmacopoeia [1].
Plants of the genus Rosa are distributed in the temperate and subtropical regions of the Northern Hemisphere [2,3]. The genus Rosa has recently garnered increasing attention as a source of medicinal agents [4][5][6]. Given the potential economic and medicinal value of roses, it is important to understand the genetic relationships among species for future application of germplasm resources.
In conventional taxonomy, the genus Rosa is divided into four subgenera (Hulthemia, Rosa, Platyrhodon, and Hesperhodos), and the subgenus Rosa is divided further into 10 sections (Pimpinellifoliae, Gallicanae, Caninae, Carolinae, Rosa, Synstylae, Chinenses [syn. Indicae], Banksianae, Laevigatae, and Bracteatae) [7,8]. Despite numerous recent studies examining phylogenetic relationships in the genus Rosa, these relationships remain obscure because: (i) hybridization occurs in nature and in the garden, and levels of chloroplast and nuclear genome variation are low [9][10][11]; (ii) phylogenetic analyses based on only a small number of non-coding chloroplast sequences show low internal resolution [12,13]. Rosa laevigata, R. rugosa and R. canina have been employed in traditional Chinese medicine (TCM) formulations. However, several sympatric species of Rosa have also been used in TCM formulations, and this diversity of source species can severely affect the quality and safety of medicinal materials.
Chloroplasts are the descendants of ancient bacterial endosymbionts. They are the common organelles of green plants, and have an essential role in photosynthesis [14]. In general, inheritance of the cp genome is paternal in gymnosperms, but maternal in angiosperms [15]. The cp genome is conserved in structure and contains a large single-copy (LSC) region, a small single-copy (SSC) region, and two inverted repeat (IR) regions. The cp genome is an ideal research model for the study of molecular identification, phylogeny, species conservation, and genome evolution [16,17]. Over the last decade, researchers have gained a more in-depth understanding of chloroplasts, including their origin, structure, evolution, genetic engineering, as well as forward and reverse genetics [18][19][20]. In addition, the development of sequencing technology has greatly promoted chloroplast research [21,22]; it now generates massive amounts of chloroplast genome sequence data, which helps to resolve previously unresolved relationships. It also provides genomic information, such as structure, gene order, gene content, and mutations, that is critical for species identification [23][24][25][26][27].
In previous studies, chloroplast genomes provided effective information for identifying Rosa species [28,29]. The chloroplast genomes of two species of the genus Rosa recorded in the 2015 Chinese Pharmacopoeia, R. chinensis and R. rugosa, have been published. In the present study, the remaining recorded species of the genus Rosa in the Chinese Pharmacopoeia, including two used in TCM (R. laevigata, R. rugosa), and a traditional medicine used worldwide (R. canina) were identified based on the chloroplast genome. The structural characteristics, phylogenetic relationships, and interspecific divergence among R. laevigata, R. rugosa and R. canina were documented.

DNA sequencing, assembly and validation of the cp genome
Fresh leaves of R. laevigata and R. rugosa plants were collected in Shennongjia (Hubei Province, China). Dried flowers of R. canina were purchased at a medicinal market in Beijing, China. The cetyltrimethylammonium-bromide method was used to extract total genomic DNA from the samples [30]. The DNA concentration was measured using a ND-2000 spectrometer (NanoDrop Technologies, Wilmington, DE, USA). A shotgun library (250 bp) was constructed according to the manufacturer's instructions (Vazyme Biotech, Nanjing, China).
Sequencing was accomplished with the X™ Ten platform (Illumina, San Diego, CA, USA) using paired-end sequencing (paired-end 150). The amount of raw data from the sample was 5.0 G, and > 34 million paired-end reads were obtained.
Raw data were filtered by Skewer-0.2.2 [31]. Chloroplast-like reads were identified among the clean reads by BLAST [32] searches against the reference sequence of Rosa chinensis. Then, the cp reads were assembled using SOAPdenovo-2.04 [33]. Finally, sequences were extended and gaps filled with SSPACE-3.0 and GapCloser-1.12 [34,35]. To validate the accuracy of the assembly at the region junctions, random primers were designed to test the four junctions of the sequence by polymerase chain reaction.

Gene annotation and sequence analyses
Sequence annotation was achieved by CpGAVAS [36]. DOGMA (http://dogma.ccbb.utexas.edu/) and BLAST were used to check the results of annotation [37]. All tRNA genes were detected by tRNAscan-SE v1.21 with default settings [38]. The structural features of the cp genome were drawn by OGDRAW v1.2 [39]. MEGA5.2 was used to determine the relative synonymous codon usage (RSCU) [40].
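The relative use of synonymous codons can be illustrated with a minimal Python sketch. It assumes the usual definition of RSCU: the observed count of a codon divided by the mean count of all codons that encode the same amino acid. The codon table is deliberately truncated to two amino acids, and the function name rscu and the toy sequence are hypothetical; this is not the MEGA implementation.

```python
from collections import Counter

# Truncated, illustrative codon families (standard genetic code).
# A real analysis would include every amino acid.
CODON_FAMILIES = {
    "Phe": ["TTT", "TTC"],
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
}


def rscu(cds: str) -> dict:
    """Return RSCU values for the codons listed in CODON_FAMILIES.

    The input is assumed to be an in-frame coding sequence.
    """
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    values = {}
    for family in CODON_FAMILIES.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent, RSCU undefined
        expected = total / len(family)  # count expected under equal usage
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Toy CDS: TTT TTT TTC GGA GGA GGA GGT
example = rscu("TTTTTTTTCGGAGGAGGAGGT")
```

Values above 1 indicate codons used more often than expected under equal usage of synonymous codons, and values below 1 indicate codons used less often.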

Comparison of cp genomes
The cp genomes of Rosa species were compared by mVISTA [41] (Shuffle-LAGAN mode) using the genome of R. chinensis as the reference. Tandem Repeats Finder [42] was used to detect tandem repeats, and forward and palindromic repeats were detected by REPuter [43]. Detection of simple sequence repeats (SSRs) was done by Misa.pl [44] with the following minimum numbers of repeat units: mononucleotides ≥ 10, dinucleotides ≥ 8, trinucleotides and tetranucleotides ≥ 4, and pentanucleotides and hexanucleotides ≥ 3.
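The SSR thresholds above can be sketched in Python with a regular-expression scan. This is a simplified illustration of perfect-repeat detection under the stated minimum repeat numbers, not the Misa.pl code; the names MIN_REPEATS, is_primitive and find_ssrs, and the toy sequence, are hypothetical. Motifs that are themselves repeats of a shorter unit (for example "AA") are skipped so that a mononucleotide run is not also reported as a dinucleotide SSR.

```python
import re

# Minimum number of repeat units per motif length, taken from the text.
MIN_REPEATS = {1: 10, 2: 8, 3: 4, 4: 4, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq: str) -> list:
    """Return sorted (start, motif, repeat_count) tuples (0-based start)."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence containing A x12, AT x8 and AGC x4.
toy = "GC" + "A" * 12 + "GCGC" + "AT" * 8 + "CCG" + "AGC" * 4 + "TT"
ssrs = find_ssrs(toy)
```

Because the scan is non-overlapping, a skipped non-primitive match can in rare cases mask an adjacent SSR; a production tool handles such cases more carefully.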

Phylogenetic analyses
Phylogenetic trees were constructed using the genomic sequences of 21 chloroplasts. The sequences were aligned using ClustalW2. Construction of an unrooted phylogenetic tree was achieved using the neighbor-joining (NJ) approach in MEGA5.2 [40] with 1000 bootstrap replicates. Hibiscus rosa-sinensis was set as the outgroup.
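Two ingredients of a bootstrapped distance-based analysis can be sketched in Python: a pairwise distance matrix of the kind that NJ takes as input, and the resampling of alignment columns that produces one bootstrap replicate. The sketch assumes the simple p-distance (proportion of differing sites); the substitution model actually used in MEGA5.2 is not stated here, and the function names and toy alignment are hypothetical. Tree construction itself is left to dedicated software.

```python
import random


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (name, name)."""
    names = list(alignment)
    return {
        (p, q): p_distance(alignment[p], alignment[q])
        for p in names
        for q in names
    }


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement (one replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment of three taxa.
toy = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAA",
    "taxon_C": "ACGAACGTTA",
}
dist = distance_matrix(toy)
replicate = bootstrap_alignment(toy, random.Random(0))
```

With 1000 replicates, a tree is built from each resampled alignment, and the support for a clade is the percentage of replicate trees in which it appears.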

DNA features of the chloroplasts of R. laevigata, R. rugosa and R. canina
The size of the cp genomes ranged from 156 333 bp to 156 533 bp. The largest cp genome was that of R. rugosa (156 533 bp) and the smallest was that of R. laevigata (156 333 bp). The total guanine + cytosine (G + C) content of each of the three genomes was 37.3%.
R. laevigata, R. rugosa and R. canina had cp genomes with a similar structure: an LSC region, an SSC region, and a pair of inverted repeats (IRA/IRB). For R. laevigata, R. rugosa and R. canina, the length of the LSC region of the cp genome varied from 85 452 bp to 85 657 bp, and its G + C content from 35.2% to 35.3%; the length of the SSC region varied from 18 742 bp to 18 785 bp, and its G + C content from 31.3% to 31.4%. The IR region had a length of 26 048 bp to 26 053 bp, and its G + C content was 42.7% (Table 1). The DNA G + C content is an important indicator of species affinity [45], and R. laevigata, R. rugosa and R. canina have highly similar cpDNA G + C content. The DNA G + C content of the IR regions was higher than that of the LSC and SSC regions, which is similar to that seen in other angiosperms [46]. In general, the relatively high DNA G + C content of the IR regions is attributable to rRNA genes and tRNA genes [47,48]. After annotation, the sequences of the whole cp genomes of R. laevigata, R. rugosa and R. canina were submitted to the National Center for Biotechnology Information (NCBI) database; the GenBank accession numbers are listed in Table 1.
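The relationship between the regional and total G + C contents can be checked with a short Python sketch. The whole-genome value is the length-weighted mean of the LSC, SSC and two IR values. Because Table 1 is not reproduced here, the sketch pairs the lower-bound lengths and G + C values from the text as an illustrative assumption; it does not describe any single species, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def weighted_gc(regions) -> float:
    """Length-weighted mean G + C percentage from (length, gc_percent) pairs."""
    total_length = sum(length for length, _ in regions)
    return sum(length * gc for length, gc in regions) / total_length


# Assumed pairing of the lower-bound values quoted in the text.
regions = [
    (85452, 35.2),  # LSC
    (18742, 31.3),  # SSC
    (26048, 42.7),  # IRa
    (26048, 42.7),  # IRb
]
genome_gc = weighted_gc(regions)
```

Under this assumed pairing the weighted mean falls close to the reported total of 37.3%, which shows how the G + C-rich IR regions raise the genome-wide value above that of the single-copy regions.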
A physical map of the cp genomes of R. laevigata, R. rugosa and R. canina was drawn according to annotation results using OGDRAW [39] (Fig. 1). A total of 113 genes were contained in the cp genome of R. laevigata, R. rugosa and R. canina: four rRNA genes, 30 tRNA genes, and 79 protein-coding genes (Table 2). Most genes could be divided crudely into three groups: "self-replication-related", "photosynthesis-related", and "other" (Table 2) [49].
Among all predicted genes of the cp genomes of R. laevigata, R. rugosa and R. canina, introns were found in 17 genes: six tRNA genes and 11 protein-coding genes (Table 3). The tRNA genes with introns were trnK-UUU, trnL-UAA, trnV-UAC, trnI-GAU, trnG-UCC and trnA-UGC. The 11 protein-coding genes with introns were rps12, rps16, rpl16, rpl2, rpoC1, ndhA, ndhB, ycf3, petB, clpP and petD. Three of the 17 intron-containing genes (rps12, ycf3, clpP) contained three introns. The remaining genes contained only one intron. Of these, trnK-UUU contained the largest intron (2500 bp), which contained the whole matK gene. Similar to other angiosperms, rps12 in the chloroplasts of R. laevigata, R. rugosa and R. canina was trans-spliced. The 5′ end of rps12 was in the LSC region, and the 3′ end was in the IR region.

Analyses of long repetitive sequences and SSRs
For R. laevigata, R. rugosa and R. canina, interspersed repeated sequences (IRSs) were evaluated in the cp genomes with a repeat-unit length of ≥ 30 bp. These comprised forward repeats, reverse repeats, complementary repeats, and palindromic repeats. Fifty IRSs were found in R. rugosa, 60 IRSs in R. laevigata and 50 IRSs in R. canina. Among all types of IRS, sequence lengths of 20-29 bp occurred most frequently. IRS analyses of the cp genomes of R. laevigata, R. rugosa and R. canina are shown in Fig. 2.
SSRs are prone to slipped-strand mispairing, which is a key mutational mechanism for generating SSR polymorphisms [50]. SSRs in the cp genome are variable at the intra-specific level, so they are used regularly as genetic markers in studies of evolution and population genetics [51][52][53]. We found 63 SSRs in R. rugosa, 62 SSRs in R. canina, and 65 SSRs in R. laevigata (Fig. 3).
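The four repeat types can be distinguished by how the second copy relates to the first. The following Python sketch classifies a pair of equal-length sequences under the usual definitions: a forward repeat is an identical copy, a reverse repeat is the reversed sequence, a complementary repeat is the complement, and a palindromic repeat is the reverse complement. It illustrates the definitions only and is not the search algorithm of any repeat-finding program; the function name classify_repeat is hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first: str, second: str) -> str:
    """Classify the relationship of a second copy to the first copy."""
    first, second = first.upper(), second.upper()
    complement = first.translate(COMPLEMENT)
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == complement:
        return "complement"
    if second == complement[::-1]:
        return "palindromic"
    return "none"


# Short toy pairs; real IRS units in this study are much longer.
labels = [
    classify_repeat("AACGT", "AACGT"),
    classify_repeat("AACGT", "TGCAA"),
    classify_repeat("AACGT", "TTGCA"),
    classify_repeat("AACGT", "ACGTT"),
]
```

The sketch compares exact matches only, whereas repeat-finding tools typically also allow a limited number of mismatches between copies.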

Genomic sequences
To ascertain differences in the genomic sequences of the chloroplasts of R. laevigata, R. rugosa and R. canina, we used the sequence of R. chinensis as a reference (Fig. 4). Variability in the IR regions of the cp genomes was considerably lower than that in the LSC and SSC regions. In addition, most of the protein-coding genes of the chloroplasts were highly conserved, although a few (e.g., rps19, petB, and ycf2) showed large variation. Regions with a higher degree of variation among chloroplast genomic sequences were usually located in intergenic regions, such as the spacers for: trnH-GUU; trnS-GCU and trnG-GCC; psbA-trnH, trnC-GCA and petN; trnT-GGU and psbD; petA and psbJ; ndhF and rpl32; psaC and ndhE. Identification of such higher-resolution loci is necessary for their use as barcodes for species identification.
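The idea of locating variable regions against a reference can be sketched as a sliding-window identity profile over a pairwise alignment. This is a simplified illustration in Python of the kind of profile that a genome comparison plot displays, not the mVISTA algorithm; the function name window_identity, the window and step sizes, and the toy sequences are assumptions.

```python
def window_identity(ref: str, query: str, window: int = 100, step: int = 50):
    """Yield (start, percent_identity) for windows over two aligned sequences.

    Gap columns ("-") are counted as mismatches.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    for start in range(0, len(ref) - window + 1, step):
        r = ref[start:start + window]
        q = query[start:start + window]
        matches = sum(a == b and a != "-" for a, b in zip(r, q))
        yield start, 100.0 * matches / window


# Toy alignment: the query differs from the reference at positions 12 and 13.
ref = "ACGT" * 5
query = ref[:12] + "TT" + ref[14:]
profile = list(window_identity(ref, query, window=10, step=5))
```

Windows with low identity mark candidate hypervariable regions, such as the intergenic spacers listed above, that may be worth evaluating as barcodes.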

Comparison of IR regions in the cp genomes of R. laevigata, R. rugosa, R. canina and R. chinensis
Gene location was relatively conserved in R. laevigata, R. rugosa, R. canina and R. chinensis. In these four species, rps19 was located in the LSC region, rpl2 in the IRa region, and ndhF in the SSC region. However, the coding region of ycf1 lay at the SSC/IRb border and spanned the SSC and IRb regions, so the copy at the IRa/SSC boundary, which had lost its 5′ end, formed a pseudogene. The variable region of the ycf1 pseudogene at the IRa/SSC boundary was 1106-1118 bp (Fig. 5). Double-strand break repair is considered to be the main mechanism for expansion and contraction of the IR region. Large contractions of the IR region are relatively rare.

Phylogenetic analyses
There have been many efforts to reconstruct the phylogenetic trees of plants of the genus Rosa. Several scholars have proposed that the extant classification system is artificial [12,13], and that the interspecies relationships of Rosa are still ambiguous. The availability of complete chloroplast genomes can provide further information for reconstructing a robust phylogeny for Rosa. An NJ tree was constructed for the cp genomes of 18 species of the Rosaceae family (Fig. 6). Species from the genus Rosa formed a monophyletic clade. Furthermore, R. laevigata, R. rugosa, R. canina and R. chinensis were divided into different sub-clades and could be differentiated from each other efficiently. Among them, R. chinensis had a closer relationship with R. rugosa.

Discussion
In this study, we analyzed the cp genomes of R. laevigata, R. rugosa and R. canina, which are used in TCM formulations. The cp genomes of R. laevigata, R. rugosa and R. canina showed high similarities in terms of genome size, gene classes, gene sequences, codon usage, and distribution of repeat sequences. This is partly because of the extremely low levels of sequence divergence observed across the genus Rosa [54,55]. Some intergenic regions with a high degree of variation were identified, which could be used as barcodes for species identification. We also investigated introns in all predicted genes of the three Rosa species. Intron and/or gene losses have been reported for cp genomes [56][57][58]. Introns have important roles in the regulation of gene expression [59], and they can control gene expression temporally and in a tissue-specific manner [60,61]. Scholars have reported on the mechanisms by which introns regulate gene expression in plants and animals [62][63][64]. However, the relationship between intron loss and gene expression in the genus Rosa, as assessed using the transcriptome, has not been published. More experimental work on the roles of introns is needed. Comparative analysis of gene location in R. laevigata, R. rugosa, R. canina and R. chinensis revealed a pseudogene of ycf1, which may provide a basis for studying variations in the cp genomes of higher plants or algae. Phylogenetic analyses revealed that the genus Rosa formed a monophyletic clade (Fig. 6), while the intrafamily relationships were almost in agreement with those from studies by Zhang and Marie et al. [13,65]. However, the exact phylogenetic position of some basal taxa needs further verification. For example, the phylogenetic relationship of R. rugosa and R. chinensis found here contradicts what was previously reported, and two R. rugosa samples were clustered into two different clades.
A possible reason is that phylogeny reconstruction in roses is complicated by interspecific hybridization; some studies have suggested that interspecific hybridization is frequent in the genus Rosa [11][66][67][68][69]. Indeed, several contradictions between plastid and nuclear gene phylogenies of the genus Rosa were discovered in a previous study [55]. In addition, the publication of numerous names given to morphological variants and hybrids has further complicated Rosa taxonomy [70]. Further identification of plant material or sequencing of those hybrids could explain why conspecific samples sometimes fall into distinct clades [12].

Conclusions
The whole cp genomes of R. laevigata, R. rugosa and R. canina were sequenced and analyzed in this study. The status of the major taxa within the genus Rosa was consistent with our results from sequencing of the cp genomes. R. laevigata, R. rugosa, R. canina and R. chinensis could be differentiated from other Rosa species efficiently. Our data reveal that cp genomes can be used for the identification and classification of Rosa species. Our results can aid studies on molecular identification and genetic transformation, and lay a theoretical foundation for the discovery of disease-resistance genes and cultivation of Rosa species.