Machine learning integration of multi-modal analytical data for distinguishing abnormal botanical drugs and its application in Guhong injection

Han, Zhu; Zhao, Jiandong; Tang, Yu; Wang, Yi

doi:10.1186/s13020-023-00873-y

Research
Open access
Published: 02 January 2024

Zhu Han¹,
Jiandong Zhao⁴,
Yu Tang¹,³ (ORCID: orcid.org/0000-0003-1998-3521) &
…
Yi Wang¹,²,³

Chinese Medicine volume 19, Article number: 2 (2024)

Abstract

Background

Determining the batch-to-batch consistency of botanical drugs (BDs) has long been a bottleneck in quality evaluation, primarily because of the chemical diversity inherent in BDs. This diversity is an obstacle to comprehensive standardization. A single detection mode is likely to yield incomplete analytical results because different structural classes have distinct physicochemical properties. Multi-modal data offer a workaround for multi-target standardization, but the way information from diverse sources is processed is critical to classification accuracy.

Methods

In this research, multi-modal data for 78 batches of Guhong injection (GHI), comprising 52 normal and 26 abnormal samples, were acquired by HPLC-UV, HPLC-ELSD, and quantitative ¹H NMR (q¹HNMR). The data from each technique were first used individually for Pearson correlation coefficient (PCC) calculation and partial least squares-discriminant analysis (PLS-DA). A mid-level data fusion method was then applied to the qualitative and quantitative features to establish a support vector machine (SVM) model for evaluating the batch-to-batch consistency of GHIs.

Results

The results showed that data from a single detection mode (e.g., UV detection only) are inadequate for accurately assessing product quality. The mid-level data fusion strategy classified normal and abnormal batches of GHIs with 100% accuracy.

Conclusions

A quality assessment strategy was successfully developed by leveraging a mid-level data fusion method for the batch-to-batch consistency evaluation of GHIs. This study highlights the promising utility of data from different detection modes for the quality evaluation of BDs. It also reminds manufacturers and researchers about the advantages of involving data fusion to handle multi-modal data. Especially when done jointly, this strategy can significantly increase the accuracy of product classification and serve as a capable tool for studies of other BDs.

Introduction

According to one report, over 800 botanical investigational new drug (IND) applications and pre-IND meeting requests were submitted to nearly every review division of the FDA from 1984 to 2018 [1], and the World Health Organization (WHO) has estimated that perhaps 80% of people depend largely on botanical products for their primary health care needs [2]. However, to date, only two botanical drugs (BDs) have been approved by the FDA for marketing as prescription drugs. One of the main factors impeding the approval of botanical products is their chemical complexity, which makes metabolomic analysis laborious. In addition, BDs purportedly exert therapeutic effects through synergistic interactions, so comprehensive chemical analysis is a prerequisite for ensuring the potency of BDs.

Despite rapid advances in analytical methods, holistic standardization of botanical products remains a major challenge because different types of compounds have distinct physicochemical properties. As a result, the standardization of BDs increasingly entails combining analytical techniques based on different principles to capture as many botanical constituents as possible. In practice, LC-based methods such as LC-UV are the most widely used approaches by virtue of the wide availability of instrumentation and the high sensitivity of UV detection [3, 4]. Many studies on the quality evaluation of BDs have been carried out using LC-based methodologies [5,6,7]. Nonetheless, comprehensive chemical analysis of BDs is often considered impractical because identical reference materials (RMs) are needed to identify the analytes. Basically, the integrity of BD characterization increases with the number of markers measured qualitatively and quantitatively. In contrast to LC-based methods, NMR does not require identical RMs. Although NMR is regarded as a relatively insensitive method, it offers more universal detection and can perform multi-target analysis on a single sample [8,9,10]. Owing to the increasing availability of NMR, there are some reports on integrating LC-UV, LC-MS, and NMR for the standardization of complex matrices [11, 12, 28]. However, given the gap between the high demand for healthcare products and the quality consistency of different batches of BDs, the combination of commonly used methods (e.g., LC-based pharmaceutical quality control) with emerging techniques [e.g., quantitative ¹H NMR (q¹HNMR)] to provide scientific evidence for the development of BDs remains underexplored [27].

On the other hand, combining techniques requires effective processing of the experimental datasets they generate. In the quality assessment of BDs, data fusion has proven to be a powerful approach for integrating different kinds of information to support an overall understanding of a product. Mid-level data fusion is one of three levels of data fusion (low, mid, and high); in it, multi-type features are extracted from processed data and fused into a new array to find complementary information [13]. Some studies have used the mid-level data fusion strategy for classification. Chang et al. [14] established a mid-level data fusion approach encompassing GC-FID, UV-Vis, ATR-FT-IR, and HPLC-DAD that produced a better classification result than a single technique in the quality assessment of belamcandae rhizome antiviral injection. Zhang et al. [15] collected NIR and MIR spectra and organized the data via low- and mid-level data fusion methods for rapid detection of the extraction process of a traditional Chinese medicine called Xiao’er Xiaoji Zhike Oral Liquid. Additionally, with the rapid development of artificial intelligence, machine learning algorithms, including but not limited to random forest (RF) [16, 17], support vector machine (SVM) [18], and k-nearest neighbor (KNN) [19], have been adopted as powerful tools for feature extraction and data fusion. Of note, previous data fusion studies mainly involved techniques that reveal the functional groups of compounds of interest. The present study also applied a magnetic resonance technique (NMR) to provide an alternative means of characterizing structures at the level of the whole molecule.

As a proof of concept, this study developed a quality assessment method that uses a mid-level data fusion strategy to integrate qualitative and quantitative data for the standardization of Guhong injection (GHI), a botanical drug formulated as a sterile aqueous solution of safflower extract and aceglutamide and used for treating ischemic stroke [20, 21]. First, HPLC-UV and HPLC-ELSD were used for qualitative analysis with the help of identical RMs, while q¹HNMR was applied for the qualitative and quantitative detection of constituents. Notably, some of the constituents identified by NMR in this study (e.g., amino acids) are largely undetectable by UV and ELSD detectors. Second, qualitative features extracted from the chromatographic fingerprints and quantitative features obtained from q¹HNMR were used for Pearson correlation coefficient (PCC) analysis and partial least squares-discriminant analysis (PLS-DA). SVM, a typical machine learning approach, was used to solve the binary classification problem of discriminating normal from abnormal samples [22]. Finally, the extracted qualitative and quantitative features were organized into a new dataset for SVM modeling. Compared with PCC analysis and PLS-DA of individual features from HPLC-UV, HPLC-ELSD, and q¹HNMR, SVM with fused features reached an accuracy of 100% in classifying normal batches and individually prepared abnormal samples of GHIs. This quality control strategy can not only be regarded as a reliable approach for identifying chemical components and distinguishing abnormal batches of GHIs but can also serve as a useful tool for the standardization of other BDs with complex matrices.

Materials and methods

Reagents

Normal batches of GHIs were provided by Guhong Pharmaceutical Co., Ltd. (Tonghua, China) and labeled N1 to N52 (Additional file 1: Table S1). Abnormal batches were prepared by manually adding HCl or fructose to half of the normal batches (see Additional file 1: Table S2 for details). HPLC-grade solvents were purchased from Merck (Darmstadt, Germany), and methanol-d₄ (99.8 atom % D) with 0.03% (v/v) tetramethylsilane (TMS) was purchased from Cambridge Isotope Laboratories Inc. (Andover, MA, USA). Methyl 3,5-dinitrobenzoate (99.40%) was purchased from Sigma-Aldrich Co. LLC (Switzerland) as the internal calibrant for q¹HNMR analysis. Chemical reference standards were purchased from Yuanye Biotechnology Co. Ltd (Shanghai, China) and Sigma-Aldrich Co. LLC (Switzerland).

Sample preparation

Sample preparation for HPLC-UV analysis

Guhong injection (1.0 mL) was diluted with 20% methanol (4.0 mL) and centrifuged for 10 min at 10,000 rpm. The supernatant was used for HPLC-UV analysis.

Sample preparation for HPLC-ELSD analysis

Guhong injection (1.0 mL) was diluted with 70% acetonitrile (4.0 mL) and centrifuged for 10 min at 10,000 rpm. The supernatant was then used for HPLC-ELSD analysis.

Sample preparation for q¹HNMR analysis

Methyl 3,5-dinitrobenzoate (6.03 mg) was accurately weighed and dissolved in 10 mL of methanol-d₄. An aliquot (600 μL) of this solution was added to each freeze-dried GHI sample, and the resulting solutions were transferred into 5 mm NMR tubes for subsequent q¹HNMR analysis.

HPLC–UV analysis

An Agilent 1100 HPLC system (Agilent Co., USA) equipped with a variable wavelength detector (VWD) was used for HPLC–UV analysis. Chromatographic separation was accomplished on a Waters Atlantis T3 column (4.6 × 250 mm, 5 μm), and the mobile phase consisted of 0.1% formic acid (A) and 70% acetonitrile (B). The gradient elution was programmed as follows: 0–12 min, 4% B; 12–20 min, 4–18% B; 20–30 min, 18–19% B; 30–43 min, 19–34% B; 43–47 min, 34–48% B; 47–56 min, 48–100% B. The total run time was 70 min. The flow rate was 0.9 mL·min⁻¹, the column temperature was maintained at 35 °C, and the injection volume was 10 μL.
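The gradient program above can be read as a piecewise-linear table. As a minimal sketch (the function name `percent_b` and the table layout are illustrative and not part of the published method), the following Python snippet returns the programmed %B at a given time. The composition between 56 min and the end of the 70-min run is not stated in the text, so the sketch does not extrapolate there.

```python
# Gradient segments of the HPLC-UV method: (start_min, end_min, %B at start, %B at end).
GRADIENT_UV = [
    (0.0, 12.0, 4.0, 4.0),
    (12.0, 20.0, 4.0, 18.0),
    (20.0, 30.0, 18.0, 19.0),
    (30.0, 43.0, 19.0, 34.0),
    (43.0, 47.0, 34.0, 48.0),
    (47.0, 56.0, 48.0, 100.0),
]


def percent_b(t_min: float) -> float:
    """Return the programmed %B at time t_min by linear interpolation."""
    for start, end, b_start, b_end in GRADIENT_UV:
        if start <= t_min <= end:
            fraction = (t_min - start) / (end - start)
            return b_start + fraction * (b_end - b_start)
    # Outside 0-56 min the text gives no composition, so do not guess.
    raise ValueError(f"no gradient segment defined for t = {t_min} min")
```

For example, `percent_b(16.0)` falls halfway through the 12–20 min segment and evaluates to 11% B.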

The HPLC-UV analysis method was validated by precision, repeatability, and stability tests, in which the relative standard deviations (RSDs) of the relative retention time (RRT) and relative peak area (RPA) of each characteristic peak with respect to the reference peak were used for evaluation. Specifically, precision was determined by analyzing the same sample six times. Repeatability was evaluated by consecutively analyzing six samples prepared in parallel. Stability was confirmed by testing the same sample at 0, 3, 6, 12, 18, and 24 h.
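The RSD used in these validation tests is the standard deviation of the replicate values divided by their mean. A minimal Python sketch follows; the function name, the use of the sample (n − 1) standard deviation, and the replicate values are assumptions made for illustration, not data from this study.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (%) = sample standard deviation / mean * 100."""
    if len(values) < 2:
        raise ValueError("at least two replicate values are required")
    return stdev(values) / mean(values) * 100.0


# Hypothetical relative peak areas from six replicate injections (not measured data).
replicate_rpa = [0.512, 0.508, 0.515, 0.510, 0.509, 0.513]
precision_rsd = rsd_percent(replicate_rpa)
```

The same function applies unchanged to RRT values or to the stability time points.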

HPLC-ELSD analysis

HPLC-ELSD analysis was performed on an Agilent 1260 HPLC system (Agilent Co., USA). A Prevail Carbohydrate-ES column (250 × 4.6 mm, 5 μm) was used for chromatographic separation. Deionized H₂O (A) and acetonitrile (B) were used as mobile phases. The gradient was as follows: 0–25 min, 88–85% B; 25–45 min, 85–70% B; 45–49 min, 70–60% B; 49–50 min, 60–50% B; and 50–55 min, 50% B. The flow rate was set at 0.7 mL·min⁻¹ and the column temperature at 35 °C. For the ELSD, the evaporator temperature was 60 °C and the nebulizer temperature was 50 °C. The nitrogen flow rate was set at 1.2 L·min⁻¹ and the gain value was 1.

Method validation was accomplished by investigating precision, repeatability, and stability. Precision was estimated by six consecutive injections of a sample. Repeatability was evaluated by consecutively analyzing six samples prepared in parallel. Stability was assessed by testing the same sample at 0, 4, 8, 12, 20, and 24 h.

q¹HNMR analysis

q¹HNMR analysis was performed on a JEOL ECZ-500 spectrometer (Akishima, Tokyo, Japan). Automatic shimming and adjustment of the 90° pulse length were performed before each sample acquisition. The ¹H NMR acquisition parameters were set as follows: the number of scans was 16, the acquisition time (AQ) was 3.2 s, and the spectral width was 20.0 ppm. To ensure fully quantitative conditions for the target signals, the relaxation delay (D₁) was set to 60.0 s.

The q¹HNMR spectra were phase-adjusted, baseline-corrected, and referenced to TMS at 0.000 ppm using MestReNova 14.0.0 software from Mestrelab Research S.L. (Santiago de Compostela, Spain). Because the complex chemical composition of the injections and interference from additives caused spectral peaks to overlap, eight main characteristic peaks were selected for quantification. The obtained peak areas were used to calculate the content of the components according to Eq. (1) [23].

$$C_{x} = \frac{N_{IC} \times A_{x} \times M_{x}}{N_{x} \times A_{IC} \times M_{IC}} \times C_{IC} \tag{1}$$

where N represents the number of integrated hydrogens, A is the absolute integral value, M is the molar mass, C represents the mass concentration, IC is the internal calibrant, and x is the target analyte or molecule.
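Eq. (1) translates directly into a small function. The sketch below is illustrative only: the function name is made up, the calibrant concentration is the nominal value implied by the sample preparation (6.03 mg in 10 mL, without a purity correction, which the text does not describe), and the analyte values are hypothetical.

```python
def qnmr_concentration(n_ic, a_x, m_x, n_x, a_ic, m_ic, c_ic):
    """Mass concentration of analyte x according to Eq. (1).

    n_*: number of integrated hydrogens, a_*: absolute integral values,
    m_*: molar masses, c_ic: mass concentration of the internal calibrant.
    """
    return (n_ic * a_x * m_x) / (n_x * a_ic * m_ic) * c_ic


# Internal calibrant: 6.03 mg methyl 3,5-dinitrobenzoate in 10 mL methanol-d4.
C_IC = 6.03 / 10.0  # mg/mL, nominal (no purity correction applied here)
M_IC = 226.14       # g/mol, methyl 3,5-dinitrobenzoate (C8H6N2O6)

# All analyte-side values below are hypothetical and only illustrate the call.
c_x = qnmr_concentration(
    n_ic=1, a_x=2.0, m_x=180.16, n_x=1, a_ic=1.0, m_ic=M_IC, c_ic=C_IC
)
```

The result carries the units of `c_ic`, here mg/mL of the NMR solution.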

Feature processing

RPA was calculated by dividing the area under the curve (AUC) of the target peak by the AUC of the reference peak in HPLC-UV chromatograms, while the log value of the relative peak area (RLPA) was obtained by dividing the log value of the AUC of the target peak by the log value of the AUC of the reference peak (see Figs. 1 and 2 and the text for details). RPA and RLPA from each batch were separately utilized for creating qualitative feature tables for subsequent analysis and modeling.
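The two feature definitions can be written as a pair of one-line functions. This is a sketch with made-up function names. The base of the logarithm is not specified in the text, but a ratio of two logarithms is the same in any base, so base 10 is used here without loss of generality. Both AUCs must be positive, and the reference AUC must not equal 1, since its logarithm would then be zero.

```python
import math


def rpa(auc_target, auc_reference):
    """Relative peak area: AUC of the target peak over AUC of the reference peak."""
    return auc_target / auc_reference


def rlpa(auc_target, auc_reference):
    """Relative log peak area: log(AUC of target) over log(AUC of reference)."""
    return math.log10(auc_target) / math.log10(auc_reference)
```

With hypothetical AUCs of 100 and 1000, `rpa` gives 0.1 while `rlpa` gives about 0.667, which shows how the log transform compresses large differences in peak area.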

The effective chemical shift range of each q¹HNMR spectrum was determined after phase adjustment and baseline correction. Compounds were characterized based on 1D and 2D NMR in conjunction with reference standards. The characteristic peaks of the identified compounds (see Table 1 for details) were used for content calculation by an internal standard method (see Fig. 3, Table 1, and the text for details). The calculated content of the identified compounds in each batch was tabulated for further analysis and modeling.

Table 1 Characterized compounds by q¹HNMR and their characteristic signals

PCC analysis and PLS-DA

In this study, RPA, RLPA, and the content of characterized compounds were separately used for PCC analysis to determine the similarity between each batch.

$$r = \frac{\sum\limits_{i = 1}^{N} (A_{i} - \overline{A})(B_{i} - \overline{B})}{\sqrt{\sum\limits_{i = 1}^{N} (A_{i} - \overline{A})^{2}} \, \sqrt{\sum\limits_{i = 1}^{N} (B_{i} - \overline{B})^{2}}} \tag{2}$$

where r represents the Pearson correlation coefficient, A_i and B_i represent the reference and target values, respectively, and $\overline{A}$ and $\overline{B}$ represent the mean values of the reference and target datasets, respectively.
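For readers who prefer code to notation, Eq. (2) can be sketched in plain Python as follows. The function name is illustrative, and the inputs are simply two equal-length feature vectors (for example, the RPA values of a reference batch and of a target batch).

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between reference a and target b, Eq. (2)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("a and b must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Denominator: product of the square roots of the summed squared deviations.
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (math.sqrt(ss_a) * math.sqrt(ss_b))
```

The function raises `ZeroDivisionError` if either vector is constant, because the coefficient is undefined in that case.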

Apart from PCC analysis, RPA, RLPA, and the content of characterized compounds of each batch were also imported into SIMCA software (version 14.1, MKS Umetrics AB, Umeå, Sweden) for PLS-DA to evaluate, from an overall perspective, how well the different types of batches could be classified.

Mid-level data fusion

In this study, a mid-level data fusion strategy was established to improve classification accuracy in distinguishing abnormal batches. The strategy of feature extraction and mid-level data fusion is summarized in Scheme 1. First, multi-batch samples were prepared and analyzed by HPLC-UV, HPLC-ELSD, and NMR to obtain chromatographic fingerprints and NMR spectra. Second, RPA and RLPA, as qualitative features of HPLC-UV and HPLC-ELSD, respectively, along with the content of the identified compounds as quantitative features of q¹HNMR, were extracted to create feature tables. Third, the qualitative and quantitative features of each batch were fused into a new data matrix for subsequent modeling. Machine learning is a sensible choice for enhancing the efficiency of data processing and the accuracy of classification, and SVM, a classic machine learning method, is commonly used for binary classification problems. SVM was therefore applied in this study to the multi-modal data for a more accurate quality evaluation of BDs. The SVM model was established in MATLAB (version R2023a). All input data were normalized, and a radial basis function kernel was used for training the SVM classifier. Tenfold cross-validation was used for model validation, and automatic hyperparameter optimization was used to find the hyperparameter values that minimized the tenfold cross-validation loss.
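The fusion and modeling steps can be sketched in Python with scikit-learn. This is not the authors' MATLAB implementation. The data are random and the ELSD column count is hypothetical. The normalization method is not specified in the text, so standardization is assumed here. Hyperparameters are left at library defaults instead of being optimized.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_normal, n_abnormal = 52, 26
n_batches = n_normal + n_abnormal

# Synthetic feature tables, one row per batch (values are random, not measured).
rpa_uv = rng.normal(1.0, 0.05, size=(n_batches, 14))     # 14 HPLC-UV peaks (from the text)
rlpa_elsd = rng.normal(1.0, 0.05, size=(n_batches, 3))   # column count is hypothetical
nmr_content = rng.normal(5.0, 0.3, size=(n_batches, 8))  # 8 q1HNMR characteristic peaks
y = np.array([0] * n_normal + [1] * n_abnormal)          # 0 = normal, 1 = abnormal

# Invented perturbation so the synthetic "abnormal" rows differ from the normal ones.
rlpa_elsd[y == 1, 0] += 0.2

# Mid-level fusion: concatenate the extracted features column-wise into one matrix.
fused = np.hstack([rpa_uv, rlpa_elsd, nmr_content])

# Scale inside the pipeline so each fold is normalized using its training split only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(model, fused, y, cv=cv).mean()
```

Placing the scaler inside the pipeline is a design choice that prevents information from the held-out fold from leaking into the normalization; the text does not say whether the original analysis normalized before or within cross-validation.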

Results and discussion

HPLC-UV analysis

In our previous study, the HPLC-UV fingerprint of GHI was established with a total of 27 labeled peaks, 26 of which were identified by LC-MS [21]. Continuing this work on the standardization of BDs, the present study acquired HPLC-UV chromatograms (Additional file 1: Fig. S1) of another 52 normal and 26 in-house prepared abnormal batches of GHIs according to the previously established method. As shown in Fig. 1A, the shape of peak 2 was distorted in the abnormal batch, while the AUC of peak 16 changed significantly between the normal and abnormal batches, with a relative deviation of 9.12%. In addition, peak 16, which corresponds to the main bioactive marker of GHI, namely hydroxysafflor yellow A, showed good separation from other peaks, appropriate signal intensity, and a reasonable retention time, and was hence selected as the reference peak for RPA calculation. Accordingly, RPA was defined as the AUC of the selected peak (A_t) divided by the AUC of the reference peak (A_s), as shown in Fig. 1B. Taking into account the degree of separation and the signal intensity, 14 peaks (peaks 2, 4, 7, 8, 9, 10, 11, 13, 17, 21, 23, 24, 26, and 27) were selected to create a qualitative feature table of the HPLC-UV data of the 78 batches (Additional file 1: Table S3), which was used for subsequent PCC analysis and PLS-DA. Similarity evaluation of the HPLC-UV data was conducted by PCC analysis, with the RPA values of N49, chosen at random, serving as the reference dataset for calculating the PCC values (Additional file 1: Table S4). With 0.9 set as the threshold value, only A20 among the 26 abnormal samples was identified as abnormal (Fig. 1C). PLS-DA was then used to evaluate the classes of the samples, with the RPA values as the independent variables and the sample types as the dependent variable, as shown in Fig. 1D. Nine normal samples within the two dashed ellipses were not well clustered with the majority of the normal samples. Meanwhile, N31, A26, and A24 fell outside the 95% confidence interval. In addition, a permutation test was used to assess the validity of the PLS-DA model (Additional file 1: Fig. S2). To confirm the reliability of the HPLC-UV fingerprints, method validation was completed by precision, repeatability, and stability analyses; the results are described in the supplementary information (method validation of HPLC-UV fingerprints).
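The PCC-based similarity screening described above amounts to correlating each batch's RPA vector with that of a reference batch and flagging values below the threshold. A minimal NumPy sketch follows; the table is synthetic, the batch labels are hypothetical, and only the 0.9 threshold and the use of a single reference batch are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic RPA table: 5 hypothetical batches x 14 peaks (values are invented).
profile = np.linspace(0.2, 1.5, 14)
rpa_table = profile + rng.normal(0.0, 0.01, size=(5, 14))
rpa_table[4] = profile[::-1]  # one deliberately dissimilar batch
batch_ids = ["B1", "B2", "B3", "B4", "B5"]

reference = rpa_table[0]  # reference batch, analogous to N49 in the text
THRESHOLD = 0.9           # similarity threshold used in the text

# Pearson correlation of every batch against the reference batch.
pcc = np.array([np.corrcoef(reference, row)[0, 1] for row in rpa_table])
flagged = [b for b, r in zip(batch_ids, pcc) if r < THRESHOLD]
```

Here only `"B5"` is flagged. Note that the Pearson coefficient is unchanged by a uniform rescaling of a profile, so a batch whose peaks all shift proportionally would still score close to 1 in this kind of screening.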