Materials
Five ginsenoside standards were purchased from the National Institutes for Food and Drug Control (Beijing, China). HPLC-grade acetonitrile was purchased from Fisher (USA). Purified water was prepared using a Milli-Q purification system (Millipore, USA). All other chemicals were of analytical grade.
A total of 106 cultivated AG samples, aged 2 to 4 years, were purchased from Kangmei Pharmaceutical Co., Ltd. and originated from five different areas (Jilin, Liaoning, Shandong, Beijing, and Shanxi). The samples were harvested at the planting bases, rinsed with water, and air-dried locally in accordance with Good Manufacturing Practice (GMP). The cultivation ages were provided by the planting bases and were considered reliable. All herbal medicines were authenticated by Professor Hua Yan, and voucher specimens were deposited in the Institute for Control of Chinese Traditional Medicine and Ethnic Medicine (Beijing, China).
Preparation of standard and sample solutions
Each sample was pulverized and passed through a 60-mesh sieve. One gram of the powder was accurately weighed into a 50 ml conical flask, and 25 ml of methanol was added. The suspension was allowed to stand for 1 h and then extracted by ultrasonication for 30 min at room temperature. Methanol was added to restore the original weight, the extract was mixed thoroughly, and the supernatant was filtered through a 0.45 μm membrane filter to give the test solution. A 20 μl aliquot of the filtrate was injected into the HPLC system for analysis.
Stock solutions (0.5 mg/ml in methanol) of the ginsenoside standards Rg1, Re, Rb1, Rd, and pseudoginsenoside F11 were prepared separately. A 2 ml aliquot of each stock solution was combined and diluted to 25 ml, and the mixture was filtered through a 0.45 μm membrane filter for retention time analysis.
HPLC conditions
HPLC was performed on a Waters 2695–2996 instrument equipped with a diode-array detector, an automatic injector, a thermostatically controlled column compartment, an online degasser, and an Alltech ES2000 evaporative light scattering detector (ELSD). Separation was accomplished on a Grace Vydac 208TP C8 column (250 × 4.6 mm, 5 μm) at 30 °C. The UV detection wavelength was set to 203 nm. The mobile phase was composed of CH3CN (A) and H2O (B). Gradient elution was performed at a flow rate of 1 ml/min as follows: 0–10 min, 20% A; 10–11 min, 25% A; 11–33 min, 33% A; 33–38 min, 46% A; 38–40 min, 80% A; 40–45 min, 100% A; 45–55 min, return to 20% A. The drift tube temperature and the nitrogen gas flow rate of the ELSD were set to 110 °C and 2.5 l/min, respectively.
Calibration curve of ginsenoside standards
A series of reference mixtures containing Rg1, Re, Rb1, Rd, and pseudoginsenoside F11 were obtained by diluting the stock solutions and injected into the HPLC system in triplicate. The standard curves of Rg1, Re, Rb1, and Rd were constructed from the UV peak areas in the chromatogram (dependent variable) and the corresponding amounts of analyte (independent variable), whereas the calibration curve of pseudoginsenoside F11 was built from the peak areas in the ELSD chromatogram. The contents of the five saponins in each sample were calculated using the corresponding calibration curves.
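As a minimal sketch of this calibration step, a first-order curve can be fitted and then inverted to back-calculate sample content; the numbers below are hypothetical placeholders, not measured values.

```python
import numpy as np

# Hypothetical calibration points for one analyte: injected amount (ug)
# versus peak area in the chromatogram.
amounts = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
areas = np.array([1.2e4, 2.4e4, 4.9e4, 9.7e4, 1.96e5])

# First-order calibration curve: area = slope * amount + intercept.
slope, intercept = np.polyfit(amounts, areas, deg=1)

# Back-calculate the analyte amount in a sample from its measured peak area.
sample_area = 6.1e4
sample_amount = (sample_area - intercept) / slope
print(f"estimated amount: {sample_amount:.2f} ug")
```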
Determination of ethanol and aqueous extractives
The weight percentage of ethanol extractives was determined by the hot extraction method, and the content of aqueous extractives was determined by the cold extraction method. Both methods were performed according to the Chinese Pharmacopoeia [10].
Data preparation
To conduct a comprehensive assessment of the ML models, the 106 AG samples were divided into a training set and two external test sets based on their cultivation regions. The training set, used to optimize the hyperparameters of the models, consisted of 64 samples (2–4 years old) collected from Jilin, Liaoning, and Shandong provinces. The two external test sets, used to evaluate the performance and generalizability of the trained models, comprised 42 samples (4 years old) originating from Beijing (test set 1, 25 samples) and Shanxi (test set 2, 17 samples), respectively.
A total of nine features were used to represent each sample. Seven were chemical properties: the contents of the five saponins (Rg1, Re, Rb1, Rd, and F11) and the contents of ethanol and aqueous extractives, as described above. The other two features, the length and weight of each sample, represented the physical profile; the lengths were approximate because the taproot and fibrous roots of most samples had been truncated. All data used in this study are available in Additional file 1: Table S1.
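A minimal sketch of this region-based split, assuming Table S1 has been exported to a CSV file; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical layout of Additional file 1: Table S1.
data = pd.read_csv("Table_S1.csv")
features = ["Rg1", "Re", "Rb1", "Rd", "F11",
            "ethanol_extractives", "aqueous_extractives",
            "length", "weight"]

# Split by cultivation region, as described above.
train = data[data["region"].isin(["Jilin", "Liaoning", "Shandong"])]
test1 = data[data["region"] == "Beijing"]
test2 = data[data["region"] == "Shanxi"]

X_train, y_train = train[features], train["age"]
X_test1, y_test1 = test1[features], test1["age"]
X_test2, y_test2 = test2[features], test2["age"]
```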
Feature distribution of the collected data sets
PCA was carried out to compare the distributions of the training set and the two external test sets. All features were first standardized by subtracting the mean and dividing by the standard deviation. After dimensionality reduction, most of the information in the original descriptors was compressed into the first three principal components (PCs), which were plotted as scatter points in three-dimensional (3D) space and projected onto three two-dimensional (2D) planes to compare the diversity of the three data sets intuitively.
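A minimal sketch of this step, reusing X_train, X_test1, and X_test2 from the sketch above; fitting the PCA on the pooled data is one reasonable reading of the comparison described here, not a confirmed detail.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pool the three data sets, standardize each feature, and project
# onto the first three principal components.
X_all = pd.concat([X_train, X_test1, X_test2])
X_std = StandardScaler().fit_transform(X_all)

pca = PCA(n_components=3)
scores = pca.fit_transform(X_std)
print("explained variance ratios:", pca.explained_variance_ratio_)
# scores[:, :3] gives the PC1-PC3 coordinates for the 3D scatter plot
# and its three 2D projections.
```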
Model building
To obtain a robust and effective model, four machine learning algorithms, elastic net (EN), k-nearest neighbors (KNN), support vector machine (SVM), and multi-layer perceptron (MLP), were selected to construct predictive models. Hyperparameter optimization was performed for all models using a grid search strategy with fivefold cross-validation (a workflow sketch is given after the model descriptions below). The implementation used Python (ver. 3.8.10) and the scikit-learn library (ver. 0.24.1).
EN is a regularized linear regression model. In contrast to ordinary multivariate linear regression, two penalty terms, L1 and L2, are added to the loss function of EN to reduce model complexity. These penalties are tuned through alpha (λ1 + λ2) and l1_ratio [λ1/(λ1 + λ2)] in the ElasticNet function, where λ1 and λ2 are the weights of the L1 and L2 regularization terms, respectively.
KNN regression is a conceptually simple yet useful method based on feature similarity [22]. Two steps are required to predict the label of a query sample. First, the distances between the query sample and the samples in the training set are computed to determine its k nearest neighbors. Then the mean or median value of the labels of these neighbors is taken as the final prediction.
Support vector regression (SVR) is a branch of the SVM. Unlike most regression algorithms, which aim to minimize the prediction error, the goal of SVR is to minimize the generalization error bound, which is determined by a subset of the training data called support vectors. In addition, the kernel trick can be applied to handle nonlinear tasks. Herein, the radial basis function (RBF) was used as the kernel function. Two parameters, the regularization parameter C and the kernel coefficient gamma, were tuned to obtain the optimal model [23].
MLP is a class of feedforward neural networks composed of three types of layers: an input layer, one or more hidden layers, and an output layer [24]. Each layer contains several neurons, and adjacent layers are connected through weighted links. Each neuron in a hidden layer applies an activation function to the weighted sum of its inputs and propagates the result to the nodes in the next layer. During training, the connection weights are updated iteratively by backpropagation. In this work, the hidden-layer neurons used tanh as the activation function. The regularization term alpha, the hidden layer sizes, and the learning rate were optimized for better model performance.
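The sketch below illustrates the grid-search workflow for the four models. The hyperparameter grids are illustrative assumptions (the actual search ranges are not given above), and the standardization step inside the pipeline is likewise an assumption.

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Candidate estimators and hypothetical hyperparameter grids.
models = {
    "EN": (ElasticNet(max_iter=10000),
           {"model__alpha": [0.01, 0.1, 1.0],
            "model__l1_ratio": [0.2, 0.5, 0.8]}),
    "KNN": (KNeighborsRegressor(),
            {"model__n_neighbors": [3, 5, 7, 9]}),
    "SVM": (SVR(kernel="rbf"),
            {"model__C": [0.1, 1, 10],
             "model__gamma": ["scale", 0.01, 0.1]}),
    "MLP": (MLPRegressor(activation="tanh", max_iter=5000),
            {"model__alpha": [1e-4, 1e-2],
             "model__hidden_layer_sizes": [(8,), (16,)],
             "model__learning_rate_init": [1e-3, 1e-2]}),
}

best = {}
for name, (estimator, grid) in models.items():
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    # Fivefold cross-validated grid search, scored by R2 as described below.
    search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```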
Model performance evaluation
Two statistical metrics, the coefficient of determination (R2, Eq. 1) and the mean squared error (MSE, Eq. 2), were computed to evaluate and compare the performance of the four regression models:
$$R^{2} = 1 - \frac{\sum\nolimits_{i = 1}^{n} \left( y_{i}^{pred} - y_{i}^{true} \right)^{2}}{\sum\nolimits_{i = 1}^{n} \left( y_{i}^{true} - \overline{y}^{true} \right)^{2}},$$
(1)
$$MSE = \frac{\sum\nolimits_{i = 1}^{n} \left( y_{i}^{pred} - y_{i}^{true} \right)^{2}}{n},$$
(2)
where \(y_{i}^{pred}\), \(y_{i}^{true}\), and \(\overline{y}^{true}\) are the predicted, actual, and mean actual values of the growth years of AG, respectively. R2 measures the proportion of the total variance in the dependent variable (Y) that is explained by the independent variables (X), and was used in cross-validation and model optimization. MSE measures the average squared error between the actual and estimated values, and was used for model evaluation on the test sets.
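Eqs. 1 and 2 correspond to r2_score and mean_squared_error in scikit-learn; a brief sketch, reusing the fitted models and test sets from the earlier sketches:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate, e.g., the tuned SVM model on external test set 1.
y_pred = best["SVM"].predict(X_test1)
print("R2 :", r2_score(y_test1, y_pred))
print("MSE:", mean_squared_error(y_test1, y_pred))
```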
Determination of applicability domain
Usually, the training set represents only a small subset of the whole feature space because of the limited number of training samples. For query samples that differ substantially from the samples in the training set, the predictions would be highly unreliable and should therefore be excluded. In this regard, the applicability domain (AD) of a model can be defined to estimate whether a new sample is sufficiently similar to the training samples. The threshold of the AD (TAD, Eq. 3) was used here:
$$TAD = \frac{1}{n}\sum\limits_{i = 1}^{n} \overline{d}_{ki} + Z\sigma,$$
(3)
where \(\overline{d}_{ki}\) is the average Euclidean distance between training sample i and its k nearest neighbors in the training set, σ is the standard deviation of these average distances, and Z is an empirical parameter [22, 25, 26]. For a new query sample j, the threshold of the sample (TS, Eq. 4) can be computed and compared with TAD to decide whether the query is within the AD:
$$TS_{j} = \overline{d}_{kj} ,$$
(4)
where \(\overline{d}_{kj}\) is defined in the same way as \(\overline{d}_{ki}\) except that query sample j is used instead of training sample i. If TS > TAD, i.e., the sample is far from its k nearest training points, the prediction may be marked as unreliable, and vice versa. The values of k and Z determine how strictly the AD is defined. For example, as Z decreases, the TAD value declines and the AD shrinks, meaning fewer predictions are regarded as reliable. The impact of k and Z was investigated by estimating the prediction errors of the test samples within the AD; Z was increased from 0.5 to 3 in steps of 0.2, and k varied from 1 to 12.
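A minimal sketch of Eqs. 3 and 4 using scikit-learn's NearestNeighbors; the k and Z values shown are arbitrary examples, and any feature standardization before computing distances is an assumption not specified above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ad_threshold(X_train, k, Z):
    """TAD (Eq. 3): mean of the training samples' average k-NN distances plus Z * sigma."""
    # Request k + 1 neighbors because each training point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    dist, _ = nn.kneighbors(X_train)
    d_ki = dist[:, 1:].mean(axis=1)  # average distance to the k nearest neighbors
    return d_ki.mean() + Z * d_ki.std(), nn

def within_ad(X_query, nn, k, tad):
    """TS_j (Eq. 4) compared against TAD for each query sample."""
    dist, _ = nn.kneighbors(X_query, n_neighbors=k)
    ts = dist.mean(axis=1)           # average distance to the k nearest training points
    return ts <= tad                 # True where the prediction is considered reliable

# Example with arbitrary k and Z (the text scans k = 1..12 and Z = 0.5..3).
tad, nn = ad_threshold(np.asarray(X_train), k=5, Z=1.5)
reliable = within_ad(np.asarray(X_test1), nn, k=5, tad=tad)
```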