Method design
The DILI dataset was collected from public databases and the published literature. Nine machine learning models and one deep learning model were constructed on the combined DILI dataset. The model with the best performance was then chosen to screen for hepatotoxic compounds in TCM-WMC.
DILI dataset collection
The compounds in the combined DILI dataset were retrieved from DILIrank [18], LiverTox [19], LTKB [20], and Hepatox [21]. The annotations in DILIrank are assigned to four severity classes by considering DILI-related market withdrawals and warnings [18]. LiverTox contains comprehensive, evidence-based information on drug-, dietary supplement-, and herbal-induced liver injury [19]. The Liver Toxicity Knowledge Base (LTKB) contains drugs whose potential to cause DILI in humans is assessed from FDA-approved prescription drug labels [20]. Hepatox is a database built from the file of hepatotoxic drugs published every year in Gastroentérologie Clinique et Biologique [21]. The keywords “liver damage”, “Drug-induced liver injury (DILI)”, “hepatotoxicity”, “liver toxicity”, “liver failure”, “liver injury”, “hepatitis”, “jaundice”, “cholestasis”, “liver protection”, “hepatoprotective”, “hepatoprotection”, and “Herb-induced liver injury (HILI)” were searched in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), Nature (https://www.nature.com/), Science Online (http://www.sciencemag.org/), Elsevier Science Direct (https://www.Sciencedirect.com), Springer (https://link.springer.com/), Wiley (https://onlinelibrary.wiley.com/), Oxford Academic (https://academic.oup.com/journals/), and other publishers’ databases to identify relevant literature containing DILI data. The search was limited to publications from 1999 to 2021. Duplicates from different sources and compounds without structures were excluded.
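The exclusion step described above (dropping duplicates across sources and compounds without structures) can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `merge_compound_records`, the record fields, and the toy records are all hypothetical, and duplicates are detected by exact SMILES string match, which assumes the strings were already canonicalized by a cheminformatics toolkit.

```python
def merge_compound_records(sources):
    """Merge per-source record lists into one list.

    Records with an empty structure are dropped, and only the first
    occurrence of each SMILES string is kept (assumption: SMILES strings
    are already canonical, so identical compounds have identical strings).
    """
    seen_smiles = set()
    merged = []
    for source_name, records in sources.items():
        for record in records:
            smiles = (record.get("smiles") or "").strip()
            if not smiles:
                continue  # compound without a structure: excluded
            if smiles in seen_smiles:
                continue  # duplicate of a compound from an earlier source
            seen_smiles.add(smiles)
            merged.append(
                {"name": record["name"], "smiles": smiles, "source": source_name}
            )
    return merged


# Toy records for illustration only; not actual database content.
example_sources = {
    "source_A": [
        {"name": "acetaminophen", "smiles": "CC(=O)Nc1ccc(O)cc1"},
        {"name": "unresolved mixture", "smiles": ""},
    ],
    "source_B": [
        {"name": "paracetamol", "smiles": "CC(=O)Nc1ccc(O)cc1"},
        {"name": "isoniazid", "smiles": "NNC(=O)c1ccncc1"},
    ],
}

merged = merge_compound_records(example_sources)
for compound in merged:
    print(compound["name"], compound["source"])
```

In this sketch the first source listed wins when the same structure appears twice; a real pipeline would also need a rule for reconciling conflicting DILI labels between sources, which the text does not specify.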
AI model construction
Chemical structures of the compounds were encoded as SMILES (simplified molecular input line entry system) strings. PaDEL-Descriptor software [22] was used to calculate the molecular descriptors and fingerprints of each compound from its SMILES string. Specifically, the PaDEL 1D and 2D descriptors were calculated for all compounds (Yap, 2011). This set contains 1444 descriptors, including atom-type electrotopological state (E-state) descriptors, Crippen’s logP, and molecular linear free energy relation descriptors.
Nine machine learning (ML) methods, namely SGD (stochastic gradient descent), kNN (k-nearest neighbor), SVM (support vector machine), NB (naive Bayes), DT (decision tree), RF (random forest), ANN (artificial neural network), AdaBoost, and LR (logistic regression), were adopted to build liver injury AI models. A deep belief network (DBN) composed of two restricted Boltzmann machines (RBMs) was also constructed in this research. All of these AI methods were trained on the same dataset, which was randomly divided into a training set and a test set at a ratio of approximately 3:1. The workflow for screening hepatotoxic compounds in TCM-WMC with AI methods is shown in Fig. 1.
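A random division into training and test sets at a ratio of about 3:1 can be sketched as below. This is only an illustration under stated assumptions: the helper `split_train_test`, the seed, and the placeholder compound identifiers are hypothetical, and the text does not say whether the original split was stratified by class, so this sketch is not.

```python
import random


def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle items reproducibly and hold out test_fraction of them.

    test_fraction=0.25 corresponds to a training:test ratio of 3:1.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers standing in for the compounds of the dataset.
compound_ids = [f"compound_{i}" for i in range(100)]
train_ids, test_ids = split_train_test(compound_ids)
print(len(train_ids), len(test_ids))
```

Using one fixed split for every model, as the text describes, keeps the performance comparison between the ten models on equal footing.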
Statistics for model evaluation measures
Five model evaluation measures, namely classification accuracy (CA; Eq. 1), precision (Eq. 2), recall (Eq. 3), F1 score (Eq. 4), and the area under the curve (AUC) of the receiver operating characteristic (ROC), were applied to assess the performance of each model. AUC is the area enclosed by the ROC curve and the coordinate axes. CA is the fraction of all compounds that are classified correctly. Precision is the fraction of compounds predicted as positive that are truly positive, and recall (sensitivity) is the fraction of truly positive compounds that are predicted as positive. The F1 score is the harmonic mean of precision and recall. In Eqs. (1)–(4), TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
(1)
$$\text{Precision} = \frac{TP}{TP + FP}$$
(2)
$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN}$$
(3)
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
(4)
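As a worked illustration of Eqs. (1)–(4), the sketch below computes the four measures from confusion-matrix counts and, separately, computes AUC through its rank interpretation (the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as one half). The function names, the counts, and the scores are hypothetical examples and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 as defined in Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical counts: 40 TP, 30 TN, 10 FP, 20 FN.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(metrics)

# Hypothetical labels (1 = hepatotoxic) and model scores.
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
print(auc)
```

With these counts the accuracy is 0.70, the precision 0.80, the recall about 0.667, and the F1 score about 0.727; the toy scores rank three of the four positive-negative pairs correctly, which gives an AUC of 0.75.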