| The absorption, distribution, metabolism, excretion, and toxicity (ADMET) of a compound are key factors in determining whether it can become a drug candidate. Evaluating ADMET in preclinical drug studies is critical to reducing the failure rate of new chemical entities (NCEs) in clinical trials. Following the “early failure, cheap failure” strategy, evaluating ADMET properties at an early stage of drug development removes unsuitable compounds in time and thus reduces the cost of expensive late-stage failures. Experimental evaluation of ADMET properties is costly and time-consuming, so it is very important to develop reliable ADMET prediction methods and models. In addition, theoretical prediction methods are not limited by the number or type of compounds tested, and they can reduce reliance on animal experiments and avoid the resulting ethical and legal restrictions. In recent years, great progress has been made in the field of ADMET prediction, but the number of ADMET properties that can be reliably predicted is still quite limited. There are two main reasons for this situation. First, the lack of high-quality experimental data makes it difficult to train high-precision prediction models: some ADMET experimental data are not readily available, so training data are insufficient, which leads to problems such as model overfitting and poor generalization. Second, because of the complexity of some machine learning algorithms, the trained models are often regarded as black boxes; it is therefore difficult to understand the relationship between molecular structures and predicted results, which severely hinders the application of these models in drug design. The recent emergence of new artificial intelligence (AI) technologies, such as multitask learning, pre-training, and transfer learning, has helped to ease these challenges in ADMET prediction, because these techniques can improve the generalization ability of models by sharing features between different tasks. At the same time, explainable
AI (XAI) methods are expected to make models interpretable, thus improving the reliability of their results and applications. New AI technologies therefore offer effective solutions to long-standing challenges in ADMET prediction. This study primarily explores the applications of AI and XAI technologies in ADMET prediction, and the main research contents and results are as follows: (1) In the first part of our study, we investigated the application of descriptor-based methods in ADMET prediction. In Section 2.1, seven machine learning (ML) algorithms were used to construct a series of CYP450 inhibitor prediction models that distinguish inhibitors from non-inhibitors for five major CYP450 subtypes (1A2, 2C9, 2C19, 2D6, and 3A4). The results show that the eXtreme Gradient Boosting (XGBoost) model performed best in predicting CYP450 inhibition on the external test set. We also examined the impact of different descriptors on the performance of the XGBoost model and found that the combination of PubChem molecular fingerprints (PubFP) and PaDEL molecular descriptors (PubFP+PaDEL) provided the most accurate representation of molecules. The XGBoost model based on PubFP+PaDEL achieved the best prediction accuracy on the external test set, with accuracies of 97.4%, 90.1%, 82.3%, 92.8%, and 89.4% for 1A2, 2C9, 2C19, 2D6, and 3A4, respectively. Furthermore, the SHAP method was used to interpret these optimal models and identify the key molecular descriptors and fingerprints. In Section 2.2, we comprehensively evaluated the performance of 16 ML algorithms in quantitative structure-activity relationship (QSAR) modeling. The results show that support vector machines with a radial basis function kernel (rbf-SVM) and XGBoost have the strongest predictive ability across 14 property prediction tasks, with an average coefficient of determination (R²) of 0.831. In addition, we investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms, and the results indicate that
the ensemble of two or three algorithms from different categories can effectively improve the prediction accuracy of the model. (2) In the second part of our study, a Multi-task Graph Attention (MGA) network framework is proposed to learn regression and classification tasks simultaneously. MGA achieved excellent performance on 31 toxicity tasks and can extract general toxicity features for ring substructures. Furthermore, we transferred the general toxicity features learned by MGA to two external toxicity tasks, and the results show that this transfer improves the prediction accuracy of the model. This suggests that MGA can indeed learn general toxicity features and transfer them to new toxicity tasks. Our experiments also demonstrate that customized toxicity fingerprints generated from the general toxicity features can be used to construct high-precision toxicity prediction models with other ML algorithms, not only neural networks. Additionally, MGA provides a novel way to detect structural alerts, and this analysis can assist in discovering the relationships between different toxicity tasks. (3) In the third part of our study, a knowledge-based BERT (K-BERT) pre-training model is proposed that can extract molecular features like a computational chemist. This study compares the performance of K-BERT with several other state-of-the-art ML methods on 15 small druggability datasets. The results show that K-BERT achieves comparable or even better performance than the other methods, indicating that the pre-training strategy of K-BERT is effective and suitable for druggability prediction. Moreover, we observed that pre-training improves the model’s ability to extract molecular features, which is consistent with the effect of data augmentation. Through pre-training on atomic feature prediction tasks, molecular feature prediction tasks, and contrastive learning tasks, K-BERT can extract molecular features and generate a general molecular fingerprint named K-BERT-FP. Our experiments reveal that K-BERT-FP
exhibits predictive power comparable to that of the MACCS fingerprints on 15 druggability datasets and can capture molecular size information that MACCS fingerprints cannot. K-BERT can also tailor its pre-training tasks to specific downstream tasks to generate customized fingerprints. We pre-trained K-BERT on the CHIRAL1 dataset, enabling K-BERT-FP to capture chirality information that traditional molecular fingerprints cannot capture, and K-BERT-FP shows stronger predictive power on chirality-related tasks than traditional molecular fingerprints. (4) In the fourth part of our study, we propose a novel graph neural network interpretability method called substructure mask explanation (SME). This approach aligns with chemists’ expert knowledge, providing more understandable and chemically informed interpretations for the field of cheminformatics. We illustrate the application of SME in four tasks: water solubility, genotoxicity, cardiotoxicity, and blood-brain barrier permeability. The results show that it provides intuitive and chemically meaningful interpretations for all four tasks. By analyzing the functional group attribution values assigned by SME across the whole dataset, we can explore how functional groups affect model predictions, mine structure-activity relationships, and provide guidance and suggestions for structural optimization. Both the experimental results and an actual case of structure optimization demonstrate the validity and rationality of SME attribution-based guidance. Furthermore, SME can help diagnose problems in model predictions and provide direction for further optimization of the prediction models. In addition, the recombination of BRICS substructures based on SME attributions provides a new way to generate molecules conditionally without additional training. (5) In the fifth part of our study, an online platform for ADMET prediction named Inno-ADMET was constructed based on MGA and SME and integrated into CarbonSilicon AI’s DrugFlow drug design software platform. Previously, our
MGA-based ADMETlab 2.0 software platform has received 1.5 million visits from 120 countries and more than 400 academic citations, highlighting its importance for drug discovery and academic research. However, ADMETlab 2.0 still lacks model interpretability. Building on ADMETlab 2.0, we added an SME-based interpretability module and developed a new ADMET prediction platform. |
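
The masking idea behind substructure-level attribution, as used by SME in part (4), can be illustrated with a minimal sketch. It assumes that the attribution of a substructure is the model's prediction on the full molecule minus its prediction with that substructure masked; the abstract does not spell out the exact formula, so this is an illustration and not the authors' implementation. The names `toy_model` and `mask_attributions`, the atom weights, and the substructure labels are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List


def toy_model(visible_atoms: FrozenSet[int]) -> float:
    """Hypothetical stand-in for a trained predictor.

    It scores a molecule from the atom indices that are still visible;
    a real model would be a graph neural network over the masked graph.
    """
    weights = {0: 0.5, 1: 0.5, 2: -1.0, 3: 2.0}  # made-up per-atom contributions
    return sum(weights[i] for i in visible_atoms)


def mask_attributions(
    predict: Callable[[FrozenSet[int]], float],
    n_atoms: int,
    substructures: Dict[str, List[int]],
) -> Dict[str, float]:
    """Assumed attribution rule: f(full molecule) - f(molecule with substructure masked)."""
    all_atoms = frozenset(range(n_atoms))
    full_prediction = predict(all_atoms)
    return {
        name: full_prediction - predict(all_atoms - frozenset(atoms))
        for name, atoms in substructures.items()
    }


# Hypothetical 4-atom molecule split into three named substructures.
attributions = mask_attributions(
    toy_model,
    n_atoms=4,
    substructures={"ring": [0, 1], "nitro": [2], "hydroxyl": [3]},
)
for name, value in attributions.items():
    print(f"{name}: {value:+.1f}")
```

Under this assumed rule, a positive value means the substructure raises the predicted property and a negative value means it lowers it. Collecting such values per functional group over a whole dataset corresponds to the dataset-level analysis described in part (4).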