Theoretical Prediction Of Drug Toxicity Based On Machine Learning Approaches

Posted on: 2018-06-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T L Lei
GTID: 1314330542951153
Subject: Pharmacology
Abstract/Summary:
Toxicity is one of the main reasons drug candidates fail during development, so evaluating the toxicity of candidates in the early stages of drug development and excluding compounds with relatively high toxicity would effectively improve the efficiency and success rate of drug development. However, in vitro and in vivo experimental testing for most toxicity endpoints is labor-intensive, time-consuming and costly, and there is therefore strong demand for efficient and robust in silico prediction models for high-throughput toxicity screening. In this thesis, using a number of machine learning algorithms, we developed a series of in silico prediction models for acute toxicity, respiratory toxicity and urinary tract toxicity. The performance and applicability of these machine learning algorithms were discussed. The research results and conclusions are as follows:

(1) Based on a comprehensive dataset of rat oral acute toxicity with 7,385 compounds, relevance vector machine (RVM), support vector machine (SVM), k-nearest neighbor regression (k-NN), random forest (RF), local approximate Gaussian process (laGP), multilayer perceptron ensemble (MLPE) and eXtreme gradient boosting (XGBoost) algorithms were employed to construct a series of regression prediction models. Modified chi-square statistics were used to reduce the dimensionality of the hybrid set of molecular descriptors and fingerprints (PubchemFP or SubFP). The RVM with the Laplace kernel function achieved the best prediction performance (q²ext = 0.640 to 0.659). We also constructed several consensus prediction models, the best of which yielded accurate predictions for the test set (q²ext = 0.689). Finally, we analyzed the important molecular descriptors and molecular fingerprints related to acute toxicity.

(2) A dataset of various respiratory toxicity endpoints in mouse was employed to develop multiple regression and classification prediction models using a number of machine learning approaches, including RVM, SVM, regularized random forest (RRF), XGBoost, naive Bayes (NB) and linear discriminant analysis (LDA). To determine the optimal subset of molecular descriptors, a four-tier strategy (normalization, chi-square filtering, univariate rfSBF filtering, and recursive feature elimination based on RF) was used to reduce the dimensionality of the original set of molecular descriptors. Among all of the prediction models, the SVM model with the Laplace kernel function achieved the best quantitative predictions for the test set (q²ext = 0.707), and the XGBoost model gave the best classification predictions for the compounds in the test set (MCC = 0.644, AUC = 0.8935, sensitivity = 82.24%, specificity = 83.21%, and global accuracy = 82.62%). In addition, several approaches were used to analyze the applicability domains of the models. Using the leverage method, 41 response outliers (h_i > 0.004), 23 structurally influential outliers (standard deviation > 3) and 31 influential compounds (Cook's distance > 0.00388) were identified. Finally, the structural features of the compounds predicted with large errors by the best regression model, and of those misclassified by the best classification model, were systematically analyzed.

(3) Based on a dataset of various urinary tract toxicity endpoints in mouse, several algorithms (RVM, SVM, RRF, C5.0, XGBoost, AdaBoost.M1, SVMBoost and RVMBoost) were used to build multiple regression and classification prediction models. The optimal subsets of molecular descriptors for regression and classification were selected using recursive feature elimination based on RF. Among all of the prediction models, the rbfSVMBoost regression model achieved the best quantitative predictions for the test set (q²ext = 0.845), and the rbfSVMBoost classification model gave the best qualitative predictions for the test set (MCC = 0.787, AUC = 0.893, sensitivity = 89.58%, specificity = 94.12%, and global accuracy = 90.77%). In addition, several approaches were used to analyze the applicability domains of the models. Using the leverage method, 3 response outliers (h_i > 0.762), 4 structurally influential outliers (standard deviation > 3) and 10 influential compounds (Cook's distance > 0.02797) were identified. Finally, the structural features of the compounds predicted with large errors by the best regression model, and of those misclassified by the best classification model, were systematically analyzed.

(4) We also tested the performance and applicability of several new machine learning methods. The performance of RVM, XGBoost and SVMBoost is satisfactory, whereas that of RRF and laGP is comparatively poor and needs to be improved.
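
The evaluation statistics quoted in the abstract (external q², MCC, sensitivity, specificity and global accuracy) can be illustrated with a minimal Python sketch. This is not the thesis code: the function names and the example numbers are hypothetical, and the q² variant shown (test-set errors compared against deviations from the training-set mean) is only one common definition, assumed here because the abstract does not state which form was used.

```python
import math


def q2_ext(y_test, y_pred, y_train_mean):
    """External q^2 (assumed variant): 1 - PRESS / SS about the training mean."""
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / ss


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of toxic compounds found
    specificity = tn / (tn + fp)          # fraction of non-toxic compounds found
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical numbers for illustration only (not taken from the thesis).
q2_example = q2_ext([1.0, 2.0, 3.0], [1.5, 2.0, 2.5], y_train_mean=2.0)
metrics = classification_metrics(tp=40, fn=10, tn=45, fp=5)
```

MCC is reported alongside accuracy because it uses all four cells of the confusion matrix, so it stays informative when the toxic and non-toxic classes are imbalanced.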
Keywords/Search Tags:Quantitative Structure-Activity Relationship, Quantitative Structure-Toxicity Relationship, QSAR, QSTR, Toxicity Prediction, Acute Toxicity, Respiratory Toxicity, Urinary Tract Toxicity, Support Vector Machine, Relevance Vector Machine, Random Forest