
Prediction Of CYP450 Enzyme-Substrate Selectivity Based On The Network-based Label Space Division Method

Posted on: 2021-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Shan
GTID: 2480306503965639
Subject: Biology
Abstract/Summary:
CYP450 enzymes are widely found in bacteria, fungi, plants, and animals. In humans, CYP450 enzymes are responsible for the redox reactions of exogenous compounds, endogenous substrates, and 90% of common drugs, and they play an important role in ensuring efficacy and controlling drug toxicity. In the development of new drugs, it is important to predict the metabolic pathways of drugs in order to prevent drug-drug interactions. A drug can be metabolized by multiple subtypes of the CYP450 enzyme system, so CYP450 enzyme-substrate selectivity prediction can be defined as a multi-label classification task. In research on this kind of problem, traditional multi-label classification methods, such as the Multi-label k-Nearest Neighbor algorithm (ML-kNN) and the Random k-Labelsets for Multi-label Classification algorithm (RAkEL), do not consider the correlation between labels. Other multi-label classification methods, such as the Label Powerset method, treat each label subset as a separate class; as the number of labels increases, the number of subsets grows exponentially, which causes overfitting. None of these algorithms is well suited to predicting which CYP450 subtypes metabolize a given substrate, so we chose the Network-based Label Space Division (NLSD) method for modeling. NLSD incorporates the correlation structure between labels into training and learns k representative classifiers, which makes it an ensemble of multi-label classifiers. The method makes good use of the correlation information between labels and avoids the overfitting problem.

In this study, we generated four types of features to describe substrate properties: physicochemical property descriptors (PC), mol2vec descriptors (M2V), extended-connectivity fingerprints (ECFP), and MACCS keys fingerprints (MACCS). By evaluating 15 different feature combinations with the baseline model, ML-kNN, we obtained the optimal feature combination, "PC+M2V+ECFP". Based on this combination, we used five algorithms as base classifiers for NLSD: Multi-Layer Perceptron (MLP), eXtreme Gradient Boosting (XGB), Extra Trees (EXT), Random Forest (RF), and Support Vector Machine (SVM).

Under 10-times repeated 5-fold cross-validation and 10-times repeated hold-out validation, all six models we built (ML-kNN, NLSD-MLP, NLSD-XGB, NLSD-EXT, NLSD-RF, and NLSD-SVM) perform better than the previous work. Among them, NLSD-XGB achieves the best performance, with an average Top-1 prediction success of 91.1%, an average Top-2 prediction success of 96.2%, and an average Top-3 prediction success of 98.2%. Compared with the previous work, NLSD-XGB shows a significant improvement of more than 11% in Top-1 under 10-times repeated 5-fold cross-validation and of more than 14% in Top-1 under 10-times repeated hold-out validation. We also verified the single-class prediction ability of the best-performing model, NLSD-XGB. The results showed that the cumulative accuracy of seven predictions made by seven binary XGBoost models was lower than the single-label prediction accuracy of the NLSD-XGB model, which further demonstrates the necessity of applying multi-label learning in this study. To the best of our knowledge, this is the first time the Network-based Label Space Division model has been introduced in drug metabolism, and it performs well in this task.
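Two bookkeeping details in the abstract can be illustrated with a minimal Python sketch: why four descriptor types give 15 feature combinations (every non-empty subset of four items), and one plausible way a Top-k prediction success could be computed. The Top-k definition used here (a substrate counts as a success if at least one true label is among its k highest-scored labels) is an assumption, since the abstract does not define the metric. The isoform names and scores are hypothetical and are not data from the thesis.

```python
from itertools import combinations

# Descriptor abbreviations as given in the abstract.
FEATURE_TYPES = ["PC", "M2V", "ECFP", "MACCS"]


def feature_combinations(types):
    """Return every non-empty combination of descriptor types as an 'A+B' string."""
    combos = []
    for size in range(1, len(types) + 1):
        for subset in combinations(types, size):
            combos.append("+".join(subset))
    return combos


def top_k_success(scores, true_labels, k):
    """Fraction of substrates with at least one true label among the k top-scored labels.

    Assumption: this "any hit in the top k" definition is illustrative only;
    the abstract does not spell out how Top-k success is computed.
    """
    hits = 0
    for label_scores, truth in zip(scores, true_labels):
        # Rank label names from highest to lowest predicted score.
        ranked = sorted(label_scores, key=label_scores.get, reverse=True)
        if truth & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Hypothetical scores for two substrates over three made-up isoform labels.
example_scores = [
    {"3A4": 0.7, "2D6": 0.2, "2C9": 0.1},
    {"3A4": 0.5, "2D6": 0.4, "2C9": 0.1},
]
example_truth = [{"3A4"}, {"2D6", "2C9"}]

if __name__ == "__main__":
    print(len(feature_combinations(FEATURE_TYPES)))
    print(top_k_success(example_scores, example_truth, k=1))
    print(top_k_success(example_scores, example_truth, k=2))
```

With this definition, Top-k success can only stay the same or increase as k grows, which is consistent with the ordering of the Top-1, Top-2, and Top-3 figures reported above.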
Keywords/Search Tags: CYP450, Drug metabolism, Multi-label classification, Network-based Label Space Division