Because existing drugs can cause serious side effects and pathogens develop resistance to them, researchers must continually screen for and discover new drugs with good efficacy. Innovation in domestic new drug research and development is now at a critical stage, and methods for discovering drugs with specific therapeutic effects are urgently needed. Data-driven methods can establish the correspondence between molecular structure and a specific clinical therapeutic effect, so that compounds likely to have the target effect can be identified in a massive compound database. In this paper, information on 1132 drugs covering seven classes of therapeutic effect in need of innovation is collected from databases such as KEGG, DrugBank, and PubChem. The five types of drug information most relevant to molecular structure and therapeutic effect form the original drug information set. After this information is checked against the literature, four molecular sets containing different drug structure information are obtained. Based on the detailed drug information in the four sets, the better molecular set is selected for predicting unknown drugs. Classifying drugs well requires a numerical description of molecular structure. A ChemoPy-RDKit (C-R) molecular description is proposed and compared with four existing molecular descriptions. On the classification side, the performance of five common supervised algorithms is compared and, according to the comparison results, they are fused using Dempster-Shafer evidence theory. In addition, an external validation molecular set is used to verify the performance of the classification method and confirm its accuracy. The best classification result is achieved on the molecular set containing the 844 molecular structures most relevant to drug efficacy. The results for single classifiers show that the proposed C-R description gives the highest classification accuracy, and that the support vector machine (SVM) achieves the highest recognition rate among the five single classification methods. Compared with SVM alone, the method obtained by fusing SVM and random forest further improves classification performance. These results show that the correspondence between drug structure and therapeutic effect can be extracted by data-driven methods, that compounds with the target effect can be discovered in a massive compound database, and that reliable predictions can be provided for unknown drugs. Early-stage drug development can thus proceed faster and more economically.
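
The fusion of two classifiers by Dempster-Shafer evidence theory can be illustrated with a minimal Python sketch. It is not the implementation used in this work. It assumes that each classifier outputs class probabilities, that these are discounted by a hypothetical reliability factor so that the remaining mass is assigned to the whole frame of discernment (ignorance), and that the two mass functions are then combined with Dempster's rule. The class labels, probabilities, and reliability values are illustrative placeholders only.

```python
THETA = "THETA"  # the whole frame of discernment, i.e. "class unknown"


def to_mass(probs, reliability):
    """Turn class probabilities into a mass function (assumed discounting scheme).

    Each class receives reliability * p; the rest of the mass goes to THETA.
    """
    mass = {label: reliability * p for label, p in probs.items()}
    mass[THETA] = 1.0 - sum(mass.values())
    return mass


def combine(m1, m2):
    """Dempster's rule for masses whose focal elements are singletons and THETA."""
    labels = [k for k in m1 if k != THETA]
    fused = {}
    for c in labels:
        # {c} results from {c} & {c}, {c} & THETA, and THETA & {c}.
        fused[c] = (m1[c] * m2.get(c, 0.0)
                    + m1[c] * m2[THETA]
                    + m1[THETA] * m2.get(c, 0.0))
    fused[THETA] = m1[THETA] * m2[THETA]
    # Products of two different singletons are conflict; the mass that is
    # left equals 1 - K and is used to renormalise.
    remaining = sum(fused.values())
    if remaining == 0.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return {k: v / remaining for k, v in fused.items()}


# Hypothetical outputs of an SVM and a random forest for one molecule.
svm_probs = {"class_a": 0.6, "class_b": 0.3, "class_c": 0.1}
rf_probs = {"class_a": 0.5, "class_b": 0.4, "class_c": 0.1}

m_svm = to_mass(svm_probs, reliability=0.9)  # reliability values are assumed
m_rf = to_mass(rf_probs, reliability=0.8)

fused = combine(m_svm, m_rf)
predicted = max((k for k in fused if k != THETA), key=fused.get)
print(predicted, fused)
```

When both classifiers favour the same class, the combined mass on that class exceeds the mass either classifier assigns alone and the mass left on ignorance shrinks, which matches the intuition behind fusing SVM with random forest.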