| New drug development is a long and complex process that includes target confirmation, screening and optimization of target molecules, preclinical animal experiments, and clinical trials. With the rapid development of software and hardware, computers can now assist drug research and development at every stage. In this work, we apply machine learning methods to the screening of target molecules and the estimation of pharmacokinetic properties. The research covers two aspects: the first is the prediction of DNA binders based on machine learning methods, which supports the screening of DNA-targeting molecules and the filtering of molecules in searches targeting DNA-binding proteins (Chapter 2); the second is the prediction of human oral exposure based on machine learning methods, which can assist in lead compound optimization and drug candidate selection (Chapter 3). Although proteins are the primary drug targets, DNA remains a valuable target in cancer therapy and in anti-infective and antiviral research. When searching for small molecules that target DNA-binding proteins, additional small molecule-DNA binding experiments are needed to exclude false positives caused by the small molecules binding to DNA itself. A DNA binder prediction method makes it possible to screen or filter compounds that may bind to DNA before compounds are purchased or synthesized, thereby reducing the risk of failure and the waste of time and resources. In Chapter 2, based on small molecule-DNA binding data collected from the ChEMBL database, we used machine learning and neural network algorithms to construct multiple classification models and a final consensus model. The consensus model performs well on both the training set and the test set: the AUC of five-fold cross-validation on the training set is 0.947, and the AUC on the test set is 0.916. Next, we used similarity-based thresholds to define the applicability domain. Finally, an explanatory analysis of the model was 
carried out. Consistent with previous studies, condensed aromatic hydrocarbons bind readily to DNA, and their flat, sheet-like structures insert easily between DNA base pairs. Meanwhile, the structural alerts extracted with the SARpy software supplement the fragment types found in previous studies. In addition, we validated the predictive accuracy of the model and discovered novel DNA-binding agents through predictions and experiments on an in-house compound library. In conclusion, this work constructs a machine learning-based DNA binder prediction model that can be used to virtually screen or filter compounds that may bind to DNA and to assist the development of drug molecules targeting DNA and DNA-binding proteins. Estimating pharmacokinetic properties in humans is one of the main purposes of non-clinical research during drug development before the first human clinical trials. Commonly used methods currently include allometric scaling, physiologically based pharmacokinetic modeling, and machine learning-based methods. As public and commercial datasets continue to expand, increasing attention is turning to machine learning methods and their predictive potential. In Chapter 3, based on human and rat oral exposure data collected from the Pharmapendium database, we used machine learning algorithms to construct multiple regression models and a final consensus model. The consensus model has good predictive performance on both the training and test sets. The R² of five-fold cross-validation on the training set was 0.674 and the R² on the test set was 0.670, and the percentages of training-set compounds within two-, three-, and five-fold error were 58.4%, 74.5%, and 83.1%, respectively. An analysis of human oral doses shows that the model is applicable when the human dose is greater than 1 mg and performs best when the dose is in the range of 1-100 mg. Finally, an explanatory analysis of the model reveals 
some fingerprint fragments that may be important for prediction. In conclusion, this study constructed a machine learning-based prediction model for human oral exposure, which can help drug researchers optimize lead compounds or select drug candidates. |
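
The fold-error percentages reported for the oral exposure model can be illustrated with a short sketch. It assumes the common definition of fold error as the larger of predicted/observed and observed/predicted; the abstract does not state the exact definition used in the thesis. The function names and the exposure values below are hypothetical and serve only as an illustration.

```python
def fold_error(predicted, observed):
    """Symmetric fold error (assumed definition): the larger of the two ratios
    predicted/observed and observed/predicted, so the result is always >= 1."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("exposure values must be positive")
    return max(predicted / observed, observed / predicted)


def percent_within_fold(predicted, observed, fold):
    """Percentage of compounds whose fold error is at most `fold`."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if fold_error(p, o) <= fold)
    return 100.0 * hits / len(predicted)


# Made-up exposure values (arbitrary units), not data from the thesis.
observed = [10.0, 50.0, 200.0, 1000.0]
predicted = [15.0, 20.0, 700.0, 9000.0]

for k in (2, 3, 5):
    pct = percent_within_fold(predicted, observed, k)
    print(f"within {k}-fold error: {pct:.1f}%")
```

Because the ratio is taken in both directions, over- and under-prediction by the same factor are penalized equally, which is why this kind of metric is often reported alongside R² for exposure predictions that span several orders of magnitude.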