Studies On New Methods Of Chemo-bio Informatics And Their Applications To Medicine

Posted on: 2014-10-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D S Cao
GTID: 1221330431497843
Subject: Pharmaceutical Engineering
Abstract/Summary:
With the advent of the post-genomic era, drug discovery has shifted from the traditional development mode to one based on systems pharmacology, in which huge amounts of data must be analyzed and processed. Chemo-bio informatics is a burgeoning integrative discipline that studies information content and information flow in drug-related systems. Introducing chemo-bio informatics into drug research can greatly accelerate the development of new drugs, shortening the research period and reducing the research cost. It is closely related to every stage of the drug discovery process, including drug target identification, lead compound discovery, structure modification and optimization, pharmacokinetics research, preclinical and clinical research, and monitoring of adverse drug reactions. However, faced with such complex data, chemo-bio informatics research encounters several difficult problems:
(1) Establishing a high-quality QSAR model usually requires addressing several modeling problems, such as outlier detection, feature selection, and nonlinear modeling.
(2) Effectively extracting information from different levels or scales and integrating it into a hypothesis-testable model has become a grand challenge in chemo-bio informatics and systems biology.
(3) High-throughput screening data and drug clinical data usually have complex characteristics, including nonlinearity, extensive missing values, mixed data types, badly unbalanced data sets, and multiple classes.
(4) Extracting and collecting drug information and providing pharmacologists with easy-to-use tools has become an urgent problem in pharmacoinformatics.
(5) Extracting network features and establishing high-performance network prediction models urgently needs to be studied and solved.
To address these problems, this thesis develops several novel chemo-bio informatics methods and, based on them, studies two important problems in drug discovery: drug ADMET estimation and drug-target interaction prediction. The thesis consists of two parts: basic research (Chapters Ⅱ to Ⅴ) and practical application (Chapters Ⅵ and Ⅶ). Basic research covers novel chemo-bio informatics methods and drug information extraction methods. Practical application covers drug ADMET estimation and drug-target interaction prediction. The main contents of this thesis are:
1. We briefly introduce chemo-bio informatics and its research content, and compare it with chemoinformatics and bioinformatics. Drug information should be extracted systematically in chemo-bio informatics research; that is, it should be extracted at the molecular level, the cellular level, and the organism or even higher levels, and then integrated to understand drug action as a whole. In view of the importance of statistical learning algorithms in chemo-bio informatics, we introduce the commonly used data mining approaches. Given the development of network modeling in chemo-bio informatics, we describe how to construct a network model and analyze some of the difficult questions involved. Finally, we summarize the difficult problems in chemo-bio informatics. In the following chapters, we develop various new chemo-bio informatics approaches to solve these problems. (Chapter Ⅰ)
2. We propose two new methods to detect outliers and select informative features based on the distribution of model features. By studying the distribution of the prediction errors of samples, we find that the statistics of this distribution can effectively distinguish normal samples from various types of outliers.
We develop a Monte-Carlo method for detecting outliers, which can not only identify various types of outliers simultaneously, but also reduce the risk caused by the masking effect. We demonstrate the reliability of our method by comparing it with other outlier detection and robust regression methods. Considering the interaction between sample space and feature space, we construct a consistent framework to simultaneously detect outliers and select informative features. We select informative features using the statistical distribution of model coefficients, and detect outliers using the statistical distribution of the prediction errors of samples. A backward elimination strategy is used to capture the interaction between sample space and feature space. This method is applied to simulated data and QSAR data, and promising results are achieved. (Chapter Ⅱ)
3. We propose several novel pharmaceutical data mining methods by further studying kernel methods and kernel fusion algorithms. Because kernel methods are modular, kernel functions and modeling methods can be considered independently. By selecting different kernel functions and modeling methods, we can construct different kernel models to satisfy various needs. Based on this, we develop a SMILES-based string kernel support vector machine (SVM) to classify toxicity data. We demonstrate the reliability of our method by comparing it with other kernel functions and molecular representation methods. This method does not need to calculate molecular descriptors, and is therefore easy to use. By considering different modeling methods, we develop a kernel k-nearest neighbor algorithm within the framework of kernel methods. By combining different kernel functions, the kernel k-nearest neighbor algorithm can effectively overcome the shortcomings of the original k-nearest neighbor algorithm. When redundant features appear in kernel feature space, the performance of kernel methods is seriously degraded.
To overcome this problem, we develop a two-step algorithm that performs kernel principal component analysis (KPCA) in kernel feature space. KPCA is used to remove uninformative features, and a linear SVM is then used to establish prediction models. The two-step algorithm is applied to QSAR data, and promising results are achieved. (Chapter Ⅲ)
4. We propose several new methods for pharmaceutical and omics data by further studying decision tree (DT) and DT-based ensemble algorithms. In view of the importance of feature selection in pharmaceutical data modeling, we develop a general framework based on DT and DT-based ensemble algorithms to select informative features. By further analyzing the theory of the random forest (RF) algorithm, we develop a feature importance sampling-based adaptive RF (fisaRF) algorithm. The proposed fisaRF is used to classify QSAR data, and the results show that fisaRF performs better than RF. By fully exploiting the advantages of the DT model, we propose a Monte-Carlo tree (MCT) algorithm to analyze the patterns hidden in metabolomics data. MCT is applied to two metabolomics data sets and yields clear classification patterns. Finally, we develop a novel tree kernel Fisher discriminant analysis and apply it to metabolomics data analysis, where a significant performance improvement is observed. (Chapter Ⅳ)
5. In view of the importance of molecular representation in chemo-bio informatics, we develop four software packages to extract the features of complex molecules, and establish a web-based server for calculating molecular features. The four software packages are as follows:
(1) ChemoPy, for calculating drug descriptors;
(2) ProPy, for calculating protein sequence features;
(3) PyNet, for calculating network descriptors;
(4) PyDPI, for calculating descriptors of drug-target interaction pairs and protein-protein interaction pairs.
These packages and the web-based server can help pharmacologists and biologists represent and analyze complex data. (Chapter Ⅴ)
6. We develop several in silico methods to predict drug ADMET and physicochemical properties, and finally establish a drug ADMET database and a web-based online prediction platform. In view of the importance of aqueous solubility in drug discovery, we propose three models for predicting drug aqueous solubility. By analyzing the selected molecular descriptors, we identify some factors that influence aqueous solubility. Based on the modified RF from Section 4 and substructure fingerprints, we predict the human maximum recommended daily dose, and find some feature fragments related to drug toxicity. This method can estimate the human maximum recommended daily dose in advance of phase I human clinical trials. Based on RF and molecular fingerprints, we develop a more general model framework to predict the toxicity of chemical compounds. By developing a 2D-QSAR model, we evaluate the inhibitory effect of 100 structurally diverse natural products on the uptake of estrone-3-sulfate (E3S) by OATP1B1. Several structural factors influencing the inhibitory effect are identified. With the developed 2D-QSAR model, we obtain some in-depth insights into natural product-drug interactions. Finally, to make drug ADMET estimation easier for pharmacologists, we establish a drug ADMET database and a web-based online prediction platform. (Chapter Ⅵ)
7. Based on chemogenomics, we propose a genome-scale method to predict drug-target interactions. Using the Ki binding constant, we divide all drug-target pairs into positive samples and negative samples. A chemogenomics approach is used to represent drug-target interactions: a drug-target pair is represented by simultaneously considering drug descriptors and protein descriptors. Random forest is used to construct the final prediction model.
Prediction results and further analysis demonstrate the reliability of our proposed chemogenomics framework. By predicting unknown drug-target pairs and analyzing their network, we illustrate therapeutic polypharmacology. The proposed method provides an effective way to study the behavior of drugs and targets. A web-based prediction server, PreDPI-Ki, is established to facilitate the use of our method. Building on the above chemogenomics framework, we introduce network descriptors, which significantly improve the prediction performance of our model. A comparison of different prediction methods on four standard data sets demonstrates the importance of network features. By combining chemical, biological and network features, we construct a prediction model to systematically screen all drug-target pairs, and find several novel drug-target interactions that have been experimentally verified. This provides an alternative starting point for repurposing old drugs and identifying targets. Finally, we assess the pressor mechanism of 26 photochemical compounds using in silico compound-protein interaction prediction. (Chapter Ⅶ)...
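The Monte-Carlo outlier detection idea in item 2 can be illustrated with a minimal sketch. It assumes a one-descriptor least-squares model and uses the mean and standard deviation of each sample's prediction errors as the summary statistics; the abstract does not specify the model, the statistics, or the number of resampling runs, so the function names, the 70% training fraction, and the toy data are all hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single descriptor)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mc_error_stats(xs, ys, n_runs=300, train_frac=0.7, seed=0):
    """Repeat random train/test splits and collect, for every sample,
    the prediction errors obtained whenever it falls in the test set.
    Returns one (mean error, standard deviation) pair per sample."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = int(train_frac * n)
    errors = [[] for _ in range(n)]
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            errors[i].append(ys[i] - (a + b * xs[i]))
    return [(statistics.mean(e), statistics.stdev(e)) for e in errors]


# Toy data (invented for illustration): a clean linear trend with one
# response outlier injected at index 7.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + 0.1 * math.sin(x) for x in xs]
ys[7] += 15.0

stats = mc_error_stats(xs, ys)
suspect = max(range(len(xs)), key=lambda i: abs(stats[i][0]))
```

Because the injected sample is always predicted by models that never saw it, its mean prediction error stays close to the injected shift, while the other samples are only mildly affected when the outlier happens to be in the training set. Inspecting the spread of the errors alongside the mean is one way such a scheme could separate different outlier types.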
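The modularity described in item 3, where the kernel function and the modeling method are chosen independently, can be sketched for a kernel k-nearest neighbor classifier. The sketch relies on the standard identity for squared distance in kernel feature space, ||φ(a) − φ(b)||² = k(a,a) − 2k(a,b) + k(b,b). The RBF kernel, the value of gamma, the function names, and the toy data are assumptions for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; any other valid kernel could be passed instead."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def kernel_distance_sq(kernel, a, b):
    """Squared distance between a and b in the kernel-induced feature space."""
    return kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)


def kernel_knn_predict(kernel, train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query,
    where closeness is measured in kernel feature space."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: kernel_distance_sq(kernel, train_x[i], query),
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy descriptors and labels, invented for illustration.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

pred_low = kernel_knn_predict(rbf_kernel, train_x, train_y, (0.5, 0.5))
pred_high = kernel_knn_predict(rbf_kernel, train_x, train_y, (5.5, 5.5))
```

With an RBF kernel the neighbor ranking coincides with ordinary Euclidean k-nearest neighbor, so the benefit only appears when a different kernel is plugged in, for example a string kernel that compares SMILES strings directly; the classifier code itself would not need to change.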
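The chemogenomics representation in item 7 can be sketched as a small data-preparation step: each drug-target pair becomes one feature vector built from the drug descriptors and the protein descriptors, and a binding-constant cutoff assigns the class label. Simple concatenation, the single cutoff of 1000 nM, the identifiers, and all numeric values below are hypothetical; the abstract does not state how the descriptors are combined or which threshold is used.

```python
# Hypothetical cutoff in nM; the source does not give the value used.
KI_THRESHOLD_NM = 1000.0


def pair_features(drug_desc, protein_desc):
    """Represent a drug-target pair by concatenating both descriptor vectors."""
    return list(drug_desc) + list(protein_desc)


def label_from_ki(ki_nm, threshold_nm=KI_THRESHOLD_NM):
    """1 = positive (interacting) pair, 0 = negative pair."""
    return 1 if ki_nm <= threshold_nm else 0


def build_dataset(drugs, proteins, ki_records):
    """Turn (drug_id, protein_id, Ki) records into a feature matrix X
    and a label vector y suitable for a classifier."""
    X, y = [], []
    for drug_id, protein_id, ki_nm in ki_records:
        X.append(pair_features(drugs[drug_id], proteins[protein_id]))
        y.append(label_from_ki(ki_nm))
    return X, y


# Invented descriptor values and Ki measurements, for illustration only.
drugs = {"drugA": [0.2, 1.5], "drugB": [0.9, 0.3]}
proteins = {"protX": [0.1, 0.7, 0.4]}
ki_records = [("drugA", "protX", 12.0), ("drugB", "protX", 25000.0)]

X, y = build_dataset(drugs, proteins, ki_records)
```

The resulting `X` and `y` could then be passed to any random forest implementation. Representing the pair rather than the drug alone is what would let a single model score combinations of drugs and targets that were never measured together.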
Keywords/Search Tags: Chemo-bio informatics, Chemoinformatics, Bioinformatics, Systems pharmacology, Drug discovery, Machine Learning, Data mining, QSAR, Molecular representing, Kernel methods, Decision tree, Network prediction