
Studies On A Few Key Problems Of QSAR/QSPR Modeling Based On The OECD Principles

Posted on: 2014-01-26   Degree: Doctor   Type: Dissertation
Country: China   Candidate: X Chen
GTID: 1481304322467144   Subject: Applied Chemistry
ABSTRACT: This dissertation studies a few key problems in QSAR/QSPR (Quantitative Structure-Activity/Property Relationship) modeling according to the requirements of the OECD (Organisation for Economic Co-operation and Development) principles. In addition, a study toward automated bioactivity annotation of large compound libraries was carried out.

In Chapter 1, we discussed the importance of the OECD principles for QSAR/QSPR model validation. Based on the five OECD principles, we proposed that several key problems in QSAR/QSPR modeling need to be studied: how to improve the accuracy and robustness of QSAR/QSPR models, how to define their applicability domain, and how to interpret them.

In Chapter 2, we studied a method for improving the accuracy and robustness of QSAR/QSPR models. We proposed an M-ULDA (Modified Uncorrelated Linear Discriminant Analysis) algorithm coupled with RFE (Recursive Feature Elimination) for feature selection as a powerful QSAR modeling method. QSAR studies on six data sets related to ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) properties and to factor Xa inhibition activity were used to evaluate the performance of the new method. The accuracy and robustness results indicate that the new method is superior to the original method. Comparison with other linear and nonlinear QSAR/QSPR methods showed that the new method provides comparable or better predictive accuracy. In addition, models built with the new method are easier to interpret than those built with nonlinear methods.

In Chapter 3, the studies focused mainly on methods for improving the accuracy and robustness of PLS (Partial Least Squares) models.
We introduced the MC outlier detection method and the random frog variable selection method, both recently developed by our laboratory, into a QSAR model that predicts the retention indices of 237 flavor compounds on four stationary phases of different polarity. The important structural features related to the retention behavior of the flavor compounds on these stationary phases were also explored. The SDEP (Standard Deviation Error of Prediction) and Q² results show that the accuracy and robustness of the PLS model can be significantly improved by using these methods for outlier detection and variable selection. This conclusion was further confirmed by the results of a Monte Carlo test.

In Chapter 4, a comprehensive study was carried out on the accuracy, applicability domain, and interpretation of QSAR/QSPR models. Four data sets covering important bioactivities and toxicities were used. To study accuracy and robustness, we compared the performance of different types of molecular descriptors and modeling methods. The results indicate that fingerprint-type molecular descriptors such as MACCS and PubChem did not reduce the accuracy or robustness of QSAR/QSPR models compared with the theoretical Dragon descriptors. Among the modeling methods studied in this chapter, SVM and RF were superior in the accuracy and stability of their predictions. For the applicability domain, we proposed a novel definition method based on predictive probability and compared it with a commonly used method based on molecular similarity. The assessment results indicate that the new method is superior to the similarity-based method, so it appears reasonable to define the applicability domain of QSAR/QSPR models with the new method.
Furthermore, we found that the method based on the predictive probability of SVM (Support Vector Machine) is better than that based on the predictive probability of RF (Random Forest). For model interpretation, we focused mainly on the effect of variable selection and the use of molecular fingerprints. We concluded that both are very helpful for model interpretation, since they can reveal the important substructures related to the activity or property.

Chapter 5 describes a process that automatically annotates the biochemotypes of compounds in a library and thereby identifies bioactivity-related chemotypes (biochemotypes) in a large compound library. The process consists of two steps: (1) predicting all possible bioactivities for each compound in the library, and (2) deriving possible biochemotypes from the predictions. A commercially available compound library (CACL) of about one million (982,889) compounds was tested with this process. The chapter demonstrates the importance and feasibility of automatically annotating biochemotypes for large compound libraries. Moreover, we suggest ways in which systematic bioactivity prediction programs should be improved. First, a balance must be found between the speed of automated bioactivity annotation and data quality: annotation with the PASS program is very fast, but it is equally important that accuracy not be sacrificed. Second, an ideal systematic bioactivity prediction tool must indicate privileged structures and be trainable by users. Third, the definition of bioactivities (the biochemotype ontology) needs to be better developed in the future.
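
As an illustration of the probability-based applicability domain discussed for Chapter 4, the following is a minimal sketch and not the dissertation's actual procedure, which the abstract does not specify. It assumes that a compound counts as inside the domain when the classifier's highest predicted class probability reaches a cutoff. The function names, the 0.7 cutoff, and the toy probabilities are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def in_domain(probabilities: Sequence[float], threshold: float = 0.7) -> bool:
    """Assumed rule: inside the domain if the top class probability
    reaches the threshold (the cutoff value is a hypothetical choice)."""
    return max(probabilities) >= threshold


def split_by_domain(
    prob_rows: Sequence[Sequence[float]], threshold: float = 0.7
) -> Tuple[List[int], List[int]]:
    """Return the indices of compounds inside and outside the domain."""
    inside: List[int] = []
    outside: List[int] = []
    for i, row in enumerate(prob_rows):
        (inside if in_domain(row, threshold) else outside).append(i)
    return inside, outside


def accuracy(
    prob_rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    indices: Sequence[int],
) -> Optional[float]:
    """Fraction of correct predictions over the given compound indices.

    The predicted class is the index of the largest probability.
    Returns None when no compounds are selected.
    """
    if not indices:
        return None
    hits = 0
    for i in indices:
        row = prob_rows[i]
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == labels[i])
    return hits / len(indices)


# Toy class probabilities (e.g. from an SVM or RF classifier) and true labels.
toy_probs = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80], [0.48, 0.52], [0.10, 0.90]]
toy_labels = [0, 1, 1, 0, 1]

inside, outside = split_by_domain(toy_probs, threshold=0.7)
print("inside domain:", inside, "accuracy:", accuracy(toy_probs, toy_labels, inside))
print("outside domain:", outside, "accuracy:", accuracy(toy_probs, toy_labels, outside))
```

A domain definition of this kind is useful only if predictions inside the domain are clearly more accurate than those outside it, which is the comparison the sketch prints for the toy data.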
Keywords/Search Tags: QSAR/QSPR, OECD Validation Principles, Modeling methods, Molecular fingerprint, Variable selection, Applicability domain, Support Vector Machine, Interpretation of model, Annotating biochemotypes