| The development of science is driven by the needs of daily life. When people encounter something new, they ask whether it is useful and whether it can really solve problems. Quantitative structure-activity relationship (QSAR) methodology now faces the same questions. The QSAR concept arose alongside medicinal chemistry, so it has been useful from the beginning. QSAR is now widely used in biology, chemistry, medicinal chemistry and environmental science, and its endpoints include bioactivity, toxicity, pharmacokinetics (ADME), molecular properties and environment-related endpoints. Researchers try to understand the relationship between microscopic molecular structure and macroscopic behavior, and to identify the structural information related to a given endpoint, in order to facilitate the design or screening of compounds with desired activities. QSAR models are mostly used to predict endpoints for unmeasured or previously unseen compounds. To make such predictions, a QSAR model must be rigorously validated, and high, reliable external predictivity is essential. This dissertation therefore aimed to improve the reliability and predictive ability of QSAR models by examining each step of the modeling process, and to address several aspects of QSAR methodology that need improvement. We discuss the influence of molecular conformations optimized by different methods on the quality of QSAR models, propose several novel modeling strategies, and use two novel nonlinear modeling methods to build QSAR models. Furthermore, we propose a new hybrid QSAR/docking approach for virtual screening of chemical databases and apply it to screen for novel pan-Src Lck inhibitors.
Chapter 1 gives a brief description of QSAR, including its history, principles and research status. We discuss the process of building a stable, reliable and predictive model, with particular emphasis on model validation techniques. To clarify the differences among modeling methods, we classify them from several points of view. Trends and several novel ideas in the QSAR field are also summarized.
In Chapter 2, we discuss a basic problem in QSAR modeling, the lowest-energy conformation used to build a model, and analyze the influence of molecular conformations optimized by different methods on the quality of QSAR models. We used three datasets of different structural complexity: SMF, LckI and NS5BI. From the results we drew the following conclusions: (1) the initial input conformations are very important in structure optimization and may influence the quality of the QSAR model, especially for highly flexible molecules; (2) conformational searching, which aims to find an initial conformation closer to the low-energy conformation, may play an important role in the optimization process; (3) new samples in the test set should be optimized by the same procedure as the training samples if the corresponding endpoint is to be predicted accurately.
In Chapter 3, we discuss two new consensus modeling strategies that we proposed. Consensus modeling, which uses several submodels to make a prediction for a new compound, is a novel strategy in QSAR research. In all published consensus models, the final prediction for a sample is the simple average of the predictions of all submodels (average consensus modeling, ACM). However, it may be more reasonable to give each submodel a different weight (weighted consensus modeling, WCM). In this work, the predictions of all submodels served as variables, and multiple linear regression (MLR) was used to assign a weight to each submodel. We also proposed Q²-guided model selection (QGMS) to guide the selection of submodels. The results indicated that the WCM consensus model based on the QGMS submodel set gave the highest fitting ability and external predictivity.
Combined data splitting-feature selection (CDFS) is also a kind of consensus modeling method. In CDFS, the data are split many times and feature selection is performed on each split. The resulting models are then compared, and the final model is the one whose descriptors are common to all of the resulting models. The shortcoming of CDFS is that it is hard to guarantee that each training set spans the whole descriptor space and thus represents the studied data set. We proposed a different way to build this kind of final model. First, a training set was obtained by a rational data splitting method. Then a model population was established by genetic algorithm-multiple linear regression (GA-MLR) using the training set data only. Descriptors that appeared with high frequency in this population were considered key structural features related to the inhibition activities and were extracted to build the final QSAR model. This strategy was used to analyze 169 aminothiazole-based Lck inhibitors, and the results were satisfactory.
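The difference between average and weighted consensus modeling described for Chapter 3 can be sketched in a few lines. This is only an illustrative sketch, not the dissertation's implementation: the function names (`acm_predict`, `wcm_fit`, `wcm_predict`, `rmse`), the use of an intercept term in the MLR step, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def acm_predict(sub_preds):
    """ACM: simple average over submodels (rows = compounds, columns = submodels)."""
    return np.asarray(sub_preds, dtype=float).mean(axis=1)

def wcm_fit(sub_preds, y):
    """WCM: ordinary least squares with the submodel predictions as variables.

    Returns [intercept, w_1, ..., w_m]. Including an intercept is an assumption.
    """
    P = np.asarray(sub_preds, dtype=float)
    X = np.column_stack([np.ones(P.shape[0]), P])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def wcm_predict(coef, sub_preds):
    """Apply fitted WCM weights to a matrix of submodel predictions."""
    P = np.asarray(sub_preds, dtype=float)
    return coef[0] + P @ coef[1:]

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic illustration only: three hypothetical submodels of differing quality.
rng = np.random.default_rng(0)
y = rng.normal(size=40)
preds = np.column_stack([y + rng.normal(scale=s, size=40) for s in (0.2, 0.5, 1.0)])

coef = wcm_fit(preds, y)
print("ACM training RMSE:", rmse(y, acm_predict(preds)))
print("WCM training RMSE:", rmse(y, wcm_predict(coef, preds)))
print("WCM weights (intercept first):", coef)
```

On the training data the weighted fit can never do worse than the plain average, because the average is one particular choice of weights. That is a statement about fitting only; whether the weights generalize still has to be checked on an external set.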
In Chapter 4, we pointed out a self-contradictory problem in local QSAR prediction and proposed a solution. The commonly used local method is local lazy regression (LLR). It has been shown that any improvement in prediction from LLR depends on the nature of the neighborhood obtained for a given query point. In LLR, leave-one-out cross-validation (LOO-CV) is usually used to optimize the number of neighbors (k), and the model with the lowest LOO-CV error or the highest LOO-CV correlation coefficient is chosen to make predictions. However, LOO-CV is only an internal validation technique, and a good LOO-CV statistic is a necessary but not a sufficient condition for a model to have high predictive power. We therefore proposed a new strategy to improve the predictive ability of LLR models and to assess the accuracy of a query prediction: the number of neighbors k is optimized according to the predictive ability of the local models on an external validation set. This approach was applied to a QSAR study of a series of melanin-concentrating hormone receptor 1 (MCHR1) antagonists. The new strategy showed a clear improvement over both the commonly used LOO-CV-based LLR and the traditional global linear model.
In Chapter 5, we used two novel nonlinear methods to build QSAR models: least squares support vector machines (LS-SVMs) and gene expression programming (GEP). LS-SVMs were used to analyze the structure-activity relationship (SAR) of a series of oxindole-based cyclin-dependent kinase (CDK) inhibitors, and the LS-SVMs classifier assigned the test set samples to the correct class more accurately than the linear discriminant analysis (LDA) classifier. LS-SVMs were then used to build QSAR models for 44 human liver glycogen phosphorylase a (hlGPa) inhibitors and 32 pyrazine-pyridine-based vascular endothelial growth factor receptor 2 (VEGFR2) inhibitors. The resulting nonlinear models performed much better than the linear MLR models. Finally, the nonlinear GEP method was used to analyze the quantitative structure-activity relationship of 62 MCHR1 antagonists. Both the fitting ability and the external predictivity of the GEP model were better than those of the MLR model; in particular, the external R² (R²ext) of 0.819 for the GEP model was much higher than that of the linear model.
In Chapter 6, we proposed a new hybrid QSAR/docking approach for virtual screening of chemical databases and used it to mine a drug database for novel pan-Src Lck inhibitors. As a result, two sulfonylurea derivatives were predicted in silico to be potential Lck inhibitors; they could bind to the active site of the target protein in a mode very similar to that of other reported inhibitors. The key sulfonylurea and hydrophobic substructures can serve as a lead skeleton for further Lck inhibitor design. The proposed strategy successfully combines ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS), taking into account all important structural features of the training samples while guaranteeing the diversity of the training set. The results indicate that the proposed approach is of practical utility and can serve as a general tool for screening chemical databases and discovering lead compounds. |
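The Chapter 4 idea of choosing the number of neighbors k for local lazy regression from an external validation set, rather than from LOO-CV, can be illustrated as follows. This is a minimal sketch under stated assumptions: Euclidean distance in descriptor space, an ordinary least-squares local model with an intercept, RMSE as the selection criterion, and the hypothetical names `local_predict` and `select_k` are all choices made for the example and are not taken from the dissertation.

```python
import numpy as np

def local_predict(X_train, y_train, x_query, k):
    """Fit a linear model on the k nearest training compounds and predict the query.

    Assumes Euclidean distance and a least-squares local model with intercept.
    """
    dist = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dist)[:k]
    A = np.column_stack([np.ones(len(idx)), X_train[idx]])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return float(coef[0] + x_query @ coef[1:])

def select_k(X_train, y_train, X_val, y_val, k_grid):
    """Return (k, rmse) for the k with the lowest RMSE on the external validation set."""
    best_k, best_rmse = None, np.inf
    for k in k_grid:
        preds = np.array([local_predict(X_train, y_train, x, k) for x in X_val])
        err = float(np.sqrt(np.mean((preds - y_val) ** 2)))
        if err < best_rmse:
            best_k, best_rmse = k, err
    return best_k, best_rmse

# Synthetic illustration only: a smooth nonlinear response on two descriptors.
rng = np.random.default_rng(1)
X_tr = rng.uniform(-2, 2, size=(80, 2))
y_tr = np.sin(X_tr[:, 0]) + 0.5 * X_tr[:, 1] ** 2
X_va = rng.uniform(-2, 2, size=(20, 2))
y_va = np.sin(X_va[:, 0]) + 0.5 * X_va[:, 1] ** 2

k_best, rmse_best = select_k(X_tr, y_tr, X_va, y_va, k_grid=[5, 10, 20, 40])
print("selected k:", k_best, "validation RMSE:", rmse_best)
```

In practice the selected k would then be used for query compounds in a separate test set, so that the validation set used for tuning does not also serve as the final measure of predictivity.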