
QSAR Of Chemical Pesticides Based On Support Vector Regression

Posted on: 2012-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: W W Li
GTID: 2213330371950403
Subject: Plant protection
Abstract/Summary:
A central research problem in chemical pesticide science is the development of new pesticides with the desired activity against the diseases, insects and weeds that harm agricultural production. Because these pests are increasingly resistant to traditional pesticides, and consumers demand greener crops, creating new chemical pesticides has become more difficult. Moreover, the traditional approach of synthesizing large numbers of candidate compounds and screening them is time-consuming, laborious, expensive and harmful to the environment, so quantitative structure-activity relationship (QSAR) modeling has come to play an extremely important role in the creation of new pesticides. QSAR research has three major components: obtaining molecular structural descriptors, selecting descriptors, and choosing a modeling method.
To obtain descriptors, we first consulted the literature to get a low-dimensional descriptor vector (generally no more than 10 descriptors, which therefore characterize only part of the information about the compounds). Second, we computed a high-dimensional descriptor vector with the E-Dragon 1.0 software (thousands of descriptors, which can characterize most of the information about the compounds). The descriptors in the low-dimensional vector can be screened with the previously established nonlinear multi-round descriptor screening method, but the high-dimensional screening problem has not been completely solved: nonlinear multi-round worst elimination is too time-consuming for high-dimensional descriptor vectors; stepwise linear regression applies only to linear problems; and principal component analysis replaces the original descriptors with principal components that are linear weighted combinations of many descriptors, so models built on principal components have poor interpretability.
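The cost of multi-round worst elimination can be seen in a minimal sketch. The abstract does not spell out the procedure, so the loop below is an assumption: in each round it drops the descriptor whose removal lowers the leave-one-out MSE the most, and stops when no removal helps. To stay dependency-free, a Nadaraya-Watson kernel regressor stands in for SVR; the function names, the `gamma` value and the synthetic data are all hypothetical.

```python
import math
import random


def rbf_predict(train_x, train_y, query, gamma):
    """Nadaraya-Watson kernel regression: a simple nonlinear stand-in for SVR."""
    weights = [
        math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(row, query)))
        for row in train_x
    ]
    total = sum(weights)
    if total == 0.0:  # query is far from every training point
        return sum(train_y) / len(train_y)
    return sum(w * t for w, t in zip(weights, train_y)) / total


def loo_mse(x, y, columns, gamma=5.0):
    """Leave-one-out mean squared error using only the given descriptor columns."""
    sub = [[row[c] for c in columns] for row in x]
    errors = []
    for i in range(len(sub)):
        train_x = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        errors.append((rbf_predict(train_x, train_y, sub[i], gamma) - y[i]) ** 2)
    return sum(errors) / len(errors)


def worst_elimination(x, y, gamma=5.0):
    """Each round, drop the descriptor whose removal lowers LOO MSE the most."""
    kept = list(range(len(x[0])))
    best = loo_mse(x, y, kept, gamma)
    while len(kept) > 1:
        trials = [
            (loo_mse(x, y, [c for c in kept if c != d], gamma), d) for d in kept
        ]
        trial_mse, worst = min(trials)
        if trial_mse >= best:  # no single removal improves the model: stop
            break
        kept.remove(worst)
        best = trial_mse
    return kept, best


# Synthetic data (not from the thesis): the target depends nonlinearly on
# descriptor 0 only; descriptors 1-3 are pure noise.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(30)]
y = [math.sin(3 * row[0]) for row in X]

full_mse = loo_mse(X, y, [0, 1, 2, 3])
kept, final_mse = worst_elimination(X, y)
print("kept descriptors:", kept)
print("LOO MSE before/after:", full_mse, final_mse)
```

Every round refits the model once per remaining descriptor and once per left-out sample, so the number of fits grows roughly with the square of the descriptor count. That is manageable for about 10 descriptors and impractical for thousands, which is the bottleneck the paragraph above describes.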
The commonly used QSAR modeling methods include multiple linear regression (MLR), stepwise linear regression (SLR), partial least squares regression (PLS), quadratic polynomial regression (QPR) and other linear or quasi-linear models. Although these traditional models are easy to interpret, they have limited power on high-dimensional, nonlinear, small-sample problems. The artificial neural network (ANN) has good nonlinear approximation ability, but it has many drawbacks: the model structure is difficult to determine, interpretability is poor, training is prone to over-fitting or under-fitting, and the optimization may fall into a local minimum. Support vector regression (SVR), which is based on statistical learning theory, handles small sample sets and nonlinearity well, mitigates over-fitting, the curse of dimensionality and local minima, and has superior generalization ability.
To select descriptors soundly in QSAR studies of chemical pesticides, we used ChemDraw and E-Dragon 1.0 to calculate descriptors for three data sets of chemical pesticides. More than 1000 descriptors were obtained, covering up to 24 categories, such as structural descriptors, topological descriptors, 2D autocorrelation descriptors and 3D-MoRSE descriptors. To obtain the descriptors associated with chemical activity, we established an SVR-based method of rapid nonlinear descriptor selection for high-dimensional descriptor vectors, and evaluated it on data sets of pesticides, fungicides and herbicides. Compared with the related literature, the results show the following.
1. Using support vector regression and the descriptors reported in the literature, we carried out QSAR studies of three groups of chemical pesticides: ternary asymmetric organophosphate pesticides (22 samples), new thiazole-containing triazole ring and sub-amine fungicides (17 samples), and 2-hydroxy-3-alkyl-1,4-naphthoquinone herbicides (23 samples).
We first analyzed the descriptors used in the literature and found that most of them were significantly correlated with one another; in the 2-hydroxy-3-alkyl-1,4-naphthoquinone compounds in particular, the correlations between HE and ClogP and between Polar and ClogP reached 0.99 and 1.00 respectively. Because nonlinear relationships between descriptors may exist in addition to these linear ones, we screened the descriptors with the previously established nonlinear multi-round descriptor screening strategy. The MSE, R2 and F statistics all improved after descriptor selection, confirming that this method can effectively remove descriptors that are unrelated to the activity of the compounds or can be replaced by other descriptors, and that it can also screen out the linearly correlated descriptors HE, ClogP and Polar. Finally, the retained descriptors were ranked by single-factor importance analysis, which enhanced the interpretability of the model.
2. The above results show that the common descriptors carry limited information and may not be associated with the activity of a particular compound, so the resulting models do not perform well. Applying the method of rapid nonlinear descriptor selection for high-dimensional descriptor vectors to the software-calculated data for the three groups of chemical pesticides, we retained no more than 8 descriptors with clear meanings. Leave-one-out predictions based on the nonlinearly selected descriptors improved significantly and compare favorably with the reported results. This also shows that the new screening method can effectively select the required descriptors from a large number of candidates.
Furthermore, we applied the SVR regression significance test to the established QSAR model and ranked the descriptors by significance using single-factor analysis, which enhanced the interpretability of the model.
In summary, this thesis established an SVR-based method of rapid nonlinear descriptor selection for high-dimensional descriptor vectors, which provides a theoretical basis for descriptor selection in QSAR studies of chemical pesticides and is widely applicable to the QSAR of other compounds.
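The check for linearly correlated descriptors reported in the abstract (HE, ClogP and Polar in the naphthoquinone set) can be illustrated with a short sketch that flags descriptor pairs whose absolute Pearson correlation exceeds a threshold. The descriptor names, the values and the 0.95 threshold below are illustrative assumptions, not data or settings from the thesis.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlated_pairs(table, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for name_a, name_b in combinations(table, 2):
        r = pearson(table[name_a], table[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up descriptor columns: desc_b is an exact linear function of desc_a,
# desc_c is unrelated to both.
descriptors = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "desc_c": [2.0, -1.0, 4.0, 0.0, 1.0],
}

pairs = correlated_pairs(descriptors)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

A check like this only detects linear redundancy. As the abstract notes, descriptors can also be related nonlinearly, which is why a model-based screening step is applied in addition.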
Keywords/Search Tags: quantitative structure-activity relationship, screening descriptors, support vector regression, chemical pesticides, pesticides, fungicides, herbicides