
Application Of Support Vector Machines (SVM) And Radial Basis Function Neural Networks (RBFNN) In Chemistry, Environmental Chemistry And Medicinal Chemistry

Posted on: 2007-09-17 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: F Luan
GTID: 1101360182994220 | Subject: Analytical Chemistry
Abstract/Summary:
Quantitative structure-property/activity relationship (QSPR/QSAR) studies are important research topics in computational chemistry and chemoinformatics. They have been widely used to predict various physicochemical properties and biological activities of organic compounds, using different statistical methods and various kinds of molecular descriptors. Building a rapid, simple and valid model is one of the central goals of QSPR/QSAR study. Since the modeling method is one of the major factors, it is necessary to search for novel types of learning machine. Building on our group's research on artificial neural networks (ANN) over the past 10 years, the support vector machine (SVM) was introduced in this dissertation to chemistry, environmental chemistry and medicinal chemistry, and was used to predict important properties of organic compounds, environmental pollutants and drugs. We showed the capability of radial basis function neural networks (RBFNN) and SVM in QSPR and QSAR analysis, and their potential utility for solving problems in biology, chemistry and environmental science, through several applications in classification and correlation analysis.

Chapter 1 gave a brief description of the QSPR/QSAR principle, research process and current status, with emphasis on the methods of model building. In this chapter, we also pointed out the shortcomings of present modeling methods such as ANN and then introduced a new machine learning method, the support vector machine, in detail. Finally, we reviewed the application of SVM in the QSPR/QSAR field and discussed its prospects.

In Chapter 2, we applied SVM and RBFNN in chemistry. A brief description is given as follows:

(1) Multiple linear regression (MLR) and SVM were used to develop QSPR models to predict the van der Waals constants of a diverse set of 364 compounds. MLR was used not only to select the molecular descriptors but also to construct the linear model.
For constant a, the SVM models gave a mean squared error (MSE) of 5.96 for the training set, 8.00 for the validation set, 6.67 for the test set and 6.65 for the overall data set. For constant b, the values were 9.56 x 10^-5 for the training set, 3.18 x 10^-4 for the validation set, 4.22 x 10^-4 for the test set and 2.33 x 10^-4 for the whole set.

(2) The Heuristic Method (HM) and SVM were used to develop linear and nonlinear quantitative structure-retention relationship (QSRR) models relating the retention time (RT) of 149 volatile organic compounds (VOCs) to five molecular descriptors. The MSE in RT prediction for the test set was 1.644 for HM and 1.094 for SVM, which showed that the SVM model performed better than the HM model. The predictions agreed well with the experimental values.

(3) A QSPR study was performed by HM and RBFNN on the permeability coefficients of 63 compounds through low-density polyethylene at 21.1 °C. A comparison of our models with those obtained by others showed that their performance was comparable. This implied that the approach is a suitable alternative in the field of polymer science.

In Chapter 3, SVM and RBFNN were applied to environmental chemistry.

(1) SVM, as a novel type of learning machine, was used to develop a classification model for the carcinogenic properties of 148 N-nitroso compounds (NOCs). Seven descriptors, calculated solely from the molecular structures of the compounds and selected by forward stepwise linear discriminant analysis (LDA), were used as inputs to the SVM model. The accuracy of SVM was 97.4% for the training set and 86.6% for the test set. The overall accuracy of SVM was 95.2%, higher than that of LDA (89.8%). It can be concluded that steric and electronic factors are likely the two major factors in carcinogenicity.
The model offered a useful and convenient way to classify the carcinogenicity of N-nitroso compounds.

(2) QSAR models for the binding of 93 polychlorinated dibenzofurans (PCDFs), dibenzodioxins (PCDDs), and biphenyls (PCBs) to the aryl hydrocarbon receptor (AhR) were developed based on HM and SVM. Since various members of the three classes of compounds have been shown to produce qualitatively similar toxicities, the different classes were combined for each bioactivity in one QSAR study. A subset of five molecular descriptors selected by HM in CODESSA was used as input to SVM. The results of the nonlinear SVM model were compared with those of the linear heuristic method, and the SVM predictions were better. The SVM model gave a correlation coefficient (R) of 0.928 and a root-mean-square (RMS) error of 0.324 for the test set, against 0.845 and 0.667, respectively, for the HM model. The work clearly demonstrated that a single QSAR equation could be developed to predict the binding affinity of PCDFs, PCDDs, and PCBs.

(3) Quantitative classification and regression models were developed to predict the sensory irritation (logRD50) of 142 volatile organic chemicals (VOCs). The best classification results were obtained with SVM: the accuracy for the training, test and overall data sets was 96.5%, 85.7% and 94.4%, respectively. Nonlinear regression models were built by RBFNN and SVM. With RBFNN, the RMS errors in prediction for the training, test and overall data sets were 0.4755, 0.6322 and 0.5009 for the reactive group and 0.2430, 0.4798 and 0.3064 for the nonreactive group. The corresponding SVM results were 0.4415, 0.7430 and 0.5140 for the reactive group and 0.3920, 0.4520 and 0.4050 for the nonreactive group.
This work proposed an effective method for screening and evaluating poisonous chemicals.

(4) The rat blood:air partition coefficient (log K_blood) of 100 volatile organic compounds (VOCs) was predicted by QSPR models. Simple molecular descriptors calculated from the molecular structures alone were used to represent the characteristics of the compounds. HM was used to pre-select descriptors from the whole descriptor set and to build the linear model. The HM model gave a squared correlation coefficient (R^2) of 0.8832. These QSPR models provided a rapid, simple and valid way to predict the log K_blood values of VOCs.

In Chapter 4, we introduced SVM and RBFNN to medicinal chemistry. Two studies were carried out:

(1) QSPR studies were carried out to predict the pKa values of a set of 74 neutral and basic drugs by linear and nonlinear methods based on HM and RBFNN, respectively. The linear model had a correlation coefficient (R) of 0.884 with an RMS error of 0.482 for the training set, and R = 0.693 with RMS = 0.987 for the test set. The RMS error in prediction for the overall data set was 0.619. The RBFNN model gave better results: R = 0.886 and RMS = 0.458 for the training set, and R = 0.737 and RMS = 0.613 for the test set. The RMS error in prediction for the overall data set was 0.493. The model is useful for predicting pKa during the discovery of new drugs when experimental data are unavailable.

(2) A QSPR method was used to predict the standard Gibbs energies (ΔG°) of transfer of 54 peptide anions from aqueous solution to nitrobenzene based on HM, RBFNN, and SVM. A comparison of the results of the three methods showed that the nonlinear models performed better than the linear model and that, of the nonlinear models, SVM was better than RBFNN. For the SVM model, the RMS errors of the training set, the prediction set and the whole data set were 1.604, 2.478 and 1.817, and the correlation coefficients were 0.968, 0.947 and 0.962, respectively.
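
The abstract reports MSE, RMS error, correlation coefficient (R) and classification accuracy without defining them. The sketch below shows the conventional definitions of these statistics, assuming they are the ones intended; the dissertation may have used variants. The function names and the toy numbers are hypothetical and are not taken from the dissertation.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rms(y_true, y_pred):
    """Root-mean-square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between experimental and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def accuracy(labels, predicted):
    """Percentage of samples whose predicted class matches the true class."""
    hits = sum(1 for a, b in zip(labels, predicted) if a == b)
    return 100.0 * hits / len(labels)

# Hypothetical toy data, for illustration only (not from the dissertation).
y_exp = [1.0, 2.0, 3.0, 4.0]   # "experimental" property values
y_hat = [1.0, 2.0, 3.0, 6.0]   # "predicted" property values
classes_true = [1, 0, 1, 1]    # e.g. 1 = active, 0 = inactive
classes_pred = [1, 0, 0, 1]

print("MSE:", mse(y_exp, y_hat))
print("RMS:", rms(y_exp, y_hat))
print("R:", round(pearson_r(y_exp, y_hat), 3))
print("R^2:", round(pearson_r(y_exp, y_hat) ** 2, 3))
print("Accuracy (%):", accuracy(classes_true, classes_pred))
```

Under these definitions, a lower MSE or RMS and a higher R on the test set indicate better predictive performance, which is the sense in which the abstract compares SVM, RBFNN, HM and MLR models.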
Keywords/Search Tags: Chemoinformatics, QSPR/QSAR, SVM, RBFNN, Medicinal chemistry