New Chemometric Algorithms In Multivariate Calibration And Quantitative Structure-Activity Relationships Studies

Posted on: 2008-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y P Zhou
GTID: 1101360215479766
Subject: Analytical Chemistry
Abstract/Summary:
The research in this thesis focuses on multivariate calibration and quantitative structure-activity relationship (QSAR) studies, and on the development of new chemometric algorithms for these two fields.

A novel near-infrared (NIR) spectroscopic measurement technique, the dry film method, has been designed for determining glucose in plasma. The rare earth element ytterbium (Yb) is used in the dry film method as the internal standard to compensate for thickness variation of the dry films. Support vector regression (SVR) has been combined with boosting to develop a boosting support vector regression (BSVR) method for modeling the dry film measurements. The main idea behind BSVR is first to train a sequence of SVR models on differently weighted versions of the original calibration set, and then to combine the predictions of these SVR models into an integrated result. Experimental results show that the interference from water absorption can be largely eliminated and that glucose in plasma can be determined with satisfactory accuracy using the dry film technique coupled with BSVR modeling. Only 50 microliters of sample are required. Moreover, the performance of BSVR compares favorably with that of conventional SVR and partial least squares (PLS). The dry film method, when coupled with BSVR, is expected to hold great potential for NIR spectroscopic analysis of other clinically significant analytes in biofluid samples.

A dry film-based Fourier transform infrared (FT-IR) spectroscopic technique, coupled with BSVR, has been employed for blood glucose assay. Potassium thiocyanate (KSCN) is used in the dry film method as the internal standard to compensate for film thickness variation. Moving window partial least-squares regression (MWPLSR) has been used for wavenumber interval selection before multivariate modeling.
By using the IR spectroscopic dry film technique coupled with BSVR modeling, the interference from water absorption can be largely circumvented, and glucose in plasma can be determined with satisfactory accuracy using only 5 microliters of sample. The performance of BSVR has been compared with that of conventional SVR and PLS, indicating that BSVR is an effective multivariate calibration tool that outperforms both.

Boosting support vector regression (BSVR) has also been applied to QSAR studies of nitrobenzenes and 5-lipoxygenase inhibitors. The results show that boosting markedly enhances the generalization performance of individual SVR models, and that BSVR performs well in QSAR studies and is superior to multiple linear regression (MLR). It can serve as a complementary tool, since experimental assessment may be expensive, hazardous, and time-consuming.

Because outliers often arise in the synthesis of the compounds and boosting is sensitive to outliers, a robust method is needed to carry out QSAR studies effectively. To address these difficulties, a robust version of boosting has been developed to improve the performance of partial least squares (RBPLS). RBPLS builds a sequence of robust PLS models by applying an error-trimming step before the weights are updated for the next cycle, and then integrates the outputs of all the resulting PLS models to obtain the final predictions. In PLS modeling, an F-statistic is used to determine the number of PLS components automatically. RBPLS has been assessed on an angiotensin II antagonist data set, alongside boosting PLS (BPLS) and PLS.
The results reveal that RBPLS shows satisfactory training and prediction performance in the QSAR study of angiotensin II antagonists and is less sensitive to outliers than the other two methods.

Configuring a radial basis function network (RBFN) consists of selecting the network parameters (centers and widths of the RBF units and weights between the hidden and output layers) and the network architecture. Suboptimal solutions and overfitting, however, often occur in RBFN configuration. To address these issues, a hybrid particle swarm optimization (HPSO) algorithm has been used to search simultaneously for the optimal structure and parameters of an RBFN (HPSORBFN) with an ellipsoidal Gaussian function as the basis function. The continuous version of PSO is used for parameter training, while a modified discrete PSO is employed to determine the appropriate network topology. The ellipsoidal Gaussian function is used to increase network flexibility and to alleviate excessive variability in the input variables. In addition, a new fitness function has been formulated to search for the optimum network architecture and the optimum values of the network parameters. The proposed HPSORBFN algorithm has been applied to modeling the inhibitory activities of substituted bis[(acridine-4-carboxamide)propyl]methylamines against murine P388 leukemia cells and the bioactivities of COX-2 inhibitors. The results have been compared with those obtained from RBFNs with parameters optimized by continuous PSO and by a conventional RBFN training algorithm for a fixed network topology, indicating that HPSO is well suited to RBFN configuration in that it converges quickly towards the optimal solution and avoids overfitting.

Support vector machines (SVMs) have been receiving increasing interest in QSAR studies for their function approximation ability and remarkable generalization performance.
However, the selection of support vectors and the intensive optimization of the kernel width in nonlinear SVM tend to get trapped in local optima, increasing the risk of underfitting or overfitting. To overcome these problems, a new nonlinear SVM algorithm has been proposed that uses an adaptive kernel transform based on a radial basis function network (RBFN) optimized by particle swarm optimization (PSO). The new algorithm incorporates a nonlinear transform of the original variables to feature space via an RBFN with one input layer and one hidden layer. Such a transform intrinsically yields a kernel transform of the original variables. A synergetic optimization of all parameters, including kernel centers, kernel widths, and SVM model coefficients, using PSO enables the determination of a flexible kernel transform according to the performance of the total model. The PSO implementation converges to a desired optimum with relatively high efficiency. Applications of the proposed algorithm to QSAR studies of the binding affinities of HIV-1 reverse transcriptase inhibitors and the activities of 1-phenylbenzimidazoles reveal that the new algorithm outperforms back-propagation neural networks (BPNN) and conventional nonlinear SVM, indicating that it holds great promise for nonlinear SVM learning.

A new optimized version of the nonlinear partial least-squares method based on artificial neural network transformation (ANN-NLPLS) has been proposed. The algorithm first transforms the training descriptors into hidden-layer outputs using the universal nonlinear mapping provided by an artificial neural network, and then uses PLS to relate the hidden-layer outputs to the bioactivities. The weights between the input and hidden layers are optimized by PSO, using the model error obtained from PLS modeling as the criterion to be minimized. An F-statistic is introduced to determine the number of PLS components automatically during weight optimization.
The performance is assessed using a simulated data set and two QSAR data sets. Results on these three data sets demonstrate that ANN-NLPLS offers enhanced capacity for modeling nonlinearity while circumventing the overfitting frequently encountered in nonlinear modeling.

Classification and regression trees (CART) are generally constructed by greedy recursive partitioning. This method may be successful; however, the greedy search will necessarily miss regions of the search space, so suboptimal solutions and overfitting often occur in CART construction. To circumvent these problems, a modified discrete particle swarm optimization method has been used to adaptively configure a globally optimal CART (MPSOCART); that is, the optimal splitting attribute and its best splitting value for each internal node, together with the appropriate tree size, are identified simultaneously. In addition, a new objective function has been formulated to determine the appropriate tree architecture and the optimum splitting attributes and their corresponding splitting values. The proposed MPSOCART has been used to predict the bioactivities of flavonoid derivatives and the inhibitory activities of epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors. The results have been compared with those obtained by PLS and by CART induced by greedy recursive partitioning. The comparison demonstrates that MPSO is a useful tool for configuring CART: it converges quickly towards the optimal solution and largely avoids overfitting.
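
Several of the methods summarized above rely on particle swarm optimization to tune model parameters. As a rough illustration only, the following Python sketch shows the textbook continuous global-best PSO, not the hybrid or modified discrete variants developed in the thesis. The function names (pso_minimize, sphere), the coefficient values (inertia 0.7, c1 = c2 = 1.5), the swarm size, and the toy objective are all assumptions chosen for the example; none of them come from the thesis.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; returns (best_position, best_value).

    All default settings are illustrative assumptions, not thesis values.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the search box; velocities start at zero.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position and value so far.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    # The swarm shares the best position found by any particle.
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia term + pull to personal best
                #            + pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective (hypothetical): squared distance from the point (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


best_x, best_f = pso_minimize(sphere, [(-5.0, 5.0)] * 3)
```

In a setting like those described in the abstract, the toy objective would be replaced by a model error (for example, a validation error computed after fitting the downstream model), and each particle position would encode the parameters being tuned.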
Keywords/Search Tags: Multivariate calibration, Quantitative structure-activity relationship, Boosting, Particle swarm optimization, Artificial neural networks, Support vector machines, Classification and regression trees