
New Chemometric Algorithms In Quantitative Structure-Activity Relationships Studies And High-Dimensional Microarray Data Analysis

Posted on: 2010-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L J Tang
GTID: 1101330338982088
Subject: Analytical Chemistry
Abstract/Summary:
This thesis develops new chemometric algorithms for quantitative structure-activity relationship (QSAR) studies and high-dimensional microarray data analysis.

Support vector machines (SVM) have received increasing interest in QSAR studies because of their function approximation ability and strong generalization performance. However, the selection of support vectors and the intensive optimization of the kernel width in nonlinear SVM tend to become trapped in local optima, which increases the risk of underfitting or overfitting. To overcome these problems, a new nonlinear SVM algorithm is proposed that uses an adaptive kernel transform based on a radial basis function network (RBFN) optimized by particle swarm optimization (PSO). The algorithm applies a nonlinear transform of the original variables to feature space via an RBFN with one input layer and one hidden layer; such a transform intrinsically yields a kernel transform of the original variables. Jointly optimizing all parameters with PSO, including the kernel centers, the kernel widths and the SVM model coefficients, allows a flexible kernel transform to be determined according to the performance of the overall model. The PSO implementation converges to the desired optimum with relatively high efficiency. Applications to QSAR studies of the binding affinities of HIV-1 reverse transcriptase inhibitors and the activities of 1-phenylbenzimidazoles show that the new algorithm outperforms a back-propagation neural network (BPNN) and conventional nonlinear SVM, indicating that it holds great promise for nonlinear SVM learning.

Classification and regression trees (CART) are generally constructed by greedy recursive partitioning. This approach can be successful, but a greedy search necessarily misses regions of the search space, so suboptimal trees and overfitting often occur when a CART is configured.
To circumvent these problems, a modified discrete particle swarm optimization method is used to adaptively configure a globally optimal CART (MPSOCART): the optimal splitting attribute and the corresponding best splitting value for each internal node, together with the appropriate size of the tree, are identified simultaneously. In addition, a new objective function is formulated to determine the appropriate tree architecture, the optimum splitting attributes and their corresponding splitting values. MPSOCART has been used to predict the bioactivities of flavonoid derivatives and the inhibitory activities of epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors. The results were compared with those obtained by PLS and by CART induced with greedy recursive partitioning. The comparison demonstrates that the modified PSO is a useful tool for configuring CART: it converges quickly towards the optimal solution and largely avoids overfitting.

The use of numerous descriptors indicative of molecular structure is becoming common in QSAR studies. Because every descriptor may carry some molecular information, it seems more advisable to consider all available variables than to apply traditional variable selection. Based on the particle swarm optimization algorithm, a more flexible variable selection and modeling method, variable-weighted SVM, is proposed. The variable-weighting strategy assigns non-negative weights to the variables instead of simply removing or retaining them. Using PSO to seek these non-negative weights can be seen as an optimized rescaling of the variables. If PSO is also used to search for the other SVM model parameters at the same time, variable-weighted SVM becomes a fully automatic modeling approach and is therefore more flexible and intelligent than traditional variable selection methods.
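The variable-weighting idea can be illustrated with a small sketch. It assumes an RBF kernel computed on variables rescaled by non-negative weights; the function name `weighted_rbf_kernel`, the `width` parameter and the toy vectors are hypothetical and are not taken from the thesis. In the thesis the weights are sought by PSO, which is not shown here.

```python
import math

def weighted_rbf_kernel(x, y, weights, width=1.0):
    """RBF kernel on variables rescaled by non-negative weights.

    Assumed form for illustration only: each variable j is multiplied by
    weights[j] before the usual squared Euclidean distance is taken.
    """
    if any(w < 0 for w in weights):
        raise ValueError("variable weights must be non-negative")
    sq_dist = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))

# Two toy descriptor vectors (hypothetical values).
x = [1.0, 2.0, 3.0]
y = [1.5, 0.0, 3.0]

# All weights equal to 1: the ordinary RBF kernel.
k_all = weighted_rbf_kernel(x, y, weights=[1.0, 1.0, 1.0])

# A weight of 0 on the second variable has the same effect as removing it.
k_drop = weighted_rbf_kernel(x, y, weights=[1.0, 0.0, 1.0])
```

A weight of zero reproduces conventional variable removal, while intermediate weights only reduce a variable's influence, which is why weighting is more flexible than selection.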
Results for glycogen synthase kinase-3a and carbonic anhydrase II inhibitors indicate that variable-weighted SVM not only performs variable selection but also optimizes the combination of variables in QSAR studies, which helps to obtain QSAR models with improved performance in training and prediction.

One problem with discriminant analysis of microarray data is that each sample is represented by a large number of genes that may be irrelevant, insignificant or redundant. Variable selection methods are therefore of great significance in microarray data analysis. To address this problem, a new gene mining approach is proposed based on the similarity between the probability density functions of each gene for the class of interest and for the other classes. This method identifies significant genes that are informative for discriminating each individual class rather than maximizing the separability of all classes, so genes carrying important information about particular disease subtypes can be selected. Based on the significant genes mined for the individual classes, a support vector machine with block-wise kernel transform is constructed to classify different diseases. The combination of the gene mining approach with the support vector machine is demonstrated for cancer classification using two public data sets. The results show that significant genes are identified for each cancer and that the classification model performs satisfactorily in training and prediction for both data sets.

Considering the differences among within-class samples that arise from different pathogenic mechanisms, another new method for key gene selection is proposed based on interval segmentation purity, defined as the purity of samples belonging to a certain class in intervals segmented by a mode search algorithm.
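The purity definition above can be made concrete with a minimal sketch. It assumes the interval boundaries are already known; in the thesis they come from a mode search algorithm that is not described in the abstract and is not reproduced here. The function name `interval_purity` and the toy data are hypothetical.

```python
from bisect import bisect_right

def interval_purity(values, labels, target, boundaries):
    """Fraction of samples of class `target` in each interval of one gene.

    `boundaries` is a sorted list of cut points and is taken as given
    (an assumption of this sketch; the thesis derives the intervals
    with a mode search algorithm).
    """
    n_intervals = len(boundaries) + 1
    total = [0] * n_intervals
    hits = [0] * n_intervals
    for v, lab in zip(values, labels):
        idx = bisect_right(boundaries, v)  # index of the interval holding v
        total[idx] += 1
        if lab == target:
            hits[idx] += 1
    # Empty intervals get None rather than a made-up purity value.
    return [h / t if t else None for h, t in zip(hits, total)]

# Toy expression values of a single gene (hypothetical data).
values = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
labels = ["A", "A", "A", "B", "B", "A", "A", "A", "A"]

# Purity of class "A" in the intervals (-inf, 1.0), [1.0, 2.0), [2.0, inf).
purity = interval_purity(values, labels, "A", boundaries=[1.0, 2.0])
```

In this toy case class "A" occupies both the low and the high interval with purity 1, while the middle interval is mixed.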
This purity-based method identifies the key variables that are most discriminative for each class, which offers the possibility of unraveling the biological implications of the selected genes. A salient advantage of the new strategy over existing methods is its capability to select genes that are the most discriminative for the classes of interest even when they exhibit a multimodal distribution. Based on the key genes selected for the individual classes, a support vector machine with block-wise kernel transform is employed to model the relationship between the identified optimal gene groups and the class variables. Two public data sets are investigated, and the results demonstrate that the developed gene selection method can identify a key gene set of the smallest size that still gives optimal classification performance.

Gene microarray data are frequently characterized by a small sample size and a large number of variables. As a modeling technique based on large-sample theory, CART tends to be unstable when trained on small samples. Moreover, differences among within-class samples arising from the diversity of disease decrease the homogeneity of within-class samples, resulting in an even more unstable classification tree. In addition, greedy searching over the whole variable (gene) set increases the overfitting risk of CART. To solve these problems, a strategy of unimodal transform of variables selected by interval segmentation purity (UTISP) is suggested for CART. Variable selection is a straightforward way to reduce dimensionality by filtering out irrelevant genes in microarray data, and the interval segmentation purity-based variable selection algorithm has been shown to be a reasonable and reliable approach for identifying key disease-related genes. The unimodal transform of the optimal genes enhances the homogeneity of within-class samples by seeking unimodal features of the genes in feature space.
Applications of the proposed strategy to two data sets reveal that UTISP-CART outperforms k-nearest neighbors and other versions of CART, indicating that UTISP-CART holds great promise for microarray data analysis.
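Several of the methods summarized above rely on particle swarm optimization to search model parameters. The following is a minimal sketch of a generic continuous PSO, not the modified or discrete variants developed in the thesis. The function name `pso_minimize`, the parameter values (inertia 0.7, acceleration constants 1.5) and the `sphere` test objective are assumptions chosen for illustration.

```python
import random

def pso_minimize(objective, dim, n_particles=20, n_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Generic global-best PSO (illustrative; not the thesis variant)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm best so far

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to personal best + pull to swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clip the particle to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(p):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(x * x for x in p)

best, best_val = pso_minimize(sphere, dim=3)
```

In a QSAR setting the objective would instead be a model error, for example the cross-validated prediction error of an SVM whose kernel parameters or variable weights are encoded in the particle position.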
Keywords/Search Tags: Quantitative structure-activity relationship, Particle swarm optimization, Artificial neural networks, Support vector machines, Classification and regression trees, Gene microarray, Variable selection, Kernel transform