
Research On Some New Methods Of Statistical Learning Based On Chemical Data

Posted on: 2014-11-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 1260330401955245
Subject: Probability theory and mathematical statistics

Abstract/Summary:
As data grow increasingly complex, especially in the fields of structure-activity relationships and spectral analysis, mining the most useful information from such data with statistical learning methods is one of the hot topics in current applied statistics research. Guided by a "data-driven" philosophy and grounded in chemical data, we studied in depth the advantages and disadvantages of several classical statistical methods, such as classification and regression trees, support vector machines, and partial least squares, and proposed several new statistical learning methods. The thesis consists of seven chapters.

Chapter 1 briefly introduces the research background and motivation, reviews the theories and methods of statistical learning for chemical data analysis on which the new methods are founded, and outlines the main content and innovations of the thesis.

In Chapter 2, the constructed tree kernel is proposed for the first time; it is one of the most important innovations of this work. We discuss the classification and regression tree (CART) algorithm in detail and point out that samples under the same terminal node may possess some specific similarity, rather than being limited to class similarity alone. To obtain diverse tree structures, we couple a Monte Carlo procedure with a classification tree algorithm and construct a novel tree kernel using a fuzzy pruning strategy and an ensemble strategy. The fuzzy pruning strategy effectively exploits the information in the inner nodes of a tree without totally destroying its structure, while the ensemble strategy guarantees that the results of the tree kernel are more stable and reliable than those of a single CART model, rather than arising by chance. This was our original motivation for building the tree kernel.
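To illustrate the idea, a tree-ensemble proximity kernel in this spirit can be sketched as follows. This is a minimal sketch, not the thesis implementation: it omits the fuzzy pruning strategy, uses bootstrap resampling as the Monte Carlo step, and defines the kernel entry K[i, j] simply as the fraction of trees in which samples i and j fall into the same terminal node. All function and parameter names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def tree_kernel(X, y, n_trees=100, seed=0):
    """Proximity-style tree kernel: K[i, j] is the fraction of trees in
    which samples i and j land in the same terminal node."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_trees):
        # Monte Carlo resampling gives each tree a different structure.
        idx = rng.choice(n, size=n, replace=True)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])        # class labels dictate the splits
        leaves = tree.apply(X)          # terminal-node id for every sample
        K += (leaves[:, None] == leaves[None, :])
    return K / n_trees

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
K = tree_kernel(X, y)
print(K.shape)        # (60, 60)
```

Because the splits are chosen using the class labels, the resulting kernel is "supervised" in the sense described above, and irrelevant variables that rarely enter the trees contribute little to the proximities.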
In fact, CART carries out a greedy, though not necessarily globally optimal, search over samples and variables, seeking the variable subsets most relevant to classification and the sample subsets with specific similarity under different variable subspaces. The constructed tree kernel has several outstanding advantages. It is "supervised", because the class information dictates the structure of the trees during kernel construction. Because irrelevant variables contribute little to the tree ensemble, they have little influence on the proximity measure, so the tree kernel can easily discover the important variables. Finally, by means of the classification trees, the constructed tree kernel can effectively deal with nonlinear problems.

Then, under the framework of kernel methods, we coupled the novel tree kernel with the support vector machine, partial least squares, and k-nearest neighbors, and presented three new statistical learning methods: the tree kernel support vector machine (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbor (TKk-NN). Three datasets on different categorical bioactivities of compounds were used to test the performance of these methods. The results show that the advantages of the constructed tree kernel effectively improve the traditional methods.

For high-dimensional spectral data, we proposed a novel modeling method, PLSSIS. A difficulty of high-dimensional data analysis lies in multicollinearity and a large amount of redundant information. PLS is usually employed to deal with this case; however, a calibration model including all the variables carries much redundant information, which degrades the prediction ability of the model. By employing PLS regression coefficients and the sure independence screening principle, a novel strategy for stepwise variable selection, named PLS regression combined with sure independence screening (PLSSIS), was developed.
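The coupling with the support vector machine can be sketched with a precomputed kernel. This is again an illustration rather than the TKSVM of the thesis: the tree kernel is stood in for by random-forest leaf proximities, and the `proximity` helper and all parameter settings are assumptions for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# A forest's leaf assignments stand in for the constructed tree kernel here.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
L_tr, L_te = forest.apply(X_tr), forest.apply(X_te)  # (n, n_trees) leaf ids

def proximity(A, B):
    """Fraction of trees in which each pair of rows shares a terminal node."""
    return (A[:, None, :] == B[None, :, :]).mean(axis=2)

# SVC accepts the Gram matrix directly: (n_train, n_train) for fit,
# (n_test, n_train) for predict.
svm = SVC(kernel="precomputed").fit(proximity(L_tr, L_tr), y_tr)
acc = (svm.predict(proximity(L_te, L_tr)) == y_te).mean()
print(round(acc, 3))
```

The same precomputed Gram matrix could in principle be passed to kernel PLS or a kernelized k-NN, which is the pattern behind TKPLS and TKk-NN.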
PLSSIS is a forward iterative algorithm that combines PLS regression with SIS and can quickly and efficiently handle high-dimensional collinear data. On three spectral datasets, our study shows that PLSSIS achieves better prediction than full-spectrum PLS modeling and moving window partial least squares regression (MWPLSR).

Finally, Chapter 7 summarizes the whole thesis and offers an outlook on future work.
Keywords/Search Tags: Statistical learning, Classification and regression tree, Kernel methods, Structure-activity relationship, Cross validation, Support vector machine, Sure independence screening, Partial least squares