Research On High-dimensional Data Fusion Analysis And Evaluation Methods Of Traditional Chinese Medicine

Posted on: 2016-09-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Z Lin
GTID: 1484304718985369
Subject: Traditional Chinese medicine chemistry
Abstract/Summary:
With the development of separation and detection techniques, massive high-dimensional datasets, each of which can be analyzed independently, have been generated to describe differences in chemical properties. High-throughput x-omics, including genomics, proteomics and metabolomics, provide a systematic characterization of a research object from different physiological aspects. All of these analyses and characterizations are partial, but some are useful. Prediction performance can be improved by merging complementary data from different sources. Additionally, the information contained in high-dimensional, high-throughput data is redundant. Latent variable analysis methods such as PLS (partial least squares) and OPLS (orthogonal PLS) are sensitive to irrelevant variables. Therefore, eliminating irrelevant variables is the first problem to be solved in the fusion analysis of high-dimensional data. Conclusions drawn from TCM (Traditional Chinese Medicine) experiments require strict statistical testing and a check against domain knowledge. The domain-knowledge check makes clear whether the conclusions can be explained using existing knowledge, while the statistical tests determine the credibility of the results. TCM experimental data typically have a small sample size, with the number of variables considerably larger than the number of samples, which poses a huge challenge for both statistical testing and domain-knowledge validation. In view of the challenges faced in high-dimensional data analysis of Chinese medicine, this thesis focuses on the research presented below:

1. Cross-validation methods for classification problems

Evaluating the generalization ability of a model is one of the most important steps in modeling, since it measures the prediction ability of each model. The sample size of a TCM experiment is generally too small to provide a separate, non-overlapping test set for testing the generalization of a model. Therefore, the generalization ability can only be estimated through internal cross-validation. Leave-one-out cross-validation (LOOCV) is the method usually used in the quantitative analysis of small datasets. However, many studies indicate that a model is particularly sensitive to changes in the sample set when the sample size is small. Moreover, the class imbalance that LOOCV introduces into the calibration set is more pronounced when the sample size is small. To deal with this problem, leave-pairs-out cross-validation (LPOCV) is presented in this thesis to perform cross-validation on small-sample classification data. Simulation and experimental results demonstrated the effectiveness of the proposed method. Although LPOCV gives a reasonable estimate of the prediction performance of a small-sample model, the estimate is biased, since an estimate based on small-sample data cannot reflect the general case well. Therefore, cross-validation methods based on bootstrapping and Monte Carlo resampling were proposed, and the objectivity of their estimates was tested on imbalanced and small datasets.

2. Variable selection mechanism

The redundant information in high-dimensional, high-throughput data degrades the prediction performance and interpretability of a model. Generally, variable selection can improve prediction performance and reduce model complexity, but the mechanism of variable selection remains unclear. Studying this mechanism will provide theoretical guidance for choosing among variable selection methods and for establishing new ones. In this study, the effect of different kinds of dependence was investigated to explore the selection mechanism. To investigate the effect of variable dependence on variable selection methods, the original data were resampled and randomly permuted. Then, two kinds of test methods were developed based on the permutation test. Additionally, the permutation test results provide information on the necessity of the tentative biomarkers. Coupled with the internal cross-validation method, a rigorous statistical validation protocol was built.

3. Robustness and visualization of variable selection methods

Biological and medicinal researchers, especially researchers in TCM, focus more on the prediction performance of a model, since it characterizes how well a modeling method fits a specific dataset. But the selected variables directly affect subsequent experiments and the interpretation of the results. Thus, a stable variable set is more important than accurate prediction. Besides, the probability of obtaining a positive result in subsequent experiments improves when a robust variable set is produced. To investigate the robustness of variable selection methods, indicators employing string metrics were first built and examined exhaustively. The results showed that the robustness reflected by Jonathan similarity behaved similarly to that reflected by the frequency distribution of variable sets, so it was used in the following experiments. Although CARS is an efficient and parsimonious variable selection method, its robustness needs to be improved, especially when the data contain more unmodeled variability. Motivated by the standard addition method in analytical chemistry, a method for evaluating the efficiency of variable selection methods was established. To obtain an intuitive understanding of variable selection, a visualization method characterizing the contribution of each variable was also built. With this visualization method, the variable contributions in a nonlinear learning model can be obtained directly.

4. Temporal data fusion

Genomics, proteomics and metabolomics data are redundant but incomplete. These omics techniques describe the physiological and pathological status at different stages of a physiological-biochemical reaction. Bringing these data together facilitates a comprehensive understanding of the reaction process. However, since the physiological and pathological status is measured separately over time, it is natural to analyze these omics data separately at each time point. Despite its convenience, this approach explores the dynamic characteristics of the physiological-biochemical reaction insufficiently, because it breaks the relationship between the various parts. To solve this problem, an array of temporal data fusion methods was developed based on structural regularization and evaluated using metabonomics data. Furthermore, to make the selected variables reproducible, robust fusion-selection methods were established by applying the previously built fusion methods under the guidance of stability selection theory. The results demonstrated that biomarkers whose temporal trajectories differ significantly among groups can be identified effectively using the methods developed in this thesis.

5. Spatial data fusion

In traditional Chinese medical research, the properties of an object differ across spatial locations, so the analysis of one location deviates from the analysis of the object as a whole. The techniques commonly used to overcome this problem rely on improving the uniformity of the object, but no such measures can be taken efficiently in online analysis and in situ medical examination. For this reason, a spatial data fusion method named maxc-LS-SVM (max count-LS-SVM) was developed based on multi-instance learning theory to provide an accurate prediction. However, maxc-LS-SVM offers only the prediction result itself, with no information on the reliability of the classification. Therefore, a method termed MLE-LS-SVM (maximum likelihood estimation LS-SVM), which provides a probability indicating the reliability of the prediction result, was proposed. The results demonstrated that the MLE-LS-SVM algorithm can further improve the classification accuracy for heterogeneous objects, which provides a new strategy for online analysis and in situ medical examination.

In summary, an array of data fusion methods was proposed in this study to combine high-dimensional data. Simulated data with clear structure and many other datasets were used to investigate the efficiency of the proposed methods. A robust variable selection method was developed based on the temporal fusion methods within the framework of robust selection. Additionally, cross-validation methods designed for small-sample datasets were developed for classification problems. The necessity of variable selection was demonstrated through the study of the mechanism of variable selection methods. In addition, visualization of the contribution of variables in nonlinear learning methods was realized based on the pseudo trace technique. In conclusion, the methods and results obtained in this study lay a foundation for high-dimensional data fusion analysis and evaluation.
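
As an illustration of the balanced cross-validation idea described in item 1, the following is a minimal sketch and not the thesis's implementation. It assumes a binary classification problem and assumes that each LPOCV split holds out one sample from each class, so that the calibration set keeps the same class balance in every split. The function names and the toy data are hypothetical, and the nearest-centroid classifier is only a stand-in for whatever model is being validated.

```python
from itertools import product


def leave_pair_out_splits(labels):
    """Yield (train_idx, test_idx); each test set holds one sample per class.

    Assumption: labels are 0/1 and a "pair" means one sample from each class.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    for p, n in product(pos, neg):
        train = [i for i in range(len(labels)) if i != p and i != n]
        yield train, (p, n)


def nearest_centroid_predict(X, y, train, x_new):
    """Hypothetical stand-in classifier: assign x_new to the closest class mean."""
    centroids = {}
    for cls in (0, 1):
        rows = [X[i] for i in train if y[i] == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda c: sqdist(x_new, centroids[c]))


def lpocv_accuracy(X, y):
    """Fraction of held-out samples classified correctly over all pair splits."""
    correct = total = 0
    for train, test in leave_pair_out_splits(y):
        for i in test:
            correct += nearest_centroid_predict(X, y, train, X[i]) == y[i]
            total += 1
    return correct / total


# Toy data (made up for illustration): three samples per class, two variables.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
y = [0, 0, 0, 1, 1, 1]

accuracy = lpocv_accuracy(X, y)
n_splits = sum(1 for _ in leave_pair_out_splits(y))
print(n_splits, accuracy)
```

Under this assumption, LOOCV on the same data would leave a 2-versus-3 calibration set in every split, whereas each pair split leaves 2 versus 2. Each class needs at least two samples so that both class means can still be computed after a pair is held out.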
Keywords/Search Tags:Structural regularization, Balanced cross-validation, Spatial data fusion, Generalization ability, Screening efficiency, Temporal data fusion