Research on Application Strategies and an Imputation Fusion Method for Missing Data in Gene Expression Profiling

Posted on: 2018-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: X J Wu
GTID: 2334330518967883
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Background: Gene expression profiling data contain a large number of missing values, which seriously affect the accuracy of subsequent analyses. How to estimate the missing values effectively from the characteristics of the existing data, how to construct imputation strategies and methods, and how to evaluate the impact of imputation on subsequent analyses of gene expression profiles are research questions of scientific significance. They are also a difficult problem in data analysis in statistics and bioinformatics. Solving them would allow more accurate missing-value estimates and better analysis strategies to further improve the performance of analytical techniques, so that researchers can diagnose more effectively and make better use of the information in gene expression profiles.

Method: This study used theoretical and literature research methods from statistics, computer science and biomedicine to explore and confirm the main content of the subject. Support Vector Regression Nonparametric Multiple Imputation (SVR-NPMI) and the MissForest nonparametric method were used to estimate and impute missing values in gene expression data of different sequence types and different missing proportions under six missing-data mechanisms. The results were compared with the K-nearest neighbor (KNN) method, Bayesian principal component analysis (BPCA) and multiple imputation (MI). Based on the performance of the different methods and on established imputation principles, imputation strategies were constructed for different sequence data sets, missing mechanisms and missing proportions, and the biological effects of the imputation methods on subsequent analyses of gene expression profiles were also elucidated.

Result: (1) Five methods were used to impute missing expression data sets with different characteristics. Comparative analysis showed that the normalized root mean square error (NRMSE) rose as the missing proportion increased. For the time-series breast cancer data set with 20% of values randomly missing, the NRMSE values of BPCA, KNN, the MissForest nonparametric method, Monte Carlo multiple imputation and SVR-NPMI were 0.1810, 0.3874, 0.0780, 0.0917 and 0.0744. For the non-time-series liver cancer data set with a missing proportion of 30%, the NRMSE values of the five methods were 0.2877, 0.3335, 0.2018, 0.2550 and 0.1621. For the non-time-series lymphatic cancer data set with a missing proportion of 10%, the NRMSE values of the five methods were 0.8762, 0.8753, 0.8972, 0.8811 and 0.9797. Overall, SVR-NPMI was the most stable and gave the best imputation, followed by the MissForest nonparametric method and MI, while KNN performed worst. (2) The Conserved Pairs Proportion (CPP) tended to decrease as the missing proportion of the data sets increased; that is, the greater the missing proportion, the worse the result of the subsequent cluster analysis. An inappropriate imputation method can mislead subsequent studies of the expression profiles. Among the imputation methods, SVR-NPMI was the most robust, and imputing the data sets with SVR-NPMI was superior to the other four methods. (3) The example analyses were used to summarize imputation strategies for missing data in different gene expression profiles. SVR-NPMI imputed well under all of the influencing factors considered, but it has high computational complexity and cost. The performance of the MissForest nonparametric method depends on the characteristics of the data set, and it gives better results when the expression profile has few genes and many experimental conditions. Whether BPCA and KNN perform well depends on the choice of their important parameters.

Conclusion: In this study, the SVR-NPMI fusion method and the MissForest nonparametric imputation method developed and enriched the imputation models for missing data in gene expression profiles, promoting the development of new methods in bioinformatics analysis. The work provides a methodological reference for such analyses and has important theoretical value. The imputation algorithm for missing gene expression profile data constructed here for the first time, together with the "gene expression profiling missing data imputation analysis system" that was developed, can help researchers choose an imputation method for their data set more quickly and reliably, and makes data analysis more convenient by providing reference and services.
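
The two evaluation measures named in the results, NRMSE and the Conserved Pairs Proportion, can be illustrated with a minimal Python sketch. The abstract does not give either formula, so the definitions below are assumptions: NRMSE is taken as the root mean square error over the masked entries divided by the standard deviation of their true values, and CPP is taken in a pair-based reading, as the fraction of gene pairs clustered together before masking that are still together after imputation. The function names, the random masking helper and the row-mean imputer are hypothetical stand-ins for illustration only; they are not the SVR-NPMI, MissForest, BPCA, KNN or MI procedures used in the thesis.

```python
import math
import random


def nrmse(true_vals, imputed_vals):
    """Assumed definition: RMSE over masked entries / std. dev. of the true values."""
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    if var_true == 0:
        raise ValueError("true values are constant; NRMSE is undefined")
    return math.sqrt(mse / var_true)


def mask_at_random(matrix, proportion, seed=0):
    """Hide a given proportion of cells; return (masked copy, hidden positions)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, round(proportion * len(cells)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def row_mean_impute(masked):
    """Hypothetical baseline imputer: fill each gap with its gene's observed mean."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        # A fully masked gene has no observed mean; fall back to 0.0.
        fill = sum(observed) / len(observed) if observed else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled


def conserved_pairs_proportion(ref_labels, new_labels):
    """Assumed pair-based CPP: share of co-clustered gene pairs that stay together."""
    n = len(ref_labels)
    together = kept = 0
    for a in range(n):
        for b in range(a + 1, n):
            if ref_labels[a] == ref_labels[b]:
                together += 1
                kept += new_labels[a] == new_labels[b]
    return kept / together if together else 1.0
```

Under these assumptions, a comparison like the one reported above would mask a complete expression matrix at each missing proportion, impute it with each method, compute `nrmse` on the hidden cells, then cluster the complete and imputed matrices and pass the two label lists to `conserved_pairs_proportion`.
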
Keywords/Search Tags:Gene expression profiling, Missing data, Nonparametric Multiple Imputation, Strategies