Research On Missing Value Imputation For Microarray Gene Expression Data

Posted on: 2012-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Li
Full Text: PDF
GTID: 2210330362950587
Subject: Information and Communication Engineering
Abstract/Summary:
DNA microarray technology is a mature and widely used biochip technology that can simultaneously measure the mRNA levels of thousands of genes under given experimental conditions. However, microarray gene expression data generally suffer from missing values, which arise for a variety of experimental reasons. Public microarray datasets contain missing values to varying degrees, and these can adversely affect downstream analysis. Simply deleting the genes with missing values discards much useful information, while repeating the experiments is expensive and time-consuming. This thesis studies missing value imputation from the following aspects.

Firstly, the Bayesian principal component analysis (BPCA) imputation method is studied; it is based on the global correlation information in the data set. It consists of three elementary processes: principal component regression, Bayesian estimation and an iterative algorithm. The missing values and the model parameters are updated alternately until convergence, which yields the estimates of the missing values.

Secondly, following the principle of gene co-expression, this thesis uses the local similarity structure in the data set to study the K-nearest neighbors (KNN) imputation method and the local least squares (LLS) imputation method. Both share the problem that estimation accuracy declines when the missing rate is high. Improved methods are proposed, which expand the range of candidate genes by pre-filling eligible genes and estimate missing values according to the missing rate of each gene. Experiments show that the proposed algorithms improve the effectiveness of imputation significantly.

Thirdly, imputation methods based on domain knowledge are studied. The common theme of algorithms in this category is the integration of domain knowledge or external information into the imputation process. For example, histone acetylation may alter chromatin structure and provide binding surfaces for transcription factors, so histone acetylation and gene expression datasets are combined to select the neighbor genes used to estimate missing values.

Finally, validation of imputation results is an important step in assessing the performance of imputation algorithms. This thesis focuses on internal validation, using indices derived from statistical calculations and clustering methods. These indices are also applied to datasets containing differentially expressed genes. In summary, this work investigates the accuracy and the range of application of the various missing value imputation algorithms.
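The K-nearest neighbors imputation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation or its improved variant. It assumes a genes-by-samples matrix stored as a list of lists with `None` marking missing entries, a distance computed only over the columns both genes have observed, and inverse-distance weighting of the neighbors; the function names `knn_impute` and `nrmse` are hypothetical. The normalized root mean squared error is included only as one commonly used statistical index for internal validation; the abstract does not name the specific indices it uses.

```python
import math


def knn_impute(data, k=2):
    """Fill None entries of a genes x samples matrix from the k most similar genes.

    Illustrative sketch: distances use only columns observed in both genes,
    and neighbors are combined by inverse-distance weighting (assumptions).
    """
    filled = [row[:] for row in data]  # copy so the input stays unchanged
    for i, target in enumerate(data):
        observed = [c for c, v in enumerate(target) if v is not None]
        missing = [c for c, v in enumerate(target) if v is None]
        for j in missing:
            candidates = []
            for g, other in enumerate(data):
                # A neighbor must be a different gene with column j observed.
                if g == i or other[j] is None:
                    continue
                shared = [c for c in observed if other[c] is not None]
                if not shared:
                    continue
                # Root mean squared difference over the shared columns.
                dist = math.sqrt(
                    sum((target[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable neighbor: leave the entry missing
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            weights = [1.0 / (d + 1e-6) for d, _ in nearest]
            weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = weighted_sum / sum(weights)
    return filled


def nrmse(true_values, imputed_values):
    """Normalized RMSE: sqrt(mean squared error / variance of the true values).

    Requires at least two distinct true values so the variance is nonzero.
    """
    n = len(true_values)
    mean = sum(true_values) / n
    variance = sum((t - mean) ** 2 for t in true_values) / n
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    return math.sqrt(mse / variance)
```

In a typical internal-validation setup, known entries of a complete matrix are masked, imputed with `knn_impute`, and then compared with the held-out values using `nrmse`; a value near 0 indicates accurate imputation, while a value near 1 is no better than filling in the mean.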
Keywords/Search Tags: missing value imputation, global structure, similarity structure, domain knowledge