
Missing Value Imputation Study For Typical High-throughput Omics Data

Posted on: 2022-11-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Tang
GTID: 1480306764461834
Subject: Control Science and Engineering
Abstract/Summary:
With the development and widespread application of high-throughput sequencing technologies, large volumes of high-throughput omics data of many types are accumulating. DNA methylation and gene expression data, two of the most common types, have received particularly wide attention from researchers. These data have helped to reveal the regulatory relationship between epigenetic modifications and gene transcription, which is important for the study of many complex human diseases. However, because of technical limitations and economic costs, experimentally measured high-throughput DNA methylation and gene expression data often have large numbers of missing values. For example, the 450K methylation microarray used in bulk assays covers less than 2% of the CpG loci in the whole genome, and DNA methylation and gene expression data from single-cell assays may have missing rates of up to 90% in some samples. Such missing data not only diminish the value of available high-throughput omics data but also limit the potential of experimental data in downstream analysis. Meanwhile, existing imputation methods suffer from inadequate extraction of predictive features, insufficient use of related data, and inadequate consideration of reference label information for cell types. This dissertation therefore focuses on the specific missing value problems of methylation microarray, single-cell methylation, and single-cell gene expression data and on the limitations of the related imputation methods, and investigates missing value imputation models that are tightly coupled with the data objects and their applications, in order to help mine and enhance the value of existing high-throughput data. The main research contributions of this dissertation are as follows:

(1) To address the low CpG coverage of the 450K methylation microarray platform and the inadequate feature extraction of existing methylation array imputation models, we proposed the PretiMeth imputation method, which constructs a locus-specific precise prediction model for each CpG locus of interest. PretiMeth performs co-methylation correlation analysis among CpG loci across a large number of samples from different tissues and disease types, matches each locus to be predicted with another locus that has a highly similar methylation pattern across samples as its robust methylation marker, and builds the locus-specific prediction model for that locus from this single signature factor alone. Model performance validation and case studies on expanding TCGA cancer 450K data showed that PretiMeth effectively learns the locus-specific methylation patterns of individual CpG loci, significantly outperforms other chip imputation methods in prediction accuracy, and can help enhance the value of existing methylation chip data.

(2) To address the extremely low CpG coverage of single-cell methylation sequencing data and the difficulty existing single-cell methylation imputation models have in effectively using methylation information from other cells and from bulk sequencing data, we proposed the CaMelia imputation method, which predicts missing methylation states from the local pairwise similarity of methylation patterns between cells. CaMelia constructs a conductive methylation signature based on the local similarity of methylation patterns between data units, which can adaptively share methylation information between individual cells or between cells and bulk data. By borrowing information from other data units, CaMelia integrates each cell's own information with external information to improve imputation. Cross-validation and case studies in downstream analysis showed that CaMelia achieves better imputation performance than previous methods and can improve the results of downstream differentially methylated locus identification and clustering analysis.

(3) To address the excessive zero counts in single-cell RNA sequencing data and the insufficient imputation performance of existing single-cell RNA-seq imputation models when cell type labels are known, we proposed the scIDG imputation method, which is based on a supervised deep generative model and aims to accurately recover heterogeneous gene expression across cell types. scIDG robustly learns the intrinsic feature representations and data distribution of single-cell gene expression data using a deep autoencoder and a generative adversarial training strategy. It further learns biologically meaningful feature representations for different cell types by constraining the generator with real cell label information through an auxiliary classifier, and thereby accurately controls the recovery of gene expression dynamics in each cell type. Case studies on simulated and real data showed that scIDG can effectively use cell label information to improve imputation results and accurately recover the heterogeneity of different cell types, improving the results of downstream differential expression analysis and temporal trajectory inference.
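
The single-marker idea described for the first method in item (1) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not state the similarity measure or the model form, so Pearson correlation and an ordinary least-squares line are assumptions here, the function names (`pearson`, `best_marker`, `fit_line`, `predict`) are hypothetical, and the data are synthetic.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_marker(target, candidates):
    """Index of the candidate locus most strongly correlated with the target locus."""
    scores = [abs(pearson(c, target)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def fit_line(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept (assumed model form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def predict(marker_value, slope, intercept):
    """Predicted methylation level, clipped to the valid range [0, 1]."""
    return min(1.0, max(0.0, slope * marker_value + intercept))


# Synthetic data: 3 candidate (measured) loci across 50 samples.
rng = random.Random(0)
candidates = [[rng.random() for _ in range(50)] for _ in range(3)]
# The locus to be predicted tracks candidate 1, plus small measurement noise.
target = [0.8 * v + 0.1 + rng.gauss(0, 0.02) for v in candidates[1]]

idx = best_marker(target, candidates)                 # pick the co-methylated marker
slope, intercept = fit_line(candidates[idx], target)  # locus-specific model
imputed = predict(0.5, slope, intercept)              # impute from a marker value of 0.5
```

In this sketch each target locus gets its own model with a single predictor, which mirrors the "single signature factor" wording of the abstract. A real pipeline would fit on training samples where both loci are measured and apply the model only where the target locus is missing.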
Keywords/Search Tags: DNA Methylation, Gene Expression, Computational Prediction, Missing Value Imputation, Bioinformatics