Research Of Classification Problem For High-Dimensional Single-cell RNA-seq Data

Posted on: 2024-07-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J D Zhu
GTID: 1520307340974299
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Single-cell RNA sequencing (scRNA-seq) reveals the gene structure and expression state of individual cells, reflecting cellular heterogeneity and playing a crucial role in developmental biology and other fields. This study focuses on the classification of high-dimensional scRNA-seq data, with particular emphasis on three problems: imputation of missing data, detection of differentially expressed genes, and classification of sequencing data. Real scRNA-seq data are large in volume, high-dimensional, non-negative integer-valued, and rich in zeros. Taking these characteristics into account, we construct data-driven models for the three problems and complete the following three tasks.

(1) To address missing data in scRNA-seq data, we propose the NMFTL method, based on Non-negative Matrix Factorization and Transfer Learning. Because scRNA-seq data contain numerous zero values, we first impute the missing data. Most existing imputation methods rely only on the data available within the dataset itself to infer relationships between genes and cells, and then use these relationships to estimate the missing values. NMFTL imposes constraints on the decomposed feature matrix that account for the relationships between genes and cells in the count matrix, and then adjusts the estimated values using information from other related datasets, following the idea of transfer learning. Compared with existing imputation methods based on low-rank matrix decomposition, it better preserves the internal geometric structure of the dataset and estimates missing values more accurately.

(2) To detect differentially expressed genes in scRNA-seq data, we propose the scMEB method, based on the single-cell Minimum Enclosing Ball. Even after imputation the dataset remains high-dimensional, and in practice a large portion of genes contribute little to cell classification. The differentially expressed genes selected by scMEB are therefore used as feature genes, which transforms the original high-dimensional classification problem into a low-dimensional one. The method builds a model from a subset of non-differentially expressed genes (i.e., stably expressed genes) and identifies differentially expressed genes by the distance from each mapped gene in the feature space to the center of the hypersphere. It can select differentially expressed genes when the sequencing data contain a large number of cells and the specific cell labels are unknown. Compared with existing methods, the proposed approach performs better in cell clustering, prediction of biologically functional genes, and identification of marker genes. Most importantly, it reduces computation time from a few minutes to tens of seconds, making it more efficient for processing high-throughput data.

(3) To address the classification of scRNA-seq data, we propose the ZINBLDA method, based on Zero-Inflated Negative Binomial Logistic Discriminant Analysis. The expression matrix formed by the selected differentially expressed genes is the final input for classification. Given the numerous zero values and overdispersion typical of scRNA-seq data, we assume that the sequencing data follow a zero-inflated negative binomial distribution and propose ZINBLDA on that basis. To address the choice of an appropriate classifier when the distribution of the actual sequencing data is unknown, we analyze simulated datasets generated with different parameters and build a decision tree and a random forest model that guide the selection of the optimal classifier from the features of the actual data.

To validate the three proposed algorithms, we compare them with existing methods on both simulated and real datasets using a range of evaluation metrics. The results show that the proposed methods outperform existing methods on various performance measures.
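
The zero-inflated negative binomial assumption in task (3) can be illustrated with a minimal sketch. The abstract does not give the parameterization or the estimation procedure used by ZINBLDA, so everything below is an assumption made for illustration: the mean/dispersion form of the negative binomial (`mu`, `r`), the extra-zero probability `pi`, the independence of genes, and the hypothetical helper names `nb_log_pmf`, `zinb_log_pmf`, and `classify`. Per-class parameters are supplied by hand rather than estimated from data.

```python
import math

def nb_log_pmf(y, mu, r):
    """Log pmf of a negative binomial with mean mu > 0 and dispersion r > 0
    (variance mu + mu**2 / r). Parameterization is an assumption."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

def zinb_log_pmf(y, mu, r, pi):
    """Log pmf of a zero-inflated NB: with probability pi (0 <= pi < 1) the
    count is a structural zero, otherwise it is drawn from the NB."""
    nb = nb_log_pmf(y, mu, r)
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb

def classify(x, class_params, priors):
    """Return the class label with the highest log prior plus summed
    per-gene ZINB log-likelihood (genes treated as independent)."""
    best_label, best_score = None, -math.inf
    for label, params in class_params.items():
        score = math.log(priors[label])
        for y, (mu, r, pi) in zip(x, params):
            score += zinb_log_pmf(y, mu, r, pi)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy example: two genes, two classes, hand-picked (mu, r, pi) per gene.
params = {
    "A": [(1.0, 2.0, 0.3), (20.0, 2.0, 0.1)],
    "B": [(15.0, 2.0, 0.1), (2.0, 2.0, 0.3)],
}
priors = {"A": 0.5, "B": 0.5}
print(classify([0, 25], params, priors))  # prints "A"
```

In this toy setup a cell with counts `[0, 25]` is assigned to class "A", whose first gene is mostly silent and whose second gene is highly expressed. The zero-inflation term matters for the first gene: a zero count is far more probable under the mixture than under the negative binomial alone.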
Keywords/Search Tags: scRNA-seq, Differential expression, Imputation, Classification, Transfer learning, Minimum enclosing ball, Zero-inflated negative binomial distribution