
Study Of Mining Algorithms For Single Cell RNA-Sequencing Data

Posted on: 2021-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: H R He
Full Text: PDF
GTID: 2370330602475019
Subject: Computer application technology
Abstract/Summary:
Single-cell RNA sequencing (scRNA-seq), a high-throughput sequencing technology developed in recent years, measures gene expression at the granularity of a single cell. It yields expression information for thousands of genes per cell, which supports identifying the gene expression characteristics of different cell types and revealing the heterogeneity between cells. However, owing to the limitations of sequencing technology and the high complexity of gene expression, scRNA-seq data are noisy, high-dimensional and highly sparse, so traditional clustering techniques achieve low accuracy when separating cell populations. This thesis studies how to improve the accuracy of cell population identification from scRNA-seq data. After analyzing the problems of data preprocessing, dimensionality reduction and clustering in the traditional scRNA-seq processing pipeline, it proposes dimensionality reduction with an autoencoder. Because a Stacked Denoising Auto Encoder (SDAE) keeps data loss to a minimum and handles noisy data well, two combined dimensionality reduction and clustering methods, SDAE-DBSCAN and SDAE-K-means, are proposed. The experimental results show that the proposed methods reduce the dependence of the original algorithms on their parameters and improve the clustering accuracy of cell populations. The main research contents are as follows:
(1) In the data preprocessing stage, the loss of effective data is reduced by lowering the proportion of data that is filtered out, and L2 regularization is proposed to preprocess the data. This not only reduces the large differences in expression level between genes, but also damps the "strong" features and allows smaller but more characteristic features to emerge.
(2) To address the problem that the contribution rate of the traditional PCA dimensionality reduction method is not concentrated when processing scRNA-seq data, SDAE is used to reduce both the dimensionality and the noise of the data. Noise is added to the original data by randomly setting entries to zero, and the generalization ability of the model is improved by learning the characteristics of the corrupted data. Through feature learning on scRNA-seq data, this method can automatically identify noise points and learn more robust features, which provides better data features for cell clustering and thus improves the identification of cell populations.
(3) To address the problems that traditional clustering algorithms require the number of clusters to be set in advance and have low clustering accuracy, the DBSCAN algorithm is used to cluster the scRNA-seq data. Because the shape and structure of gene expression data in multi-dimensional space are hard to analyze, the K-means algorithm is not guaranteed to be applicable. Moreover, gene expression reflects cell function, and the functional expression of cells of the same type should be continuous within a similar spatial structure. Therefore, DBSCAN is used for cluster analysis. However, the values of Eps and MinPts strongly influence the DBSCAN result, so an improved adaptive algorithm for calculating these parameter values is proposed to raise the clustering accuracy of DBSCAN. For the traditional K-means algorithm, it was found that reducing the dimensionality of scRNA-seq data with SDAE improves the clustering accuracy to a certain extent.
Experiments were carried out on the Deng dataset. The results show that the two deep combined models proposed in this thesis, SDAE-DBSCAN and SDAE-K-means, reach clustering accuracies of 0.97 and 0.93 respectively, which are 0.2 and 0.16 higher than the traditional SC3 model.
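As a rough illustration of three steps named in the abstract (L2 scaling of the expression data, corruption by random zeroing, and choosing Eps for DBSCAN), the following is a minimal sketch in plain Python. It is not the thesis's implementation. The function names, the toy matrix and the drop probability are hypothetical. Reading "L2 regularization" as scaling each cell's vector to unit L2 norm is an assumption. The k-distance heuristic shown for Eps is a common stand-in, since the abstract does not describe the adaptive parameter algorithm itself.

```python
import math
import random


def l2_normalize(cells):
    """Scale each cell's expression vector to unit Euclidean (L2) length."""
    normalized = []
    for row in cells:
        norm = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows untouched to avoid dividing by zero.
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized


def mask_corrupt(cells, drop_prob, rng):
    """Masking noise: zero each entry independently with probability drop_prob.

    A denoising autoencoder would be trained to reconstruct the clean
    input from this corrupted copy (training itself is not shown here).
    """
    return [[0.0 if rng.random() < drop_prob else v for v in row] for row in cells]


def k_distance_eps(points, k):
    """Heuristic Eps: mean distance from each point to its k-th nearest neighbour.

    Assumption: this is a generic k-distance rule, not the adaptive
    algorithm proposed in the thesis. k is usually tied to MinPts.
    """
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sum(k_dists) / len(k_dists)


# Toy matrix: 4 cells x 3 genes (made-up values, for illustration only).
counts = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [6.0, 8.0, 0.0],
]
normalized = l2_normalize(counts)
corrupted = mask_corrupt(normalized, drop_prob=0.3, rng=random.Random(0))
eps = k_distance_eps(normalized, k=1)
```

In a real pipeline the corrupted matrix would feed the SDAE, and the Eps estimate would be computed on the low-dimensional codes the encoder produces rather than on the raw normalized data.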
Keywords/Search Tags: single cell RNA-sequencing, single cell clustering, stacked denoising auto encoder, DBSCAN, K-means