
Research On Dropout Value Imputation And Classification Based On Biological Data

Posted on: 2022-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: W J Cao
Full Text: PDF
GTID: 2480306779988979
Subject: Computer Software and Application of Computer
Abstract/Summary:
High-throughput single-cell transcriptome sequencing (scRNA-seq) can characterize the transcriptional landscape at single-cell resolution and better reveal the diversity of unknown organisms. However, because of the small number of RNA transcripts, the randomness of gene expression patterns, and the low efficiency of cell capture and sequencing, the data contain a great deal of technical and biological noise. As a result, single-cell transcriptome sequencing data are high-dimensional and sparse, with many missing values and batch effects. The large number of dropout values in particular obscures important relationships between genes and hinders downstream analysis. Imputing the missing values in single-cell transcriptome sequencing data, and classifying cell types accurately and rapidly across large numbers of cells, therefore remain major challenges. Working with biological data, this thesis studies missing value imputation and classification, covering the following four aspects:

1. Dropout imputation methods for single-cell data are surveyed and studied. First, existing methods from China and abroad are classified, analyzed and compared from different angles. By comparing their principles, advantages and disadvantages, the thesis offers suggestions for choosing an imputation method for a specific problem and dataset, which is of basic research significance for downstream functional analysis. Then a deep neural network model (AMEDNN), based on multiple encoders and decoders and on attention networks, is proposed to address sparse data with many missing values. Imputation experiments on six datasets verify the validity of the model.

2. Classification algorithms for biological sequence data are studied. A new algorithm, SS-RNN, is proposed to address the insufficient memory capacity of RNNs and the difficulty of gradient backpropagation. The algorithm predicts the information at the current moment directly from multiple historical moments, which strengthens long-term memory and improves the correlation between states along the time dimension. To incorporate historical information, two processing methods are designed for SS-RNN, one continuous and one discontinuous. For each method there are two ways to add historical information: 1) adding it directly, and 2) passing it through a weighted mapping and an activation function. The algorithm provides six ways to explore comprehensively and in depth the influence of historical information on RNNs. It was tested on five disease-related datasets of different sizes and data types. Compared with the original LSTM, GRU and BiLSTM, and with the more recent RNN + GRU, RNN + LSTM and MCNN, the results show that the method can significantly improve the classification accuracy of sequence data. In addition, the best way to add past information may be to add it directly in a discontinuous fashion. The method can effectively solve the problems of exploding and vanishing gradients. Model performance is also correlated to some degree with the order.

3. Classification of single-cell transcriptome sequencing data based on ensemble learning is studied. Six classification models based on ensemble learning are proposed: EL-KNN, EL-LDA, EL-SVM, EL-NB, EL-DT and EL-HW. They combine data-based random sampling with majority voting and weighted voting to integrate the probabilistic predictions of machine learning methods. Experiments on seven typical datasets generated by different sequencing platforms verify the practicability and feasibility of the method.

4. scGATv2, a dropout imputation and clustering algorithm for single-cell transcriptome sequencing data based on a graph attention network, is proposed. scGATv2 explores the potential relationships between cells by iterating three different autoencoders. The main innovation is the incorporation of a graph-based attentional variational autoencoder, which preserves the original topology while reducing the dimensionality of single-cell sequencing data. The attention mechanism can also automatically learn and optimize the connections between cells, so that the learned data can be embedded in a low-dimensional space with a higher signal-to-noise ratio. Imputation and clustering experiments on four datasets, compared against a large number of existing methods, show that the model performs better and improves clustering accuracy, which is of basic research significance for downstream functional analysis.

Through these four studies, missing value imputation and classification of biological data are accomplished. Imputation can restore the transcriptome dynamics masked by missing data. Filling in dropout values may enhance the clustering of cell subpopulations, improve the accuracy of differential expression analysis, and contribute to the study of gene expression dynamics. The classification and clustering of single cells and the discovery of new cell types are of great significance for research in oncology, immunology and developmental biology.
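
The ensemble scheme in item 3 (random sampling of the data, then majority voting and weighted voting over probabilistic predictions) can be illustrated with a small Python sketch. This is not the thesis's EL-KNN implementation. The stratified bootstrap, the toy k-nearest-neighbour base learner, the accuracy-based model weights, and all function names below are assumptions chosen only to make the two voting rules concrete.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap(X, y, rng):
    """Resample rows with replacement inside each class (assumed sampling scheme)."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    idx = [rng.choice(rows) for rows in by_class.values() for _ in rows]
    return [X[i] for i in idx], [y[i] for i in idx]


def knn_proba(X_train, y_train, x, k=3):
    """Class probabilities from the k nearest rows (squared Euclidean distance)."""
    nearest = sorted(
        zip(X_train, y_train),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}


def majority_vote(probas):
    """Each base model casts one vote for its most probable class."""
    votes = Counter(max(p, key=p.get) for p in probas)
    return votes.most_common(1)[0][0]


def weighted_vote(probas, weights):
    """Sum the class probabilities of all models, scaled by a per-model weight."""
    scores = defaultdict(float)
    for p, w in zip(probas, weights):
        for label, prob in p.items():
            scores[label] += w * prob
    return max(scores, key=scores.get)


def ensemble_predict(X, y, x, n_models=5, k=3, seed=0):
    """Train n_models base learners on resampled data and combine them both ways."""
    rng = random.Random(seed)
    probas, weights = [], []
    for _ in range(n_models):
        Xs, ys = stratified_bootstrap(X, y, rng)
        probas.append(knn_proba(Xs, ys, x, k))
        # Assumed weighting: accuracy of this base model on the full training set.
        hits = 0
        for row, label in zip(X, y):
            p = knn_proba(Xs, ys, row, k)
            hits += max(p, key=p.get) == label
        weights.append(hits / len(X))
    return majority_vote(probas), weighted_vote(probas, weights)
```

The two rules can disagree. Given three base models with probabilities {A: 0.6, B: 0.4}, {A: 0.55, B: 0.45} and {A: 0.1, B: 0.9}, majority voting picks A (two votes to one), while equally weighted voting picks B, because the third model is far more confident. Lowering that model's weight moves the weighted result back to A.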
Keywords/Search Tags:scRNA-seq, Recurrent neural network, Graph attention network, Dropout, Imputation, Classification, Cluster