| Cell deconvolution infers the cell composition of a tissue, that is, it predicts which cell types are present and in what proportions. With the development of transcriptome sequencing (RNA-Seq), cell deconvolution from RNA-Seq data has been studied extensively. It can be used to explore disease pathogenesis, analyze the tumor microenvironment, and study tissue development, which makes it an essential tool. RNA-Seq includes single-cell RNA sequencing (scRNA-Seq) and bulk RNA sequencing (bulk RNA-Seq). scRNA-Seq profiles the transcriptome of individual cells, but it is impractical for sequencing large numbers of samples to obtain tissue cell composition. Bulk RNA-Seq measures gene expression for the entire tissue. Combining bulk RNA-Seq and scRNA-Seq data for cell deconvolution has therefore become an important trend. Current approaches to cell deconvolution include experimental methods based on cell-type quantification and computational methods based on sequencing data. They face several difficulties: a lack of bulk RNA-Seq data with known cell proportions, highly sparse and biased RNA-Seq data, poor specificity of cell-type-specific gene expression reference matrices, and poor prediction of unknown and similar cell types. The emergence of machine learning and deep learning offers new solutions to these problems. This thesis therefore constructs convolutional neural network, convolutional autoencoder, and light gradient boosting machine models, and uses them to improve computational methods based on sequencing data.
1. This thesis proposes and designs Aptcr (Cell Deconvolution Algorithm using Automatically Predict Tissue Cell Ratios), an algorithm that uses a convolutional neural network to automatically predict tissue cell proportions. To address the lack of bulk RNA-Seq training data, Aptcr first simulates bulk RNA-Seq data from scRNA-Seq data, which makes deep learning methods applicable. Highly sparse, biased data also make feature extraction difficult. The convolutional neural network does not require complex feature screening or dimensionality reduction; its nodes can effectively mine the internal relationships between genes, and the model can learn noise-resistant features. Finally, Aptcr infers tissue cell proportions directly from bulk RNA-Seq data without requiring a cell-type-specific gene expression matrix, so it generalizes well across datasets. Compared with existing advanced algorithms, Aptcr achieves a Pearson correlation coefficient of 0.903, the highest among all methods. On real samples, its root mean square error improved by 6.0% over the second-best algorithm, CPM, indicating strong feature extraction ability and high prediction accuracy.
2. This thesis proposes and designs Cdaca (Cell Deconvolution Algorithm using Convolutional Autoencoder), a cell deconvolution algorithm based on a convolutional autoencoder. To address the insufficient extraction of feature information in traditional methods, Cdaca reduces dimensionality and extracts features from the data efficiently, thereby improving deconvolution performance. Cdaca also uses both tissue cell proportion data and gene expression data to optimize the model. This links the gene expression data of the tissue to its cell components, strengthening both the interpretability of the model and its ability to extract features. Cdaca can also distinguish similar cell subtypes. Compared with other algorithms, Cdaca obtained the best Lin’s concordance correlation coefficient on the microarray dataset, 0.72, which is 10% higher than that of the DWLS method, and the best mean absolute error, 0.10. Cdaca thus improved the accuracy of cell proportion prediction.
3. This thesis proposes and designs Dlgbm (Cell Deconvolution Algorithm using Point by Point Separation Strategy and Light Gradient Boosting Machine), a cell deconvolution algorithm that combines a point-by-point separation strategy with a light gradient boosting machine. Most cell deconvolution algorithms predict all cell types simultaneously, which ignores the distinct relationship between each cell type and the gene expression data and does not account for the heterogeneity of each cell type. Dlgbm instead uses a point-by-point separation strategy: it infers the proportion of each cell type one at a time from the gene expression data, making a targeted prediction for each cell type. Dlgbm uses a light gradient boosting machine to iteratively fit the differences between actual and predicted values, achieving efficient, highly targeted, and high-precision prediction of cell proportions. Evaluated on the microarray dataset, Dlgbm obtained the highest Lin’s concordance correlation coefficient, 0.78. |
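
Two ideas from the abstract can be illustrated with a minimal sketch: simulating bulk RNA-Seq samples with known cell proportions from scRNA-Seq data, and the four evaluation metrics quoted above (Pearson correlation, root mean square error, mean absolute error, Lin’s concordance correlation coefficient). This is not the thesis's implementation. The function names, the Dirichlet sampling of target proportions, and the toy Poisson count matrix are all assumptions made for illustration.

```python
import numpy as np


def simulate_pseudo_bulk(sc_expr, cell_labels, n_types, n_samples,
                         cells_per_sample, rng):
    """Build pseudo-bulk samples by summing randomly drawn single cells.

    sc_expr: (n_cells, n_genes) expression matrix.
    cell_labels: (n_cells,) integer cell-type labels in [0, n_types).
    Returns (bulk, props): summed expression and the true proportions.
    Hypothetical helper; assumes every cell type has at least one cell.
    """
    bulk = np.zeros((n_samples, sc_expr.shape[1]))
    props = np.zeros((n_samples, n_types))
    idx_by_type = [np.flatnonzero(cell_labels == k) for k in range(n_types)]
    for s in range(n_samples):
        # Assumption: target proportions are drawn from a flat Dirichlet.
        target = rng.dirichlet(np.ones(n_types))
        counts = rng.multinomial(cells_per_sample, target)
        for k, c in enumerate(counts):
            if c > 0:
                chosen = rng.choice(idx_by_type[k], size=c, replace=True)
                bulk[s] += sc_expr[chosen].sum(axis=0)
        # The realized proportions are the labels a model would be trained on.
        props[s] = counts / cells_per_sample
    return bulk, props


def pearson(y, p):
    return float(np.corrcoef(y.ravel(), p.ravel())[0, 1])


def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))


def mae(y, p):
    return float(np.mean(np.abs(y - p)))


def lin_ccc(y, p):
    """Lin's concordance correlation coefficient (population moments)."""
    y, p = y.ravel(), p.ravel()
    cov = np.mean((y - y.mean()) * (p - p.mean()))
    return float(2 * cov / (y.var() + p.var() + (y.mean() - p.mean()) ** 2))


# Toy data: 300 cells, 50 genes, 3 cell types (purely synthetic).
rng = np.random.default_rng(0)
n_cells, n_genes, n_types = 300, 50, 3
labels = np.arange(n_cells) % n_types
sc_expr = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)

bulk, props = simulate_pseudo_bulk(sc_expr, labels, n_types,
                                   n_samples=20, cells_per_sample=100, rng=rng)

# A constant offset leaves Pearson at 1 but lowers Lin's coefficient,
# which is why the two metrics are reported separately.
shifted = props + 0.1
print(pearson(props, shifted), lin_ccc(props, shifted))
print(rmse(props, shifted), mae(props, shifted))
```

Unlike Pearson correlation, Lin’s coefficient penalizes systematic offsets between predicted and true proportions, so a predictor that is perfectly correlated but biased still scores below 1.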