
Research On Principal Component Analysis Method And Its Application In Cancer Omics Data

Posted on: 2021-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: M J Wu
Full Text: PDF
GTID: 2430330605463060
Subject: Computer application technology
Abstract/Summary:
In modern molecular biology, sequencing data, which carry a wealth of information on biological activity, are increasingly used to identify and diagnose diseases. However, the "curse of dimensionality" that comes with such data prevents traditional data processing methods from mining biological information effectively. Moreover, the development of cancer is usually marked by only a small number of genes with altered expression levels. Selecting a small number of key cancer-related genes from high-dimensional, redundant sequencing data is therefore an important challenge. Principal Component Analysis (PCA) has attracted wide attention as a core method for processing such data. It reduces the complexity of the data by projecting them onto a lower-dimensional set of principal components while retaining as much of the information in the data as possible, which eases the curse of dimensionality. This thesis improves existing PCA methods and applies them to sequencing data from The Cancer Genome Atlas (TCGA). Mining and analyzing cancer data helps to clarify the potential connections between genes and complex diseases, which lays a foundation for disease prevention and gene-targeted therapy. The research is divided into the following four parts:

(1) A graph-Laplacian PCA method based on the capped L1 norm (CgLPCA) is proposed. This method introduces the capped L1 norm and graph-Laplacian regularization into the PCA objective function. The capped L1 norm reduces the effect of noise and outliers by placing an upper limit on the value each term can contribute. Graph-Laplacian regularization, a non-linear manifold learning structure, captures the low-dimensional structural information embedded in the high-dimensional space, which makes the mined information more accurate and comprehensive.

(2) A graph-Laplacian PCA method based on double sparse constraints (GDSPCA) is proposed. This method introduces double sparse constraints (the L1 and L2,1 norms) and graph-Laplacian regularization into PCA. The double sparse constraints induce row sparsity and reveal the actual contribution of each variable in the original space, which improves the interpretability of the principal components in the low-dimensional space. In addition, graph-Laplacian regularization ensures that the geometric structure hidden in the data is captured, which further improves the accuracy of the algorithm. Experiments on a multi-view cancer dataset show that this method can uncover potential connections between different cancers and genes.

(3) A robust PCA method based on hypergraph regularization (HRPCA) is proposed. This method improves robustness to outliers by applying the L2,1 norm to the PCA loss term. At the same time, hypergraph regularization is used to mine complex high-order relationships among the data and to preserve the modular structure of the data, which improves the accuracy of the information the algorithm mines. With this method, the accuracy of both sample clustering and the identification of common feature genes improves to a certain extent.

(4) An integrated PCA method based on hypergraph regularization (IHPCA) is proposed. By combining the data processing capability of PCA with the high-order mapping of hypergraph regularization, this method integrates multi-omics data representing different aspects of cancer into a unified model framework in order to discover results relevant to predicting cancer phenotypes. Experimental results on cancer multi-omics data show that this method helps to find differentially expressed genes for different cancer types, which in turn supports gene-targeted therapy for cancer.

To verify the effectiveness of the four proposed PCA methods in identifying differentially expressed genes, they were applied to various cancer gene expression and multi-omics datasets and compared with other advanced methods. The experimental results show that the proposed methods have advantages over similar methods and can find differentially expressed genes closely related to disease.
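
The building blocks named in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, the per-row form of the capped L1 penalty, and the binary k-nearest-neighbour graph are all assumptions chosen for illustration, and the actual objective functions and optimization procedures of CgLPCA, GDSPCA, HRPCA and IHPCA are not reproduced here.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X (samples x features) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # k x features, orthonormal rows
    scores = Xc @ components.T                   # samples x k embedding
    return scores, components


def l21_norm(M):
    """L2,1 norm: sum of the Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


def capped_l1(residuals, cap):
    """Assumed capped form: each row's Euclidean norm is clipped at `cap`, then summed."""
    row_norms = np.linalg.norm(residuals, axis=1)
    return float(np.sum(np.minimum(row_norms, cap)))


def knn_graph_laplacian(X, n_neighbors):
    """Unnormalized Laplacian L = D - W of a symmetric, binary kNN graph over the rows of X."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                        # a sample is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                              # symmetrize
    return np.diag(W.sum(axis=1)) - W


def laplacian_penalty(Q, L):
    """Graph smoothness term tr(Q^T L Q) for an embedding Q (samples x k)."""
    return float(np.trace(Q.T @ L @ Q))
```

On a samples-by-genes matrix `X`, `pca_project(X, k)` gives the low-dimensional embedding, `knn_graph_laplacian(X, n_neighbors)` gives a graph over the samples, and `laplacian_penalty` measures how much the embedding varies between neighbouring samples. A regularized PCA objective would trade such a term off against a reconstruction loss measured with a robust norm such as `l21_norm` or `capped_l1`.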
Keywords/Search Tags: Principal Component Analysis, Manifold learning, Feature selection, Sample clustering, Low-dimensional embedding, Gene expression network