
Research on Cancer-Related Gene Set Identification and Progression Inference Methods

Posted on: 2020-04-15 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: W Zhang
GTID: 1364330623451694 | Subject: Computer Science and Technology
Abstract/Summary:
One of the goals of cancer genomics research is to identify all cancer-related genes and explain their contribution to cancer initiation and progression. The rapid development of high-throughput sequencing technology has produced a large amount of cancer genome data, which facilitates cancer genomics research. Research in this area faces two challenging questions: (1) Which genes drive cancer progression? (2) How can cancer progression be analyzed at the gene level and the pathway level? Answering these two questions is crucial for treatment decisions involving targeted drugs. This thesis takes real cancer genome data as its research object and studies methods for identifying cancer-related gene sets and inferring cancer progression. Although many methods have been proposed to identify cancer-related gene sets, how to separate driver mutations from passenger mutations, detect rare mutations, identify driver pathways, and find key genes quickly and effectively remains a challenging problem in cancer genomics. The large volume of tumor somatic mutation data makes it possible to analyze the pathogenesis of tumors in depth. However, most of these data are cross-sectional rather than temporal, so it is difficult to infer the timing of gene mutations during cancer progression. In addition, because of the heterogeneity between patients, inferring cancer progression is more robust at the driver pathway level than at the individual driver gene level. In view of the problems and limitations of existing algorithms, the main work of this thesis is as follows:

(1) Existing methods require information about gene or protein interactions to build genetic networks. However, because the human interactome is currently incomplete, the conclusions of these methods may be biased. To solve this problem, iKGGE (An Efficient Strategy for Identifying Cancer-related Key Genes based on Graph Entropy) is proposed. It is a graph-entropy-based strategy that identifies key cancer-related genes by combining gene expression data and gene mutation data. First, a genetic network is constructed from a sparse inverse covariance matrix, using only gene expression data. Then, the parallel maximum cluster algorithm is used to cluster the genes, which rapidly yields a series of subgraphs. Finally, a new indicator is introduced to measure the influence of each gene by combining graph entropy with the effect of upstream gene mutations. Tests on three existing cancer datasets show that the strategy effectively extracts key genes that may play different roles in tumor development, and that these key genes predict cancer patients' risk groups well.

(2) Identification of cancer driver genes is crucial for personalized therapy. To improve the accuracy of identifying cancer driver mutations, iPDG (A Novel Method for Identifying the Potential Cancer Driver Genes based on Molecular Data Integration) is proposed, a method that integrates multi-omics data to identify driver genes. DNA copy number variation, somatic mutation, and gene expression data from matched cancer samples are integrated. Using the method of the previous chapter, the "key genes" of a cancer are identified, and the changes in their expression levels together with the effects of mutated genes are used to evaluate whether each mutated gene is a potential driver. For a mutated gene, a mutation effect is defined that takes into account copy number variation, the mutation itself, and the gene's neighbors. The method has two main steps. The first step is data preprocessing: DNA copy number variation and somatic mutation data are integrated, and the integrated data are mapped, sample by sample, onto a given interaction network; the resulting diffusion values form the mutation impact matrix. The second step obtains the key genes with the iKGGE method of the previous chapter and constructs the connection matrix from the gene expression data of the key genes and the mutation impact matrix. Experiments on TCGA breast cancer and glioblastoma multiforme (GBM) data show that iPDG not only identifies known cancer driver genes effectively but also finds rare potential driver genes. Functional enrichment analysis shows that these genes are significantly associated with both cancers.

(3) To infer cancer progression at the gene level and the pathway level simultaneously, a probabilistic graphical model, PGM (Inference of Cancer Progression with Probabilistic Graphical Model from Cross-sectional Mutation Data), is proposed. It infers the temporal constraints and selectivity relationships among cancer driver gene mutations, represented as a directed acyclic graph. Then, based on the mutation probabilities of these driver genes, the waiting time between a mutation and subsequent mutations is modeled as a random function of the mutation probability conditioned on the mutation of the preceding gene, which yields the driver genes mutated in the same time period. Finally, the performance of PGM is evaluated on simulated data and on real cancer somatic mutation data. Experimental results and comparative analysis show that PGM captures the selectivity relationships of most driver gene mutations, most of which have been confirmed by previous studies. Furthermore, PGM provides new insights into simultaneously inferring driver pathways and the temporal order of their mutations from cross-sectional data.

(4) A complete framework, iMDPCP (An Integrated Framework for Identifying Mutated Driver Pathway and Cancer Progression), is introduced to identify mutated driver pathways and infer cancer progression at the pathway level from somatic mutation data. First, an uncertainty coefficient is used to quantify mutual exclusivity within driver pathways, and a computational framework based on an adaptive discrete differential evolution algorithm is developed to identify mutated driver pathways. Then, a cancer progression model for the driver pathways is constructed with a Bayesian network. Finally, the performance of iMDPCP is evaluated on real cancer somatic mutation datasets. The experimental results indicate that iMDPCP is more accurate than state-of-the-art methods as measured by enrichment of KEGG pathways, and that it provides new insights into identifying cancer progression at the pathway level.
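
The abstract says that iMDPCP quantifies mutual exclusivity with an uncertainty coefficient but does not give the formula. As a rough illustration only, the sketch below computes the standard uncertainty coefficient (Theil's U) between two binary mutation vectors. The function names and the toy mutation data are hypothetical, and the thesis may define its pathway-level score differently.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y), computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def uncertainty_coefficient(x, y):
    """Theil's U(X | Y): fraction of the uncertainty in X explained by Y."""
    h_x = entropy(x)
    if h_x == 0:
        # X is constant, so there is nothing to explain.
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: 1 = gene mutated in that patient, 0 = not mutated.
gene_a = [1, 1, 1, 1, 0, 0, 0, 0]
gene_b = [0, 0, 0, 0, 1, 1, 1, 1]  # never mutated together with gene_a
gene_c = [1, 0, 1, 0, 1, 0, 1, 0]  # statistically independent of gene_a

u_exclusive = uncertainty_coefficient(gene_a, gene_b)
u_independent = uncertainty_coefficient(gene_a, gene_c)
```

On this toy data the perfectly exclusive pair scores 1 and the independent pair scores 0. The coefficient measures dependence in general, so perfectly co-occurring mutations would also score 1; a practical exclusivity score would need an additional check on how often the two genes are mutated in the same patient.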
Keywords/Search Tags: Cancer-related gene sets, Driver genes, Driver pathways, Cancer progression, Bayesian network