| With the rapid development of deep sequencing technology, a large amount of high-throughput omics data can be used to study the mechanisms of carcinogenesis at the molecular level. Studies have shown that the genesis and progression of cancer are regulated by modules or pathways rather than by single genes, and that active modules are associated with the development and progression of cancer. An active module is a group of genes that participate in a biological signaling pathway, or whose protein products interact closely, and that is associated with the development and progression of tumors. The main work of this paper on this problem is as follows. First, gene prioritization methods were studied based on gene expression data and protein-protein interaction network data. Using information on the neighboring nodes of each gene, a regression model based on the p-step random walk kernel was established to calculate a gene active score. Based on the active score of a gene and its degree in the protein-protein interaction network, two gene prioritization methods, GEPR1 and GEPR2, were proposed. GEPR1 uses the weighted relative difference of active score and degree to determine gene priority. To avoid the influence of human factors in setting the weights, GEPR2 is based on the Pareto Optimality Consensus (POC) strategy: the gene prioritization is obtained by using Pareto optimality to determine the dominance relationship between each pair of genes. GEPR1, SigMod, LEAN and RegMod were compared on breast cancer and cervical cancer data. Compared with SigMod, LEAN and RegMod, GEPR1 identified more cancer-related genes annotated in the OMIM and CCDB databases in the top 100 to top 800 prioritized gene lists. GEPR2 in turn performed better than GEPR1 in the top 100 to top 800 gene lists. Second, an active module identification model and a greedy search algorithm NSEA (Node
Set Expansion Algorithm) are proposed based on the gene prioritization produced by the above methods. The algorithm introduces gene proximity to quantify how closely two genes are related in the protein-protein interaction network, and on this basis introduces gene-module proximity to quantify how closely a gene is related to a module. It then uses a greedy strategy guided by proximity and active score to expand the module, in order to obtain an active module with a high active score and strong connectivity. Experiments on breast cancer and cervical cancer data, compared against SigMod, LEAN and RegMod, show that the modules identified by NSEA have higher fold enrichment and higher standardized connection strength. Many genes in the modules identified by NSEA were enriched in cancer-related signaling pathways, and most of the identified genes were oncogenes or tumor suppressor genes confirmed by previous literature. In addition, NSEA detected many cancer-related genes that were missed by SigMod, LEAN and RegMod. On simulated data sets, the precision, recall and F1-score of NSEA, SigMod, LEAN and RegMod were compared and analyzed. The simulation results further confirm that NSEA can identify candidate modules with strong connectivity in a larger network and performs well in precision, recall and F1-score. In summary, this paper studies the identification of active modules and proposes two gene prioritization methods, GEPR1 and GEPR2, and an active module identification algorithm, NSEA. The experimental results suggest that these methods may be useful complementary tools for identifying active modules associated with cancer and may aid in understanding the pathogenesis of cancer. |
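
The p-step random walk kernel mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common textbook form K = (aI - L)^p, where L is the normalized Laplacian of the interaction network and a >= 2. The abstract does not say how the kernel enters the regression model, so only the kernel itself is shown. The function name, the parameter defaults and the toy network are hypothetical.

```python
import numpy as np


def p_step_random_walk_kernel(adjacency, p=2, a=2.0):
    """Return K = (a*I - L)^p, with L the normalized graph Laplacian.

    Assumed textbook form of the kernel; not taken from the paper itself.
    """
    if a < 2:
        raise ValueError("a must be >= 2 so that the kernel is positive semidefinite")
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    degree = A.sum(axis=1)
    # Scale by 1/sqrt(degree); isolated nodes keep a scale of 0,
    # which avoids division by zero.
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    # Normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
    laplacian = np.eye(n) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.linalg.matrix_power(a * np.eye(n) - laplacian, p)


# Toy 3-node path network (made-up data): gene0 - gene1 - gene2
toy_network = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
print(p_step_random_walk_kernel(toy_network, p=3))
```

Larger values of p let similarity spread over longer walks in the network, which is how the kernel brings in information from more distant neighbors.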
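
The idea of ranking genes by Pareto dominance instead of hand-set weights can be sketched as follows. This is an illustration of plain non-dominated sorting on two criteria, not the POC strategy itself, whose details the abstract does not give. It assumes that larger values of both active score and degree are preferred. The gene names and values are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion
    and strictly better on at least one (larger is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_layers(criteria):
    """Peel off successive non-dominated fronts.

    criteria maps a gene name to a tuple (active_score, degree).
    Genes in earlier layers get higher priority.
    """
    remaining = dict(criteria)
    layers = []
    while remaining:
        # A gene is in the current front if no other remaining gene dominates it.
        front = [gene for gene, values in remaining.items()
                 if not any(dominates(other, values)
                            for name, other in remaining.items()
                            if name != gene)]
        layers.append(sorted(front))
        for gene in front:
            del remaining[gene]
    return layers


# Hypothetical genes with made-up (active score, degree) pairs.
genes = {
    "geneA": (0.9, 12),
    "geneB": (0.7, 30),
    "geneC": (0.6, 10),
    "geneD": (0.2, 3),
}
print(pareto_layers(genes))
# [['geneA', 'geneB'], ['geneC'], ['geneD']]
```

Here geneA and geneB share the first layer because neither is better on both criteria, so no weight is needed to trade one criterion against the other.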