Cancer is a major threat to human health. Over the past few decades, many investigators have devoted themselves to studying the main pathogenic factors of cancer, and genes related to different cancer types have been discovered. However, cancer-related genes that remain hidden are still waiting to be discovered. The topological structure of the protein interaction network has become an important resource for studying cancer-related genes. Network embedding methods can exploit this structure to obtain feature vectors for gene nodes, and machine learning algorithms can then learn from these feature vectors to build models that identify potential cancer-related genes. Based on these ideas, this article proposes prediction models for two important types of cancer-related genes: oncogenes and tumor suppressor genes. The main contents are as follows.

Extensive research on tumor suppressor genes helps to understand the pathogenesis of cancer and to design effective treatments. However, identifying tumor suppressor genes with traditional experiments is costly and time-consuming, so effective computational methods are needed to screen for potential tumor suppressor genes. Several computational methods have been proposed to predict new tumor suppressor genes, but most of them do not include a learning process that extracts the basic attributes of validated tumor suppressor genes, which reduces their efficiency. In this study, a novel computational method is proposed to identify potential tumor suppressor genes. To this end, we downloaded validated tumor suppressor genes from the TSGene database (version 1.0). These tumor suppressor genes, together with other genes, are represented by features extracted from a protein interaction network via the network embedding method Mashup. Several random forest models were then constructed and used to predict potential tumor suppressor genes. Evaluated against the validated tumor suppressor genes in the TSGene database (version 2.0), our method performs better than some previously proposed methods.

An oncogene is a gene that can promote the occurrence of tumors, and the study of oncogenes helps to understand the causes of cancer. Biological experiments were once the prevailing way to detect oncogenes, but in recent years their shortcomings, such as high cost and long duration, have become increasingly obvious. Considering the limitations of some previous computational methods, this research proposes a novel computational method for identifying oncogenes. It constructs a protein interaction network and adopts the network embedding method Mashup to extract features from this network. The classic machine learning algorithm random forest is applied to these features to capture the essential information of oncogenes and thereby build the prediction model. All genes are ranked according to the scores produced by the prediction model. Evaluated with classic evaluation metrics, the method in this article performs better than some other methods. The top-ranked unlabeled genes are completely different from the potential oncogenes discovered by previous methods, and they are likely to be novel oncogenes.
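The workflow summarized above (embedding features for each gene, several random forest models, and a ranked list of unlabeled genes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. The feature matrix is random toy data standing in for Mashup embeddings. The function name `rank_unlabeled_genes`, the negative-sampling scheme (drawing as many unlabeled genes as there are validated genes for each model), and all parameter values are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_unlabeled_genes(features, is_validated, n_models=10, seed=0):
    """Score unlabeled genes by averaging several random forest models.

    features     : (n_genes, n_dims) array, e.g. network-embedding vectors
    is_validated : boolean array, True for validated cancer-related genes
    Returns the indices of unlabeled genes sorted best-first and their mean scores.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(is_validated)
    unl_idx = np.flatnonzero(~is_validated)
    score_sum = np.zeros(len(features))
    score_cnt = np.zeros(len(features))

    for m in range(n_models):
        # Assumption: a random subset of unlabeled genes serves as negatives.
        neg_idx = rng.choice(unl_idx, size=len(pos_idx), replace=False)
        train_idx = np.concatenate([pos_idx, neg_idx])
        labels = np.concatenate([np.ones(len(pos_idx), dtype=int),
                                 np.zeros(len(neg_idx), dtype=int)])

        forest = RandomForestClassifier(n_estimators=100, random_state=seed + m)
        forest.fit(features[train_idx], labels)

        # Score only the unlabeled genes this model was not trained on.
        held_out = np.setdiff1d(unl_idx, neg_idx)
        score_sum[held_out] += forest.predict_proba(features[held_out])[:, 1]
        score_cnt[held_out] += 1

    # Average over the models that scored each gene (0 if never scored).
    mean_score = score_sum[unl_idx] / np.maximum(score_cnt[unl_idx], 1)
    order = np.argsort(-mean_score)
    return unl_idx[order], mean_score[order]


# Toy data (hypothetical): 200 genes with 16-dimensional embeddings.
# Genes 0-39 are shifted to mimic cancer-related genes; only 0-24 are "validated".
toy_rng = np.random.default_rng(42)
toy_features = toy_rng.normal(size=(200, 16))
toy_features[:40] += 2.0
toy_validated = np.zeros(200, dtype=bool)
toy_validated[:25] = True

ranked, scores = rank_unlabeled_genes(toy_features, toy_validated)
```

Here `ranked` lists the unlabeled genes from most to least likely to be cancer-related, so the top entries play the role of the candidate genes that would be checked against a later database release or the literature. Averaging several models trained on different negative samples is one common way to handle the lack of confirmed negative genes; other sampling schemes are equally plausible.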