| The tumor microenvironment (TME) is an environment suitable for tumor growth that tumor cells establish during tumor development. The extracellular matrix (ECM) is the main component of the TME and participates in tumor angiogenesis, signal transduction, proliferation and invasion. Several genes expressed in the ECM are considered important indicators of tumor prognosis, providing abundant targets for the development of tumor vaccines and anti-tumor drugs. A large amount of biomedical knowledge related to ECM genes has emerged, but it is scattered across a very large body of literature, which makes it difficult to use. To address this challenge, this thesis carried out a series of bioinformatics studies to systematically collect, organize and predict knowledge related to cancer-related extracellular matrix (C-ECM) genes. 1. Construction of the C-ECM gene knowledge graph. A knowledge graph is an important approach to collecting, organizing and presenting biomedical knowledge. To construct the knowledge graph of C-ECM genes, an ontology of ECM genes and related tumor diseases was first developed. Then, an ontology-based bio-entity recognizer was used to recognize and extract 325 candidate C-ECM genes that co-occurred with cancer entries in 48,712 PubMed abstracts. After three rounds of strict manual curation, 225 C-ECM genes with solid literature evidence were obtained, together with their biological process and function information. Further bioinformatics analysis shows that these genes tend to participate in cell proliferation and differentiation, signal transduction, angiogenesis, immunity and other functions. To facilitate use of this knowledge graph, this thesis presents a dedicated website, CECMAtlas, where users can retrieve C-ECM gene information, detailed function annotation and literature evidence by submitting gene, disease or biological process terms. As the first comprehensive database of C-ECM genes, CECMAtlas will be
helpful for understanding the relationship between C-ECM genes and tumorigenesis, and will provide clues for new tumor markers and drug targets. 2. Discovery of novel C-ECM genes based on the knowledge graph and deep learning. Knowledge of C-ECM genes is still limited, and the knowledge graph constructed in this thesis lays the foundation for the discovery of new C-ECM genes. Based on CECMAtlas, this thesis established a deep learning model based on MeSH. First, 397,896 abstracts and 388,632 gene-literature associations were integrated and transformed into word vectors. Then, the hidden relations between genes and MeSH terms were systematically examined with autoencoder methods. The resulting prediction model achieved satisfactory performance against the gold standard data set (area under the ROC curve: 0.67 to 0.88). Manual curation confirmed that 9 of the top 10 genes in the prediction results are supported by the literature, indicating that the model has some ability to discover new C-ECM genes. The construction and application of CECMAtlas will provide important data resources and research clues for related biomedical research. |
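
The area under the ROC curve reported above measures how often a model ranks a true C-ECM gene above a non-C-ECM gene. The following is a minimal sketch of that metric only, not the evaluation code used in the thesis; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen purely for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = gene in the gold standard, 0 = not in it.
labels = [1, 1, 0, 1, 0, 0]
# Hypothetical prediction scores from some model (higher = more likely C-ECM).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a frame of reference for the reported range of 0.67 to 0.88.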