Studies Of Tumor Information Mining Algorithms Based On Multiple-omics Data

Posted on:2020-11-09

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y N Hou

GTID:1364330572971722

Subject:Operational Research and Cybernetics

Abstract/Summary:

With the rapid development of life science and technology, a large amount of biological data has accumulated. This offers an opportunity to uncover the mysteries of life hidden in the data, but it also poses great challenges for computing, analyzing, and interpreting them. Bioinformatics, as a new interdisciplinary subject, draws on tools from mathematics, statistics, computer science, and biology to meet these opportunities and challenges. Cancer bioinformatics, one of the important research directions in bioinformatics, can accelerate cancer-related research such as cancer diagnosis and treatment, drug research and development, prognosis and survival analysis, and the study of the mechanisms of tumorigenesis and tumor development.

Tumor is a highly heterogeneous and complex disease. There are many hypotheses about the mechanism of tumorigenesis and development, but no consensus has been reached so far. Previous studies have shown that genome instability and mutation, and tumor-promoting inflammation, are the two major conditions that may lead to the emergence of cancer characteristics. Therefore, this study investigates cancer bioinformatics from two perspectives: cancer genome mutation and cancer-related inflammation.

Studies on the cancer genome have shown that there are a large number of gene mutations in all cancer cells, and that tumorigenesis and development are closely related to genetic abnormalities. A mainstream viewpoint is therefore that cancer is a genetic disease caused by abnormal DNA sequences in the genome of cancer cells. Based on this view, the concepts of cancer driver genes and passenger genes have been proposed and widely accepted. Driver genes are genes whose mutations confer a selective growth advantage on cancer cells, while passenger genes are genes whose mutations play little role in driving cancer. In view of the importance of
cancer driver genes, many studies have been devoted to developing algorithms to predict them. However, the ubiquitous heterogeneity of cancer makes it very challenging to distinguish cancer driver genes from the large number of passenger genes. Among existing methods, one class of prediction algorithms identifies cancer driver genes mainly by their significantly high mutation frequencies in cancer samples. However, such algorithms may incur systematic errors when processing heterogeneous cancer data, which makes it difficult to extract further useful information for identifying cancer driver genes from the gene mutation data, and they therefore suffer from low sensitivity and specificity. In addition, some studies have shown that additionally integrating protein-protein interaction (PPI) network data can improve the predictive power of such algorithms, but their sensitivity and specificity still need to be improved.

To predict cancer driver genes effectively from heterogeneous cancer data, this manuscript proposes a new algorithm, MaxMIF, based on effective data integration and motivated by an analysis of the characteristics of gene mutation data and the shortcomings of existing algorithms. When evaluated on 25 somatic mutation datasets obtained from the TCGA database and two PPI network datasets, the MaxMIF algorithm almost always significantly outperforms several state-of-the-art algorithms of the same kind in terms of predictive accuracy, sensitivity, and specificity. In addition, the MaxMIF algorithm is robust.

The MaxMIF algorithm has three main innovations. First, the preprocessing of the gene mutation data is improved by using a new gene mutation-score instead of the conventional gene mutation frequency. Existing prediction algorithms mainly calculate, from the gene mutation data, the mutation frequency of each candidate gene across patient samples. However, in heterogeneous cancer data, the total number of mutant genes varies greatly
among different patient samples, which leads to significant differences among cancer samples in their contribution to the mutation frequency and thus to systematic errors. To solve this problem, this study proposes a strategy of balancing the total contribution of gene mutations for each cancer sample. The concept of the gene mutation-score is then introduced to eliminate these systematic errors, so that more information useful for identifying cancer driver genes can be extracted from the gene mutation data. Second, a new Mutational Impact Function (MIF) is proposed to measure the mutational interaction among genes by improving the way gene mutation data and PPI network data are combined. The new measure is based on the gravitational model, in which mass and distance are quantified by the gene mutation-score and the "biological distance" between genes in the PPI network, respectively. Third, a new mutational impact network is established on the basis of the original PPI network and the MIF of genes. The Maximum Mutational Impact Function (MaxMIF) score of each gene over its adjacent genes is then calculated in the new network and used to generate the candidate gene list. With the improved preprocessing of gene mutation data and the improved integration of gene mutation data and PPI network data, the MaxMIF algorithm can significantly improve the ability to identify cancer driver genes.

In the study of cancer-related inflammation, the various immune cells and stromal cells that infiltrate the tumor microenvironment are a focus of cancer research. In view of the importance of inflammation and immunity to cancer, and the close relationship between chronic inflammation and cancer, this study explores, from the perspective of chronic inflammatory diseases, the critical factors that may affect the occurrence and development of early cancer. Through the analysis of the current research
status of deconvolution algorithms based on gene expression data, and the analysis of the cancer risk levels of various common chronic inflammatory diseases, this study explores the critical factors at the tissue level that may affect the cancer tendency of chronic inflammatory diseases. This may make it possible to discover the critical factors that affect the occurrence and development of early cancer, and may even have practical value for the prevention and early diagnosis of cancer.

Based on the above analysis, a series of innovative exploratory studies has been carried out in this research. The main contents are as follows. First, 10 common chronic inflammatory diseases, together with their corresponding gene expression data, were collected and classified into two categories according to their cancer risk levels. Second, based on the gene expression datasets of each disease, a series of gene sets representing specific functions of different cell types, called feature modules, were obtained with the ICTD algorithm. At the same time, differential expression analysis was performed on the gene expression data, and logarithmic fold changes (LFC) were used to measure the differential expression level of genes and feature modules, in order to remove tissue specificity and make different diseases comparable. Third, a new Multitask Feature Selection (MtFS) algorithm is proposed to screen for feature modules that are consistent within the same category and significantly different between categories, and thereby to screen for the critical tissue-level factors that may underlie the different cancer risk levels among chronic inflammatory diseases. Fourth, the MtFS algorithm was used to analyze the gene expression data of the above two categories of chronic inflammatory diseases. This study shows that the MtFS algorithm can effectively screen out features of tissue-infiltrating cell types that differ consistently between the two categories
of chronic inflammatory diseases, which points out a direction for future research.

In summary, in the research on cancer genome mutation and cancer-related inflammation, we developed the MaxMIF algorithm and the MtFS algorithm, respectively, to mine the cancer-related information hidden in multiple-omics data. The results show that both algorithms can effectively solve their corresponding problems and are highly practical. In addition, the MaxMIF algorithm has been implemented in C++. The software is user-friendly and can be downloaded and used freely from the following website: https://sourceforge.net/projects/maxmif/files/.
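
The three MaxMIF steps described in the abstract (a balanced mutation-score, a gravity-style impact between neighboring genes, and a per-gene maximum over neighbors) can be illustrated with a minimal Python sketch. This is not the author's C++ software, and the abstract does not give the exact formulas, so the following are assumptions made only for illustration: each sample spreads a total weight of 1 equally over its mutated genes, the impact between two adjacent genes is score_a * score_b / distance^2, and each PPI edge already carries a positive "biological distance". The function names and the toy gene and sample labels are hypothetical.

```python
from collections import defaultdict


def mutation_scores(sample_mutations):
    """Balanced mutation-score per gene.

    sample_mutations maps a sample id to the set of genes mutated in it.
    Assumption: every sample contributes a total weight of 1, split equally
    among its mutated genes, so heavily mutated samples do not dominate.
    """
    scores = defaultdict(float)
    for genes in sample_mutations.values():
        if not genes:
            continue  # a sample without mutations contributes nothing
        share = 1.0 / len(genes)
        for gene in genes:
            scores[gene] += share
    return dict(scores)


def max_mif_scores(scores, ppi_edges):
    """Maximum gravity-style impact of each gene over its PPI neighbors.

    ppi_edges is an iterable of (gene_a, gene_b, distance) with distance > 0.
    Assumption: impact(a, b) = score_a * score_b / distance**2.
    """
    best = {}
    for gene_a, gene_b, distance in ppi_edges:
        if distance <= 0:
            raise ValueError("distance must be positive")
        impact = scores.get(gene_a, 0.0) * scores.get(gene_b, 0.0) / distance ** 2
        for gene in (gene_a, gene_b):
            best[gene] = max(best.get(gene, 0.0), impact)
    return best


def rank_genes(best):
    """Candidate list: highest score first, ties broken by gene name."""
    return sorted(best, key=lambda gene: (-best[gene], gene))


# Toy data (hypothetical): three samples and a four-gene chain network.
toy_samples = {"s1": {"A", "B"}, "s2": {"A"}, "s3": {"A", "B", "C", "D"}}
toy_edges = [("A", "B", 1.0), ("B", "C", 2.0), ("C", "D", 1.0)]

toy_scores = mutation_scores(toy_samples)
toy_best = max_mif_scores(toy_scores, toy_edges)
toy_ranking = rank_genes(toy_best)
```

In this toy example, sample s3 has four mutated genes, so each of them receives only 0.25 from it, while s2 gives its single mutated gene a full 1.0. A plain frequency count would credit gene A equally (one count) in both samples; the balanced score is one plausible reading of how the per-sample contribution could be equalized.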

Keywords/Search Tags:

Cancer Bioinformatics, Omics data, Cancer driver genes, Differential expression analysis, Multitask feature selection

Related items

1	Prediction Method Of Cancer Driver Genes Based On Multi-omics Data
2	Cancer Driver Gene Identification Algorithm Based On Integrated Analysis Of Multi-omics Data And Network Models
3	Research On Driver Genes Discovery Algorithm Based On Cancer Omics Data And Network Analysis
4	The Research Of Gastric Cancer Feature Genes Selection Based On Gene Expression Data
5	Identification Of Driver Genes And Pathways In Cancer With Omics Data Based On High-throughput Sequencing
6	Identification Of Cancer Driver Genes Based On Machine Learning
7	Research On Integrating Multi-omics Data Mining Driver Genes Of Gastric Adenocarcinoma
8	Identification Of Cancer Driver Genes Based On Multiomics Data And Network Analysis
9	Research Of Cancer Biomarker Identification Algorithms Based On Multi-omics Integration Analysis
10	The Research Of Cancer Feature Genes Selection Based The Gene Expression Data