Mining Large-scale Tumor Transcriptome Profiles To Inform Cancer Heterogeneity And Immune Microenvironment

Posted on: 2020-09-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
GTID: 1360330599952417
Subject: Bioinformatics
Abstract/Summary:
Cancer is one of the major diseases that threaten human health. Every year, tens of millions of people around the world are diagnosed with cancer. In China, the incidence and mortality of malignant tumors have remained high. In recent years, owing to China's aging population and worsening environmental pollution, both indicators have continued to rise. Cancer is composed of cells with structural, functional, and metabolic abnormalities and uncontrolled proliferation, induced by long-term interactions among various carcinogenic factors, such as endogenous genetic variation, a pro-tumor microenvironment, and the external environment. Cancer is a complex and highly heterogeneous disease, and its pathogenic mechanism, molecular mechanism, and evolutionary process remain unclear. With the development and widespread application of high-throughput technologies, such as microarrays, next-generation sequencing, and mass spectrometry, researchers have turned from traditional single-molecule studies to integrated analysis of large-scale high-throughput data. Important progress has been made in this field. For example, molecular classification has provided important insights into tumor heterogeneity, and the discovery of molecular markers for tumor diagnosis and prognosis has become an important basis for early diagnosis and precise treatment of tumors. With the widespread use of high-throughput technologies and the maturing of data-sharing strategies, international public databases, such as Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), and International Cancer Genome Consortium (ICGC), have accumulated large-scale tumor omics data. Cancer research has entered the era of “Big Data”. In this context, the data-driven approach has become one of the important models of tumor bioinformatics research. The integrative mining of these data can provide valuable information for studying tumor heterogeneity and developing new diagnostic and therapeutic methods. In this dissertation, based on large-scale public cancer omics data and specific questions in cancer research, the author conducted research in the following three areas.

First, with respect to intra-tumor heterogeneity, the author used TCGA transcriptome data to obtain a stem-like gene enrichment (SGE) landscape of 9854 tumor samples and 696 adjacent normal tissue samples across 32 cancer cohorts. Cancer stem-like cells, marked by the ability to both self-renew and differentiate into non-tumorigenic progeny, are considered a main cause of tumorigenesis, metastasis, recurrence, and drug resistance. Therefore, identifying the stem-like cells in tumor tissues and exploring their potential function and regulatory mechanisms will increase our understanding of tumors. In this study, the author first identified a stem-like gene signature through a meta-analysis of previously published stemness gene sets. Using this signature and the single-sample gene set enrichment analysis method, the author obtained SGE scores for all TCGA samples and found that 1) SGE scores differ significantly among tumor types; 2) in most tumor types, the SGE scores of tumor samples are significantly higher than those of normal tissues; 3) a high SGE score is closely related to poor patient prognosis in many tumor types; and 4) multi-platform data analysis shows that tumor stemness could be regulated by various mechanisms.

Second, with respect to inter-tumor heterogeneity, the author used public microarray datasets to identify a long non-coding RNA (lncRNA) signature that predicts the survival of patients with early-stage non-small-cell lung cancer (NSCLC). In recent years, deep sequencing studies have uncovered complex stratification and regulatory relationships in the transcriptome, and revealed that only about one-fifth of the transcripts of the human genome come from protein-coding genes. LncRNAs are a class of non-coding RNAs more than 200 nt in length. Studies have shown that lncRNAs are involved in the regulation of tumor development, invasion, and metastasis. In this study, we first obtained lncRNA expression profiles from mRNA expression microarray datasets by using a re-annotation method. Then, based on these lncRNA profiles, we identified a seven-lncRNA signature that was significantly associated with overall survival in the training set. We further validated the prognostic value of the signature in the testing set and in three additional independent testing sets. Cox regression analysis shows that the lncRNA signature is an independent prognostic factor for early-stage NSCLC. Our results suggest that the seven-lncRNA signature may have clinical implications in NSCLC.

Third, with respect to the tumor immune microenvironment, the author developed a novel method to detect T and B cell receptor hypervariable sequences from RNA-seq data. Tumor-infiltrating lymphocytes are key regulators of tumor immunity and are primary targets of cancer immunotherapies. Identifying the tumor-specific infiltrating T cell receptor (TCR) and B cell immunoglobulin (Ig) repertoire is critical to understanding tumor-immune interactions. Extracting the hypervariable complementarity-determining region 3 (CDR3) from tumor RNA-seq data is particularly interesting, because it allows direct analysis of the infiltrating immune cell repertoire. In this study, we present a new computational method for de novo assembly of sequences from CDR3 regions using RNA-seq data. By applying it to large-scale tumor datasets, we identified a large number of tumor-infiltrating TCR and Ig CDR3 sequences. These sequence data might be useful for the early diagnosis of tumors and for the development of cancer vaccines and novel immunotherapy strategies.

Extracting valuable information from massive data has been an important part of data science. However, because data differ in nature and research questions differ in focus, researchers often need to design different analysis algorithms for different problems. Owing to the limitations of high-throughput technology itself and the complexity of histological samples, the transcriptome data generated by microarrays or second-generation sequencing technologies usually contain much information that is yet to be discovered. In the three parts of the research presented here, the author developed bioinformatics algorithms suited to the specific characteristics of tumor high-throughput data and extracted much valuable information from large-scale omics data of tumor samples. These efforts not only increased statistical power by increasing data volume, but also achieved a critical transition from public data to "new" value. In addition, the tools developed in this study can be used in future related research.
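
The SGE scoring step described in the first study can be illustrated with a minimal sketch. The Python function below is a simplified single-sample enrichment score in the spirit of ssGSEA, not the author's implementation: genes in one sample are ordered by expression, signature genes contribute rank-weighted steps, all other genes contribute uniform steps, and the score is the summed difference between the two running totals. The function name `enrichment_score`, the default `alpha=0.25`, and the toy gene names are assumptions made for illustration only.

```python
def enrichment_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one sample.

    expression: dict mapping gene name -> expression value (one sample)
    gene_set:   iterable of gene names forming the signature
    alpha:      exponent applied to ranks when weighting signature genes
                (assumed default, chosen for illustration)
    """
    members = set(gene_set) & set(expression)
    n_genes = len(expression)
    n_in = len(members)
    if n_in == 0 or n_in == n_genes:
        raise ValueError("gene set must overlap the profile without covering it")

    # Order genes from highest to lowest expression; the top gene gets rank n_genes.
    ordered = sorted(expression, key=expression.get, reverse=True)
    weights = {g: (n_genes - i) ** alpha for i, g in enumerate(ordered)}

    total_in = sum(weights[g] for g in members)
    step_out = 1.0 / (n_genes - n_in)

    # Walk down the ranked list, accumulating the gap between the weighted
    # running fraction of signature genes and the running fraction of the rest.
    p_in = p_out = score = 0.0
    for g in ordered:
        if g in members:
            p_in += weights[g] / total_in
        else:
            p_out += step_out
        score += p_in - p_out
    return score


# Toy profile with hypothetical gene names (not real data).
profile = {"GENE_A": 10.0, "GENE_B": 8.0, "GENE_C": 2.0, "GENE_D": 1.0}
high = enrichment_score(profile, ["GENE_A", "GENE_B"])  # signature at the top
low = enrichment_score(profile, ["GENE_C", "GENE_D"])   # signature at the bottom
```

In this toy profile, a signature made of the most highly expressed genes yields a positive score and one made of the least expressed genes yields a negative score. Applying such a function to each sample in turn gives one score per sample, which is the kind of per-sample value that can then be compared across tumor types or between tumor and normal tissue.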
Keywords/Search Tags: Data mining, Cancer stem cell, Long non-coding RNA, Prognosis signature, Cancer immunology, T cell receptor, B cell receptor, Complementarity-determining region 3