The tumour microenvironment (TME) is the environment that surrounds tumour cells, including blood vessels, immune cells, fibroblasts, myeloid-derived inflammatory cells, various signalling molecules and the extracellular matrix (ECM). The TME is the site where cancer cells develop or are eliminated. In addition to the diversity of genetic mutations, the types and relative proportions of cells in the TME and their spatial distribution (immune infiltration) are important features of cancer heterogeneity. This heterogeneity is closely related to the treatment and prognosis of cancer. Analysing the types and proportions of cells in the TME is therefore a prerequisite for analysing cancer mechanisms, discovering new therapeutic strategies and constructing accurate prognostic models. Single-cell sequencing is one way of analysing cell types and their relative abundance in the TME. In addition, previous studies have accumulated a large amount of biomolecular and clinical data (e.g. the TCGA database, which holds gene expression, DNA methylation and clinical data for over thirteen thousand tumour samples), and these data are an important resource for the study of cancer. However, most of these data come from tissue (or blood) samples, which measure the average signal of cell populations. Resolving the cell types and proportions in such data, and then analysing their TME information, is a fundamental step towards deeper mining of cancer big data. Deconvolution methods make it possible to analyse the cellular composition of complex cell samples by assuming that the measured gene expression values are a linear superposition of the expression values of each constituent cell type, so that cell types and proportions can be inferred from cell type-specific gene expression information. A well-performing deconvolution tool requires not only an efficient and robust deconvolution model but also a well-characterised set of marker genes (or regions derived from epigenomic data) with good
specificity and robustness.

The focus of this thesis is on cancer heterogeneity and the TME, including the construction of deconvolution methods and tools to resolve cell types and proportions, the development of prognostic models and the analysis of cancer mechanisms. Three areas are covered: (1) Two deconvolution methods and tools were developed, based on gene expression data and chromatin accessibility data respectively, and the molecular characteristics of acute myeloid leukaemia (AML) heterogeneity and its subtypes were then examined using these tools. (2) Breast cancer-specific reference gene expression profiles were constructed for the analysis of the TME in breast cancer patients; a prognostic model for drug treatment of breast cancer was developed; and mutations and immunological profiles were characterised in different breast cancer risk groups. (3) Molecular markers that can be used for both prognostic and diagnostic purposes were identified in pan-cancer.

(1) Deconvolution model and tool LinDeconSeq based on gene expression data. As there is still a lack of methods and tools to identify marker genes across an arbitrary variety of conditions, LinDeconSeq was developed to identify cell type marker genes and to predict the cellular composition of complex cell samples. LinDeconSeq's approach is: (i) to identify potential marker genes from an arbitrary number of cell types with a Shannon entropy specificity score and a mutual linear strategy; (ii) to use weighted robust regression (w-RLM) to predict cellular proportions in complex cell samples. On several publicly available datasets, marker genes identified by LinDeconSeq showed better accuracy and reproducibility than those from other marker gene prediction tools. For deconvolution, LinDeconSeq showed the smallest deviation between the "predicted proportion" and the "true proportion" on the benchmark datasets (RMSD ≤ 0.0958) and the highest Pearson correlation coefficient (r ≥ 0.8792). Using AML as an application, the cell proportions
predicted by LinDeconSeq have potential diagnostic power (AUC ≥ 0.91). Based on these cellular fractions, AML patients can be divided into two distinct subgroups, which differ significantly in prognosis and mutation patterns. Granulocyte-monocyte progenitor cells (GMP) were significantly enriched in Subgroup A and were strongly associated with a better prognosis and younger patients. In conclusion, LinDeconSeq has important applications in the identification of marker genes and the analysis of the cellular components of the tumour microenvironment (https://github.com/lihuamei/LinDeconSeq).

(2) Deconvolution model and tool DeconPeaker based on chromatin accessibility data (ATAC-seq). DeconPeaker was developed to predict cell types and relative proportions in complex cell samples from chromatin accessibility data. On multiple simulated and benchmark test datasets, DeconPeaker accurately predicted the relative proportions of different cell subpopulations in open chromatin data samples. Compared with nine other known deconvolution methods, it achieved the lowest root mean square error (RMSE = 0.042) and the highest mean correlation coefficient (r = 0.919) between the "predicted" and "true" proportions on the 12 test datasets. As an application, chromatin accessibility data from AML were analysed, and DeconPeaker successfully identified distinct cell types associated with AML progression. It was also found that the transcriptional regulatory layer (chromatin accessibility) and the gene expression layer differ sharply in their ability to characterise cell identity: cell type or cell state is usually associated with extracellular stimuli, so chromatin accessibility, which affects transcription factor binding to DNA, is more sensitive to cell identity. Thus, chromatin accessibility represents a more specific feature for cell type identification than gene expression. In summary, DeconPeaker can be an important tool
for probing the tumour immune microenvironment using chromatin accessibility data (https://github.com/lihuamei/DeconPeaker).

(3) From cellular infiltration assessment to a functional gene set-based prognostic model for breast cancer. The first breast cancer (BC)-specific reference gene expression profile (RGEP), BC-RGEP, was constructed from single-cell transcriptome data to predict cellular components in the microenvironment of breast cancer patients. When BC-RGEP was combined with a deconvolution model (LinDeconSeq) to predict cellular proportions in multiple breast cancer cohorts, the results showed that BC-RGEP can accurately estimate the relative composition of 15 cell types in the BC microenvironment. The cell proportions predicted with BC-RGEP classify breast cancer patients better than those from other, non-breast cancer-specific RGEPs (AUC = 0.895). In addition, a prognostic model based on 24 functional gene sets was developed by correlating BC-RGEP-predicted cell proportions with the activity of functional gene sets. The model effectively predicted overall survival in breast cancer patients (p = 5.9 × 10^-33, n = 1091, TCGA-BRCA cohort), chemotherapy response (p = 5.3 × 10^-3, n = 116, GSE5462 cohort) and immunotherapy response (p = 6.5 × 10^-3, n = 348, IMvigor210 cohort). Extension to pan-cancer analysis showed that functional gene set-based prognostic models also predicted prognosis significantly in 24 other cancer types, suggesting that functional gene sets may capture potential shared factors across cancer types.

(4) A prognostic gene expression signature revealed the involvement of mutational pathways in the cancer genome. Using expression and clinical data from 29 cancers (solid tumours) in the TCGA database to predict key prognostic genes, a comprehensive examination of prognostic genes in pan-cancer was completed by considering differences in survival, expression and associated mutational pathways of genes in different cancers. The results showed that: the number of prognostic and diagnostic genes varied
considerably across cancers; 22 genes with significant diagnostic and prognostic abilities were identified; universal prognostic genes (CDC20, CDCA8, ASPM, ERCC6L and GTSE1) were mainly involved in biological processes such as the spindle assembly checkpoint; and genes with both prognostic and diagnostic ability were significantly associated with frequent mutations in TP53, MAPK, PI3K and AKT-related pathways.

This project systematically investigates methods for predicting the composition of cell types in complex cell samples and develops corresponding deconvolution tools, which are important for understanding heterogeneity, prognosis and clinical response to therapy in the TME of cancer patients.
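
To make the linear-superposition assumption behind deconvolution concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the LinDeconSeq or DeconPeaker implementation: the function names, the toy signature matrix and the specificity formula (one minus normalised Shannon entropy) are hypothetical, and the proportions are estimated with plain non-negative least squares solved by projected gradient descent rather than the weighted robust regression (w-RLM) used in the thesis.

```python
import math


def shannon_specificity(expr):
    """Entropy-based specificity of one gene across cell types.

    Assumed formula (not taken from the thesis): 1 - H / log2(n),
    giving 1.0 for a gene expressed in a single cell type and
    0.0 for a gene expressed uniformly in all of them.
    """
    total = sum(expr)
    n = len(expr)
    if total <= 0 or n < 2:
        return 0.0
    probs = [x / total for x in expr]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log2(n)


def deconvolve(signature, bulk, n_iter=5000):
    """Estimate cell-type proportions f from bulk ~ signature * f.

    signature: list of gene rows, each a list of expression values
               per cell type (marker genes x cell types).
    bulk:      list of mixed-sample expression values per gene.
    Solves non-negative least squares by projected gradient descent,
    then rescales the estimate so the proportions sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # 1 / squared Frobenius norm never exceeds 1 / Lipschitz constant,
    # so this fixed step size is safe for the least-squares objective.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        resid = [
            sum(s * fk for s, fk in zip(row, f)) - b
            for row, b in zip(signature, bulk)
        ]
        grad = [
            sum(signature[g][k] * resid[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto f >= 0.
        f = [max(0.0, fk - step * gk) for fk, gk in zip(f, grad)]
    total = sum(f)
    return [fk / total for fk in f] if total > 0 else f


if __name__ == "__main__":
    # Toy signature: 4 genes x 3 cell types (hypothetical values).
    signature = [
        [10.0, 1.0, 1.0],
        [1.0, 8.0, 1.0],
        [1.0, 1.0, 9.0],
        [5.0, 5.0, 5.0],
    ]
    true_fractions = [0.5, 0.3, 0.2]
    # Simulate a mixed sample as a linear superposition of cell types.
    bulk = [
        sum(s * f for s, f in zip(row, true_fractions)) for row in signature
    ]
    print("specificity per gene:",
          [round(shannon_specificity(row), 3) for row in signature])
    print("estimated fractions:",
          [round(x, 3) for x in deconvolve(signature, bulk)])
```

In this toy setting the last gene is expressed equally in all three cell types, so its specificity score is close to zero and it would make a poor marker, whereas the first three genes score higher. A robust regression such as w-RLM would additionally down-weight outlier genes, which this sketch does not attempt.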