| Cancer is a complex, heterogeneous disease that seriously endangers human health, and researchers have studied it for decades. With the continuous development of high-throughput sequencing technology and its declining cost, large amounts of multi-omics data have been generated. It is generally believed that biological data at different levels jointly affect and regulate multiple biological processes, providing more reliable information for studying the formation and development of cancer. Therefore, multi-omics data integration, an important computational approach that makes comprehensive use of different omics data to study cancer-related questions, has attracted wide attention in bioinformatics in recent years. Identifying molecular subtypes of cancer is one of the most important topics in cancer research. Patients with different cancer subtypes differ significantly in clinical practice, so each patient needs accurate diagnosis and treatment according to their subtype. How to identify cancer subtypes through data integration has become a key issue in cancer research. This dissertation focuses on multi-omics data integration and cancer subtyping: it evaluates existing multi-omics data integration methods, explores the impact of integrating different omics data on cancer subtyping, develops a platform for evaluating and comparing data integration methods, and proposes a method for integrating bulk and single-cell transcriptome data. Specifically, the main research contents and contributions of this dissertation are as follows: (1) Researchers have proposed many multi-omics data integration methods for cancer subtyping. Each method has its own drawbacks, which stem from the mathematical model or integration strategy it uses. A comprehensive evaluation and comparison of these methods can help researchers choose the best-performing methods and obtain more accurate cancer subtyping results. However, due to
the lack of gold standards, it is very difficult to evaluate these multi-omics data integration methods. In this dissertation, we carried out a comprehensive evaluation and comparison of multi-omics data integration methods for cancer subtyping. We constructed three classes of benchmarking datasets for nine cancers in TCGA by considering all eleven combinations of four multi-omics data types, which belong to three different omics. Using these datasets, we comprehensively evaluated ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. From our experiments and analyses, we observed that NEMO and SNF perform very well on all three criteria, i.e., accuracy, robustness, and computational efficiency. The other methods have certain limitations. In summary, NEMO and SNF are recommended for general cancer subtyping tasks; however, users may also want to consider the other eight methods depending on their specific purpose or under certain circumstances. (2) When data integration methods are used to identify cancer subtypes, which omics data to select for integration, and whether different omics data and their combinations have the same effect on cancer subtyping, are important questions that urgently need study. This dissertation explored the impact of integrating different omics data on cancer subtyping. Using our accuracy results, we analyzed the influence of different omics data types and their combinations on cancer subtyping. Our analysis shows that the influence of these omics data types varies, and that several commonly used combinations of omics data types can indeed improve the accuracy of all ten methods as measured by both clustering and clinical metrics. On the other hand, our analysis indicates that integrating more types of omics data may negatively affect cancer subtyping performance, refuting the
widely held intuition that incorporating more types of omics data always helps produce better results. (3) The emergence and development of single-cell sequencing technology provide a data basis for studying the interactions between different cell types in tumors and their microenvironments, and also suggest a new approach: integrating bulk and single-cell-level data to study cancers in specific tumor microenvironments. This dissertation proposed a method, ctSubtype (cell type-based cancer subtype), which identifies cancer subtypes by integrating bulk and single-cell transcriptome data to construct cell type-resolved single-sample networks (ctSSN). ctSubtype first constructs a single-sample network from the bulk data and a cell type-specific network from the scRNA-seq data. Subsequently, the single-sample network of each sample is integrated with each cell type-specific network to construct a sample-cell type network, i.e., a ctSSN. Then, for each cell type, the cosine similarity between the sample-cell type networks of every pair of samples is calculated to construct a cell type-specific sample-sample similarity network, and the sample-sample similarity networks of all cell types are integrated to obtain a global sample-sample similarity network. Finally, spectral clustering is applied to the global similarity network to obtain cancer subtyping results with cell type specificity. ctSubtype is applied to breast cancer: by integrating BRCA bulk and scRNA-seq gene expression data, clinically significant BRCA subtypes are identified. (4) Comprehensive evaluation and comparison of multi-omics data integration methods can guide users to choose the best method for their own needs, but such tools are still lacking. This dissertation developed CEPICS, an extensible and easy-to-use platform for comparing and evaluating multi-omics data integration methods. Using CEPICS, researchers can easily obtain the cancer subtyping results generated by the five built-in
state-of-the-art multi-omics data integration methods, together with a comprehensive comparison and evaluation of those results. Because subtyping results depend on the chosen number of subtypes, and different methods produce inconsistent results, CEPICS combines the results of different methods under different numbers of clusters, draws a global sample-sample similarity heatmap, and gives a robust and reliable prediction of the similarity between samples. In addition, users can upload the subtyping results of their own methods to compare them with the built-in methods. In summary, CEPICS provides a useful tool for research on data integration and cancer subtyping. |
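
As an illustration of how subtyping results from different methods and different numbers of clusters might be combined into one sample-sample similarity matrix, the sketch below computes a co-association matrix: the fraction of clusterings in which two samples fall into the same cluster. This is only one plausible combination rule, assumed here for illustration; the abstract does not state the rule CEPICS actually uses, and the function name `co_association` and the toy labels are hypothetical.

```python
def co_association(label_sets):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    clusterings in which samples i and j share a cluster label.

    label_sets: list of clusterings, each a list of n cluster labels
    (one clustering per method and per number of clusters).
    NOTE: an assumed combination rule, not CEPICS's documented one.
    """
    if not label_sets:
        raise ValueError("need at least one clustering")
    n = len(label_sets[0])
    if any(len(labels) != n for labels in label_sets):
        raise ValueError("all clusterings must label the same samples")

    counts = [[0.0] * n for _ in range(n)]
    for labels in label_sets:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0

    runs = len(label_sets)
    return [[value / runs for value in row] for row in counts]


# Toy input: four samples clustered three times (hypothetical labels).
label_sets = [
    [0, 0, 1, 1],  # hypothetical method A with 2 clusters
    [0, 0, 1, 2],  # hypothetical method A with 3 clusters
    [1, 1, 0, 0],  # hypothetical method B with 2 clusters
]
similarity = co_association(label_sets)
```

Because only label equality within a clustering is compared, the matrix is unaffected by how each method happens to number its clusters, and its entries can be plotted directly as a heatmap.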