| Background and purpose: With the development of modern high-throughput omics measurement platforms, omics data at multiple biological levels, such as the genome, transcriptome, proteome, and metabolome, have accumulated, and data from various omics sources have contributed significantly to research on complex diseases. Module identification is one of the most commonly used approaches for studying these omics expression profiles. It divides genes into co-expression modules. Genes in the same module have similar or opposite expression patterns. They tend to be functionally related and co-regulated, and the identified modules can be used for various subsequent analyses. Many algorithmic tools have been developed for gene module identification. However, most existing methods are based on a single omics layer and do not consider the complex interactions across biological levels. Recent research shows that integrating information from multiple omics levels provides a more systematic and comprehensive understanding of biological systems and the molecular mechanisms of disease development. This motivates integrating multi-omics data for gene module identification. Moreover, since co-expressed genes in the same module are usually functionally related or co-regulated, introducing known molecular interactions (e.g., transcriptional regulatory interactions, protein-protein interactions, and biological pathways) should help improve module detection. Therefore, we present a novel data integration framework, Correlation-based Local Approximation of Membership (CLAM). This framework integrates multi-omics data, introduces known molecular interactions into the gene module identification process, and achieves higher accuracy in gene module detection. In addition, we propose a new module-based survival analysis method. Methods: First, we present a novel analytical
framework referred to as CLAM based on three methodological innovations: 1) constructing a trans-omics neighborhood matrix by integrating the k-nearest neighbor matrices obtained from different data sources; 2) using known molecular interactions to adjust the trans-omics neighborhood matrix; and 3) applying a local approximation procedure to define gene modules. Using RNA-seq and MS data from human CRC and mouse B-cell differentiation, we then conducted a comprehensive evaluation of 12 module detection methods, including 7 integrative clustering methods (CLAM1, CLAM2, iNMF, Lemon-Tree, moCluster, iCluster, and jNMF) and 5 individual clustering methods (CLAM3, FLAME, WGCNA, k-means, and ICA). Among these methods, CLAM1 uses both known molecular interactions and multi-omics data to assist module detection; CLAM2 uses multi-omics data but no known molecular interactions; and CLAM3 uses known molecular interactions but no multi-omics data. By comparing the detected modules with sets of known modules obtained from databases, we calculated an overall score from precision, recall, relevance, and recovery to evaluate each method's ability to reconstruct known modules. Finally, we applied CLAM to CRC to explore potential biomarkers and understand the molecular mechanism of CRC occurrence and development. Transcription factor enrichment analysis and KEGG pathway enrichment analysis were performed on the resulting CRC modules. We also compared our proposed survival analysis method, which groups patients based on the standard deviation of gene expression, with two other survival analysis methods (grouping patients based on single-gene expression and average gene expression, respectively). Results: CLAM1 and CLAM3 obtained the highest overall scores in the comparison with available modules, indicating that using known molecular interactions can improve the agreement with available modules and significantly improve
the module detection performance. CLAM2 outperformed most of the existing integrative clustering methods but showed no significant advantage over FLAME or WGCNA. This indicates that integrating datasets from different sources contributed little to the overall score. However, the GO enrichment analysis shows that data integration can improve the discovery of functional annotations. The enrichment analysis of CRC modules revealed the TFs and KEGG pathways that play an essential role in the development of CRC. The module-based survival analysis helps us explore potential biomarkers of CRC and understand the molecular mechanism of CRC occurrence and development. Conclusion: In summary, utilizing known molecular interactions can improve the agreement with available modules, while data integration can enhance the discovery of functional annotations. By integrating multi-omics data and introducing known molecular interactions, CLAM has a superior ability to reconstruct the modular structures of complex biological systems and to identify biomarkers for complex diseases. |
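The first two CLAM steps described in the Methods (integrating per-layer k-nearest neighbor matrices, then adjusting the result with known molecular interactions) can be illustrated with a minimal sketch. The abstract does not specify how the matrices are combined or adjusted, so everything concrete here is an assumption made for illustration: neighbors are ranked by absolute Pearson correlation, layers are combined by averaging their binary kNN matrices, and known interactions add a fixed bonus. The names `knn_matrix`, `trans_omics_neighborhood`, and `boost` are hypothetical and not taken from the CLAM implementation.

```python
import numpy as np


def knn_matrix(expr: np.ndarray, k: int) -> np.ndarray:
    """Binary gene-by-gene kNN matrix for one omics layer.

    expr has shape (n_genes, n_samples). Entry [i, j] is 1 when gene j is
    among the k genes most correlated with gene i.
    Assumption: similarity is absolute Pearson correlation, so genes with
    opposite expression patterns also count as neighbors.
    """
    corr = np.abs(np.corrcoef(expr))
    np.fill_diagonal(corr, -np.inf)  # a gene is never its own neighbor
    nearest = np.argsort(-corr, axis=1)[:, :k]  # top-k columns per row
    knn = np.zeros(corr.shape)
    rows = np.repeat(np.arange(corr.shape[0]), k)
    knn[rows, nearest.ravel()] = 1.0
    return knn


def trans_omics_neighborhood(layers, k, known=None, boost=1.0):
    """Combine per-layer kNN matrices into one neighborhood matrix.

    layers: list of (n_genes, n_samples) arrays sharing the same gene order.
    known:  optional (n_genes, n_genes) 0/1 matrix of known interactions.
    Assumption: layers are averaged, and each known interaction adds a
    fixed bonus `boost` to the matching entry.
    """
    combined = np.mean([knn_matrix(x, k) for x in layers], axis=0)
    if known is not None:
        combined = combined + boost * known
    return combined


# Toy data: 6 genes measured in 10 samples on two layers.
rng = np.random.default_rng(0)
rna = rng.normal(size=(6, 10))
protein = rng.normal(size=(6, 10))
known = np.zeros((6, 6))
known[0, 1] = known[1, 0] = 1.0  # one hypothetical known interaction

neighborhood = trans_omics_neighborhood([rna, protein], k=2, known=known)
```

In this sketch, an entry of the combined matrix is larger when two genes are neighbors in more layers or share a known interaction. The third step, the local approximation procedure that defines the modules, is not described in enough detail here to sketch.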