Background: Multi-omics integration analysis is a statistical approach that studies biological samples systematically by combining multiple omics data types, including genomics, transcriptomics, proteomics, metabolomics, microbiomics, and radiomics, to explore the interactions among the various substances in biological systems. The primary objective of multi-omics integration analysis is to identify or reclassify disease subtypes through clustering. A variety of multi-omics integrative clustering methods have been proposed, but they were developed with different combinations of omics data and different evaluation indicators, so their performance needs to be compared and evaluated systematically. Existing comparative studies have fully considered characteristics of multi-omics data such as small sample size, high dimensionality, and strong noise, and have compared the methods using various evaluation indicators. However, beyond these inherent characteristics, multi-omics data that follow the central dogma are also strongly correlated across omics levels, and most studies did not account for these correlations when generating data. How to generate multi-omics data from correlation information, and how multi-omics integrative clustering methods perform on correlated data, therefore merit further investigation. In this study, eight representative multi-omics integrative clustering methods (LRAcluster, SNF, CIMLR, MoCluster, intNMF, MCIA, iClusterPlus, and PINSPlus) were evaluated in different scenarios through statistical simulation experiments in terms of the optimal number of clusters, classification accuracy, and variable importance. In addition, the multi-omics integrative clustering methods were used to integrate the multi-omics data of patients with coronavirus disease 2019 (COVID-19), explore the association of the clustering subtypes
with standard subtypes in "Novel Coronavirus Infection Prevention and Control Consensus Diagnosis and treatment of novel coronavirus infection (10th trial edition)", identify the multi-omics biomarkers of COVID-19, and explain COVID-19-related biological pathways.

Methods: Based on DNA methylation, gene expression, and protein expression data of colorectal cancer from The Cancer Genome Atlas (TCGA), the upstream and downstream regulatory pathways across omics layers were determined by searching biological databases. Multi-omics data with realistic correlations were then generated from the correlation and variance-covariance matrices of the regulatory pairs. The optimal number of clusters, classification accuracy, and variable importance of the multi-omics integrative clustering methods were evaluated. Classification accuracy was evaluated in the following scenarios: ① sample size (50 to 500); ② number of subgroups (2 to 6); ③ shift among subgroups (0, 0.5, 1, 1.5, 2, 2.5, and 3); ④ proportion of differentially expressed variables (1%, 5%, 10%, 15%, 20%, and 25%); ⑤ level of noise (standard deviations of irrelevant variables multiplied by 0.5, 1, 2, and 3); ⑥ sample size proportion among subgroups (balanced, moderately unbalanced, and extremely unbalanced); ⑦ combination of omics data (four combinations of DNA methylation, gene expression, and protein expression). Each simulation was repeated 50 times, and the adjusted Rand index (ARI), normalized mutual information (NMI), and F1-score were calculated to evaluate the classification accuracy and robustness of the multi-omics integrative clustering methods. A cohort study of the multi-omics integrative clustering methods was conducted on publicly available blood proteomics, lipidomics, and metabolomics data from 161 patients with COVID-19 treated in five hospitals in Hubei Province from February to April 2020. The classification accuracy of the methods on the COVID-19 data was evaluated by ARI, NMI, F1-score, and the silhouette coefficient (SC). The similarity
network fusion (SNF) algorithm, which had the best classification accuracy, was employed to explore the association of the clustering subtypes with standard subtypes. Moreover, difference analysis and enrichment analysis among the SNF cluster subtypes were used to screen multi-omics biomarkers and biological pathways of COVID-19.

Results: The simulation study showed the following. ① In the evaluation of the optimal number of clusters, LRAcluster and CIMLR recovered the true number of subgroups (k=3) in every simulation. The optimal number of clusters ranged from 2 to 3 for SNF, MCIA, and intNMF, from 3 to 5 for iClusterPlus, and from 2 to 6 for PINSPlus and MoCluster. ② Across the scenarios of sample size, number of subgroups, shift among subgroups, proportion of differentially expressed variables, level of noise, sample size proportion among subgroups, and combination of omics data, the classification accuracy and robustness of the SNF algorithm were better than those of the other methods. Its ARI, NMI, and F1-score were all close to 1 except in a few special scenarios. In contrast, the classification accuracy of LRAcluster decreased markedly in the high-noise scenario (σ=3), with ARI and NMI around 0.3. The ARI and NMI of CIMLR dropped to 0.7 in the small-sample scenario (n=50). The classification accuracy of MoCluster was lower than 0.8 when there were multiple subgroups (k=5). The classification accuracy of MCIA was close to 0.75 when only two types of omics data were combined. The classification accuracy of iClusterPlus was acceptable, but its model parameters are complex and time-consuming to compute. The classification accuracy of intNMF and PINSPlus was unsatisfactory, with ARI and NMI close to 0.6 in most scenarios. ③ In the evaluation of variable importance, CIMLR performed prominently: the average numbers of differential variables among the top 10 and top 20 variables ranked by
variable importance were 9.9 and 18.8, respectively. MCIA performed comparably well, with averages of 9.8 and 18.6, respectively. In contrast, the averages were 8.5 and 16.2 for MoCluster and 5.6 and 10.4 for intNMF. In the cohort study, the results of the multi-omics integrative clustering analysis of COVID-19 were consistent with the conclusions of the simulation experiments: SNF again showed the best classification accuracy (ARI: 0.47; NMI: 0.51; F1-score: 0.78; SC: 0.04). SNF divided the 161 COVID-19 patients into three subgroups, and the SNF multi-omics clusters were highly correlated with the standard subtyping (Cramér's V coefficient: 0.89). SNF-Ⅰ (n=21) consisted mainly of critical patients (95.2%), SNF-Ⅱ (n=46) mostly of asymptomatic patients (95.7%), and SNF-Ⅲ (n=94) mainly of mild (54.3%) and severe (35.1%) patients. The differential analysis showed that genes related to neutrophil activation, regulation of inflammatory response, T cell activation, interferon production, protein polyubiquitination, and autophagy were significantly differentially expressed among the SNF subgroups (FDR q-value<0.05). Furthermore, 115 differential proteins, 34 differential lipids, and 76 differential metabolites were identified according to fold change and FDR. The enrichment analysis showed that the differential proteins were mainly involved in biological processes related to immune response, including myeloid leukocyte migration, neutrophil migration, leukocyte migration, neutrophil chemotaxis, leukocyte chemotaxis, granulocyte migration, cell chemotaxis, regulation of peptidase activity, and granulocyte chemotaxis. The differential metabolites were mainly enriched in metabolic pathways, including β-oxidation of very long-chain fatty acids, mitochondrial β-oxidation of short-chain saturated fatty acids, fatty acid biosynthesis, steroidogenesis, taurine and hypotaurine metabolism, tryptophan metabolism, α-linolenic acid and linoleic acid
metabolism, vitamin B6 metabolism, glutathione metabolism, and caffeine metabolism. Both differential proteins and differential metabolites were involved in glutathione metabolism.

Conclusions: ① Multi-omics integration methods based on sample similarity have particular advantages for highly correlated multi-omics data. SNF performed best for multi-omics integrative clustering, and CIMLR can be used to identify multi-omics biomarkers associated with subtypes. ② The classification accuracy of multi-omics integrative clustering methods is easily affected by the number of subgroups, the shift among subgroups, the proportion of differentially expressed variables, the level of noise, and the combination of omics data. Applications of these methods should pay attention to these factors. ③ COVID-19 patients can be divided into three multi-omics molecular subgroups, and the differential proteins and differential metabolites between subgroups are involved in glutathione metabolism; the glutathione metabolic pathway may be of great significance to the occurrence and development of COVID-19.
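
Two of the agreement measures reported above, the ARI for classification accuracy and Cramér's V for the association between cluster labels and reference subtypes, can both be computed from the contingency table of two label vectors. The following is a minimal, illustrative sketch in plain Python and not the study's code: the function names are hypothetical, the ARI follows the standard Hubert-Arabie form, and Cramér's V is shown in its uncorrected form, since the abstract does not state which software or variant was used.

```python
from collections import Counter
from math import comb, sqrt


def contingency(labels_a, labels_b):
    """Count how often each (a, b) label pair occurs (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples")
    table = contingency(labels_true, labels_pred)
    # Number of sample pairs placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in table.values())
    # Number of sample pairs placed together in each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def cramers_v(labels_a, labels_b):
    """Uncorrected Cramér's V from the Pearson chi-square statistic."""
    n = len(labels_a)
    table = contingency(labels_a, labels_b)
    rows, cols = Counter(labels_a), Counter(labels_b)
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            observed = table.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        # Association is undefined with a single category; report 0 here.
        return 0.0
    return sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Toy labels only; these are not data from the study.
    reference = [0, 0, 1, 1, 2, 2]
    clusters = [1, 1, 2, 2, 0, 0]  # same partition, permuted label names
    print(adjusted_rand_index(reference, clusters))
    print(cramers_v(reference, clusters))
```

Both measures ignore the names of the labels, which suits clustering output, where cluster numbering is arbitrary. The ARI can be negative when agreement is worse than chance, whereas Cramér's V lies between 0 and 1.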