| Cancer is a disease caused by the loss of normal cell regulation and by abnormal cell growth. The initiation, development, metastasis, and progression of cancer involve complex biological processes. Cancers can be divided into many types according to the role of cancer cells in the body, and the same cancer can be further divided into subtypes according to molecular markers and the clinical manifestations of patients. Research on cancer subtypes not only deepens the understanding of cancer but also provides patients with more precise treatment options. Studying cancer subtypes with a single data type, such as gene expression, often fails to capture the full complexity of the molecular phenotype of cancer. With the development of high-throughput sequencing technology, large amounts of genomic, transcriptomic, epigenomic, and proteomic data have been generated. Cancer subtyping through multi-omics integration analysis can combine multiple types of molecular features and significantly improve subtyping performance. However, because of the complexity of biological systems and the heterogeneity among omics data types, multi-omics integration analysis remains a difficult task. Many multi-omics data integration methods have been applied to cancer subtyping. Some of them make strong assumptions about the distribution of the data. For example, the iCluster method assumes that continuous data follow a Gaussian distribution; because of the characteristics of the data and measurement errors, real data may not satisfy this assumption, which affects the correctness of the results. Other data integration methods ignore differences among the features within a single omics data type. For example, the CIMLR method treats each data type as a whole when computing the kernel matrix and determining the weight coefficient, even though different features may warrant different weight coefficients. Therefore, this paper proposes entropy regularization multi-kernel k-means based 
on feature grouping for cancer subtyping. The method makes no assumptions about the distribution of the data and takes into account the differences among the features within a single omics data type. It consists of three main steps: (1) the NMF algorithm is used to group the features of each data type; (2) the Gaussian radial basis kernel function is used to compute a kernel matrix for each feature group; (3) the improved entropy regularization multi-kernel k-means method integrates and clusters these kernel matrices to obtain the final cancer subtyping result. To verify the effectiveness of the entropy regularization multi-kernel k-means method, this paper uses the simulated data set constructed in the original multi-kernel k-means article and compares the improved method with the traditional multi-kernel k-means method. Experiments show that, compared with the traditional method, the improved method significantly improves clustering indexes such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Subsequently, three different types of simulated data were constructed to verify, in terms of clustering accuracy and robustness, the improvement brought by the entropy regularization multi-kernel k-means algorithm based on feature grouping. The experimental results show that, once feature grouping is introduced, the method achieves good ARI and NMI values. Finally, gene expression, miRNA expression, and DNA methylation data for Breast, AML, GBM, Colon, and Liver cancers were selected from the TCGA database; the method was used to integrate the data and perform cancer subtyping for each cancer, and it was compared with the existing cancer subtyping methods SNF, iClusterBayes, and CIMLR. For the Breast cancer data, which have gold-standard labels, clustering indexes such as ARI and NMI were calculated. For the other four cancer data sets, which have no gold standard, survival analysis was performed: KM survival curves were drawn and Cox log-rank test p-values were calculated to evaluate whether clinical survival time differs significantly among the cancer subtypes. Box plots were then drawn for the cancer subtypes, and Kruskal-Wallis test statistics were calculated to evaluate the significance of the differences in gene expression, miRNA expression, and DNA methylation among the subtypes. The experimental results show that the proposed method outperforms the existing cancer subtyping methods on these indexes across the five cancer data sets, which indicates that the entropy regularization multi-kernel k-means method based on feature grouping proposed in this paper performs better in cancer subtyping. |
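
Steps (2) and (3) above can be illustrated with a minimal sketch. The abstract does not state the objective function or the update rules, so everything here is an assumption made for illustration: the entropy term is taken in its common form (the sum of `w_p * log(w_p)` over kernels, scaled by a hypothetical parameter `gamma`), which gives a closed-form softmax update for the kernel weights. The function names, the bandwidth `sigma`, the toy data, and the fixed cluster labels are all hypothetical. The NMF grouping of step (1) and the alternating cluster update are omitted, and the two feature groups are given by hand.

```python
import math

def rbf_kernel(rows, sigma):
    """Gaussian RBF kernel: K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(rows)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            K[i][j] = K[j][i] = math.exp(-sq / (2.0 * sigma ** 2))
    return K

def kernel_cost(K, labels):
    """Kernel k-means within-cluster cost for one kernel:
    trace(K) - sum over clusters c of (1/|c|) * sum_{i,j in c} K[i][j]."""
    n = len(K)
    cost = sum(K[i][i] for i in range(n))
    for c in set(labels):
        members = [i for i in range(n) if labels[i] == c]
        cost -= sum(K[i][j] for i in members for j in members) / len(members)
    return cost

def entropy_weights(costs, gamma):
    """Minimise sum_p w_p*cost_p + gamma*sum_p w_p*log(w_p) with sum_p w_p = 1.
    The minimiser is the softmax of -cost/gamma (assumed form, see lead-in)."""
    shift = min(costs)  # subtract the minimum for numerical stability
    exps = [math.exp(-(c - shift) / gamma) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: four samples, two hand-made feature groups (hypothetical values).
group_a = [[0.0], [0.1], [5.0], [5.1]]  # separates the two clusters
group_b = [[0.0], [5.0], [0.1], [5.1]]  # uninformative for these labels
labels = [0, 0, 1, 1]                   # a fixed candidate clustering

kernels = [rbf_kernel(g, sigma=1.0) for g in (group_a, group_b)]
costs = [kernel_cost(K, labels) for K in kernels]
weights = entropy_weights(costs, gamma=1.0)
print(weights)  # the informative group receives the larger weight
```

In this sketch a smaller `gamma` concentrates the weight on the lowest-cost kernel, while a larger `gamma` pushes the weights toward uniform. A full method would alternate this weight update with a cluster-assignment update on the weighted combination of kernels.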