Malignant tumors, commonly called cancer, have become one of the major diseases threatening human health and safety in recent years. Because the occurrence of cancer is often accompanied by the misexpression of normal genes and by gene mutations, researchers can determine whether a patient has cancer by examining changes in gene expression across the gene expression profile. As an informative measure of gene activity, gene expression data has become a key resource for cancer research. A gene expression data set usually contains only a few hundred samples, whereas the number of genes can reach thousands, tens of thousands or even more, and only a very small number of these genes are pathogenic genes related to cancer. Such data are therefore typically small-sample, high-dimensional and highly redundant. It is necessary to reduce the dimension of gene expression data with a machine learning algorithm in advance, in order to obtain useful discriminative information for the subsequent tasks of characteristic gene selection, cancer classification and cluster analysis. Some methods based on matrix decomposition (such as PCA and LRR) have been proposed and applied to extract features from high-dimensional and highly redundant data. However, as data complexity increases, the shortcomings of these traditional methods prevent them from obtaining satisfactory results.

(1) We proposed a new PCA-based method called robust Laplacian supervised discriminative sparse principal component analysis (RLSDSPCA). At present, the great majority of PCA-based methods share a limitation: they do not combine robustness to outliers and noise, label information, sparsity and the capture of local geometrical structure in one objective function. To overcome this drawback, RLSDSPCA enforces the L2,1 norm on the error function and incorporates the graph Laplacian manifold into supervised discriminative sparse PCA. To evaluate the efficacy of the proposed RLSDSPCA, we applied it to characteristic gene selection and tumor classification problems on gene expression data. The experimental results demonstrate that the proposed RLSDSPCA achieved the best performance.

(2) We proposed a new LRR-based method called block diagonal low rank representation based on Huber loss and ordinal locality (HOBLRR). At present, the graph regularization term of most graph-regularized LRR methods considers only the local geometric structure of the original data and ignores ordinal locality. HOBLRR applies the Huber loss to the error function of LRR to achieve robustness to noise and outliers, and introduces the preservation of both local geometry and ordinal locality into the graph regularization. In addition, a block diagonal regularizer is imposed on the low rank representation matrix so that a block diagonal structure is pursued directly. We applied the proposed method to the clustering of simulation data, and to characteristic gene selection and cancer sample clustering on gene expression data. The final experimental results show that HOBLRR achieved the best performance.

(3) To make these methods easier for other gene expression data researchers to use, we developed an online web server based on the Spring MVC framework that provides prediction services for cancer sample classification based on gene expression data.
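The two robust error measures named above can be illustrated with a minimal sketch. This is not the RLSDSPCA or HOBLRR objective itself; it only compares how a squared Frobenius penalty, an L2,1 norm and a Huber loss score the same residual matrix. The function names, the row-wise convention for the L2,1 norm, and the threshold `delta` are assumptions made for illustration.

```python
import numpy as np

def squared_frobenius(E):
    """Sum of squared entries: large residuals are penalised quadratically."""
    return float(np.sum(E ** 2))

def l21_norm(E):
    """L2,1 norm, taken here (by assumption) as the sum of the Euclidean
    norms of the rows of E, so a corrupted row is penalised linearly."""
    return float(np.sum(np.linalg.norm(E, axis=1)))

def huber_loss(E, delta=1.0):
    """Element-wise Huber loss summed over E: quadratic for |e| <= delta,
    linear beyond it. `delta` is a hypothetical threshold."""
    a = np.abs(E)
    quadratic = 0.5 * a ** 2
    linear = delta * (a - 0.5 * delta)
    return float(np.sum(np.where(a <= delta, quadratic, linear)))

# Toy residual matrix: one row carries a large (outlier-like) error.
E = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 0.0]])

print(squared_frobenius(E), l21_norm(E), huber_loss(E, delta=1.0))
```

For this toy residual the squared Frobenius penalty is 25, while the L2,1 norm is 5 and the Huber loss (with `delta` = 1) is 6, which shows why the latter two are less dominated by a single corrupted sample.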