
Research On Gene Expression Data Based On MDS And Semi-supervised Clustering

Posted on: 2022-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: D Chen
Full Text: PDF
GTID: 2480306752993329
Subject: Computer Software and Application of Computer
Abstract/Summary:
With the advancement and application of DNA microarray technology, gene expression data can be collected more easily and in larger volumes. How to extract biologically meaningful knowledge from these data is one of the main research directions in bioinformatics. The analysis of gene expression data has therefore become a research focus, and it is of great significance for understanding genome activity and for disease diagnosis. Clustering, an analysis method commonly used in data mining, has been widely applied to gene expression data.

At present, the major problems in interpreting gene expression data are as follows: (1) Gene expression data are high-dimensional. Applying traditional clustering methods directly to such data easily runs into the "curse of dimensionality" and leads to a large number of algorithm iterations; (2) Most of the clustering methods in use are traditional unsupervised methods, which consider only the mathematical characteristics of the gene expression data and ignore the biological information available about the genes.

To address these problems, this thesis takes the following measures: (1) An effective MDS dimensionality reduction method is selected to reduce the dimensionality of the gene expression data and shorten the data processing time; (2) Prior knowledge is added to the traditional unsupervised clustering method to form a semi-supervised clustering method, which is applied to the analysis of gene expression data to improve clustering efficiency.

The major work and innovations of this thesis are as follows: (1) Building on MDS, the E-MDS dimensionality reduction method is formed by setting thresholds, in order to improve the effectiveness of dimensionality reduction. Together with PCA and LDA, it is used to reduce and visualize the Iris data set from UCI, and the performance is compared to verify the improvement. (2) The research focuses on semi-supervised cluster analysis of gene expression data, with two new algorithms: (a) Based on the commonly used Cop-Kmeans semi-supervised clustering algorithm, the DS-CK algorithm is formed by improving the selection of the initial cluster centroids, which are chosen by random selection over segmented data; (b) Based on the BFS+Cop-Kmeans algorithm, the DC-BCK algorithm is formed by improving the selection of the initial cluster centroids, which are chosen by data segmentation and mean calculation. Both improvements aim to reduce the number of algorithm iterations and improve clustering efficiency. (3) The improved E-MDS dimensionality reduction method is combined with the DS-CK and DC-BCK algorithms, and experiments are carried out on UCI data sets to verify the feasibility of the combined algorithms. The combined algorithms are then applied to the public gene expression data sets GSE189010 and GSE8187 for clustering analysis and compared, in terms of clustering efficiency, with the traditional K-means algorithm and with the semi-supervised Cop-Kmeans and BFS+Cop-Kmeans algorithms. The experimental results show that the improved algorithms achieve better clustering efficiency.
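For readers unfamiliar with the Cop-Kmeans baseline named in the abstract, the following is a minimal sketch of the commonly described algorithm: K-means in which each point is assigned to the nearest centroid that does not break a must-link or cannot-link constraint. It is not the thesis's DS-CK or DC-BCK method. In particular, the initial centroids here are drawn uniformly at random, whereas the thesis replaces that step with a data-segmentation scheme whose details are not given in the abstract. The function names, parameters, and the index-pair format for constraints are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def violates(i, c, labels, must_link, cannot_link):
    """Return True if putting point i into cluster c breaks a constraint.

    labels holds the assignments made so far in this pass (None = unassigned).
    Constraints are pairs of point indices (an assumed, illustrative format).
    """
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner already sits in a different cluster.
        if other is not None and labels[other] not in (None, c):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already sits in this cluster.
        if other is not None and labels[other] == c:
            return True
    return False


def cop_kmeans(points, k, must_link=(), cannot_link=(), max_iter=100, seed=0):
    """Constrained K-means sketch with random initial centroids (assumption)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest centroid.
            order = sorted(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            else:
                raise ValueError("no feasible cluster for point %d" % i)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

A call such as `cop_kmeans(points, 2, must_link=[(0, 1)], cannot_link=[(0, 3)])` returns one cluster label per point plus the final centroids. Because the number of passes depends on where the centroids start, the initial selection is a natural target for the kind of iteration-reducing improvement the abstract describes.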
Keywords/Search Tags: Gene expression data, MDS, Semi-supervised clustering, Cop-Kmeans