
Semi-supervised Clustering Based On Constraint Selection For Gene Expression Data Set

Posted on: 2023-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: M H Zhao
Full Text: PDF
GTID: 2530306827470184
Subject: Control Science and Engineering
Abstract/Summary:
Gene expression data contain abundant information about gene activity, and analyzing the patterns implicit in these data is of great significance for understanding and inferring gene function and for studying gene regulation mechanisms. With the development of DNA microarray technology, a large amount of gene expression data has been generated, and how to analyze and understand such massive data effectively has become an important challenge in bioinformatics. Clustering is an important unsupervised data mining method that can help discover co-expressed genes and infer the function of unknown genes. Integrating prior information into the clustering process can effectively improve clustering performance. Compared with class labels, pairwise constraints are easier to obtain and are therefore more widely used. Existing semi-supervised algorithms based on pairwise constraints either generate the constraints directly from known label information or mine them from the characteristics of the data. In practice, gene expression data sets are usually unlabeled, and automatically mined pairwise constraints inevitably include noisy constraints, that is, constraints that do not match the true cluster structure, which seriously degrades the performance of semi-supervised clustering on gene expression data. To solve this problem, this thesis proposes two multi-objective semi-supervised clustering algorithms for gene expression data.

(1) To eliminate noisy constraints and apply only effective pairwise constraints in semi-supervised clustering, this thesis proposes a multi-objective semi-supervised clustering algorithm for gene expression data based on constraint selection. The algorithm first obtains an initial pairwise constraint set using a density tracking method and then introduces it into an objective function with a constraint violation penalty term to achieve semi-supervised clustering. To optimize the clustering solution and the constraint selection jointly under the NSGA-II framework, a hybrid encoding of constraint selection and cluster centers is proposed. During multi-objective evolution, the pairwise constraints suited to clustering are selected, so that the supervision information and the clustering result are optimized together, which effectively improves clustering performance on gene expression data.

(2) Integrating biological knowledge about genes into semi-supervised clustering of gene expression data helps to eliminate noisy constraints further. This thesis therefore proposes a multi-objective semi-supervised clustering algorithm for gene expression data that integrates the Gene Ontology. First, the algorithm obtains the functional similarity of genes from the Gene Ontology and generates a Gene Ontology constraint set. Second, the weight of the constraint violation penalty term in the semi-supervised clustering algorithm is refined by considering the pairwise constraint information from both the gene expression data and the Gene Ontology. Finally, the proposed hybrid encoding is used to optimize constraint selection and cluster centers jointly.

Experimental results on multiple gene expression data sets show that, by integrating biological information, the proposed algorithm can further refine the pairwise constraints drawn from the initial constraint set and obtain more accurate and more biologically significant clustering results.
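The hybrid encoding and the penalized objective described in (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the chromosome layout (flattened cluster centers followed by one selection bit per constraint), the function names `decode` and `penalized_objective`, the uniform penalty weight `w`, and the choice of squared Euclidean compactness are all assumptions made for illustration. The abstract does not specify the objectives optimized under NSGA-II, and the evolutionary loop itself is omitted.

```python
import numpy as np

def decode(chromosome, k, d, m):
    """Split a hybrid chromosome into k cluster centers (d-dimensional)
    and an m-bit constraint-selection mask (hypothetical layout)."""
    genes = np.asarray(chromosome, dtype=float)
    centers = genes[:k * d].reshape(k, d)
    mask = genes[k * d:k * d + m] > 0.5  # True means the constraint is selected
    return centers, mask

def penalized_objective(X, centers, mask, must_link, cannot_link, w=1.0):
    """Within-cluster squared distance plus w times the number of
    violated constraints among those selected by the mask."""
    # Squared Euclidean distance from every sample to every center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    compactness = dists[np.arange(len(X)), labels].sum()

    # Mask bits are ordered: all must-link pairs, then all cannot-link pairs.
    constraints = [(i, j, True) for i, j in must_link]
    constraints += [(i, j, False) for i, j in cannot_link]

    violations = 0
    for selected, (i, j, is_must) in zip(mask, constraints):
        if not selected:
            continue  # unselected constraints contribute no penalty
        same_cluster = bool(labels[i] == labels[j])
        if same_cluster != is_must:
            violations += 1
    return compactness + w * violations, labels, violations
```

In this sketch, clearing the mask bit of a noisy constraint removes its penalty from the objective, which is the sense in which constraint selection and cluster centers are optimized together.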
Keywords/Search Tags: Gene Expression Data, Semi-supervised Clustering, Pairwise Constraint, Multi-objective Clustering, Gene Semantic Similarity