Research On Multi-objective Clustering Method Of Incomplete Gene Expression Data

Posted on: 2022-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q Z Chang
GTID: 2480306509479914
Subject: Control Science and Engineering
Abstract/Summary:
With the development of high-throughput DNA microarray technology, a large volume of gene-related data has been generated. The huge number of genes and the complexity of biological networks make these data challenging to understand and interpret. As an important data analysis method, clustering is often used to analyze gene expression data. Revealing the patterns hidden in gene expression data through clustering, and deriving from them the physiological state of cells, gene expression regulation information, and gene function, is of great significance to functional genomics. During the acquisition of gene expression data, factors such as equipment, experimental environment, and collection method inevitably leave many data sets with missing values, and the imputation accuracy affects the final clustering result to a certain extent. Existing clustering algorithms for incomplete gene expression data are usually "two-stage" algorithms: missing value imputation is performed as a data preprocessing step, and clustering is then performed on the filled data set. The "two-stage" approach is common and simple, but it ignores the fact that missing value imputation and clustering both rely on exploiting correlation information within the data, and it prevents the two learning processes from informing each other. To address this problem, this thesis proposes multi-objective clustering algorithms for incomplete gene expression data that integrate missing value imputation and clustering.

(1) From the perspective of improving the accuracy of missing expression value imputation, this thesis proposes a multi-objective clustering algorithm for incomplete gene expression data based on nearest neighbor intervals. First, the algorithm uses the nearest neighbor rule to determine the nearest neighbor interval of each missing expression value, which limits the search for that value to a reasonable range. Constrained by these intervals, the algorithm uses mixed encoding to combine missing value imputation and cluster center optimization within NSGA-II, so that imputation and clustering evolve collaboratively. Searching for the optimized estimate within the corresponding nearest neighbor interval to restore the incomplete gene expression data set keeps inappropriate information from misleading the clustering. Compared with the "two-stage" clustering method, which carries out imputation and clustering separately, the proposed collaborative optimization algorithm uses the clustering result as an imputation factor on top of the information in the data set, so that it can effectively estimate the missing expression values and improve the clustering of incomplete gene expression data.

(2) Integrating gene domain knowledge into the analysis of incomplete gene expression data helps to improve the accuracy of imputing missing expression values. In view of this, this thesis proposes a multi-objective clustering algorithm for incomplete gene expression data based on functional neighbor intervals. The algorithm first uses the Gene Ontology database to obtain the semantic similarity of genes, then fuses this semantic similarity with the similarity of expression values to determine the functional neighbor interval of each missing expression value, which yields a search range for the missing value that is closer to the biological network. On this basis, mixed encoding is used to realize the collaborative evolution of missing expression value imputation and cluster centers, improving both imputation accuracy and clustering performance.

Experimental results on multiple gene expression data sets show that the proposed algorithm obtains imputation results closer to the true expression values and more compact clusters than an algorithm based solely on the structural information of the data set, and that the clustering results are biologically significant.
Keywords/Search Tags:Gene Expression Data, Missing Value, Multi-objective Clustering, Nearest Neighbor Rule, Gene Semantic Similarity
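
The nearest neighbor interval and the mixed encoding described in part (1) of the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and everything in it is an assumption made for illustration: the function names `partial_distance`, `nearest_neighbor_intervals`, and `decode`, the neighbor count `q`, the use of a partial Euclidean distance over the attributes observed in both genes, and the layout of the chromosome (flattened cluster centers first, one value per missing entry after). The NSGA-II search itself is omitted; the sketch only shows how a candidate solution might be bounded and decoded.

```python
import numpy as np

def partial_distance(a, b):
    """Euclidean distance over attributes observed in both rows,
    rescaled to the full row length (an assumed choice of distance)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf
    diff = a[shared] - b[shared]
    return np.sqrt(a.size / shared.sum() * np.sum(diff ** 2))

def nearest_neighbor_intervals(X, q=2):
    """Return {(row, col): (low, high)} for every missing entry of X."""
    intervals = {}
    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate neighbors: other genes with column j observed.
        candidates = [k for k in range(X.shape[0])
                      if k != i and not np.isnan(X[k, j])]
        if not candidates:
            raise ValueError(f"column {j} has no observed value")
        candidates.sort(key=lambda k: partial_distance(X[i], X[k]))
        values = X[candidates[:q], j]
        intervals[(int(i), int(j))] = (float(values.min()), float(values.max()))
    return intervals

def decode(chromosome, X, n_clusters, intervals):
    """Split a mixed-encoded chromosome into cluster centers and a filled
    data set, clipping each imputed value to its neighbor interval."""
    n_center_genes = n_clusters * X.shape[1]
    centers = chromosome[:n_center_genes].reshape(n_clusters, X.shape[1])
    filled = X.copy()
    imputed = chromosome[n_center_genes:]
    for value, ((i, j), (low, high)) in zip(imputed, intervals.items()):
        filled[i, j] = np.clip(value, low, high)
    return centers, filled

# Toy data: 4 genes, 3 conditions, one missing expression value.
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.1],
              [0.9, 2.2, 2.9],
              [5.0, 6.0, 7.0]])
intervals = nearest_neighbor_intervals(X, q=2)

# Chromosome: 2 centers x 3 conditions, then 1 candidate for the missing entry.
chromosome = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 9.9])
centers, filled = decode(chromosome, X, n_clusters=2, intervals=intervals)
```

In this toy case the two nearest neighbors of the incomplete gene are rows 0 and 2, so the missing value is searched in the interval [2.0, 2.2], and an out-of-range candidate such as 9.9 is clipped back to the upper bound. In a full implementation, a multi-objective optimizer would evolve many such chromosomes and score the decoded centers and filled data with clustering objectives.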