
Research On Feature Selection Algorithm Based On Tumor Gene Expression Data

Posted on: 2023-08-14  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Tian  Full Text: PDF
GTID: 2544306848481374  Subject: Computer technology
Abstract/Summary:
As the second leading cause of death worldwide, cancer claims more cases and deaths every year and threatens human health and life. Research on pathogenic genes and on the different subtypes of cancer is an important measure for effective cancer prevention and control. The advent of DNA microarrays allows researchers to analyze the expression levels of thousands of genes simultaneously. One of the most important applications of microarray data analysis is cancer classification, but not all genes are related to the occurrence of cancer, and a large number of genes are irrelevant or redundant for clinical diagnosis. If all genes are used for clustering or classification of gene expression data, the accuracy of the results may therefore suffer. Analyzing microarray gene expression data means finding the informative genes and removing the redundant and irrelevant ones. However, such data suffers from the curse of dimensionality and from sparse samples, and many data mining techniques become computationally expensive on high-dimensional data. Feature selection is one way to overcome these difficulties. This thesis addresses several problems in traditional feature selection algorithms through the following research.

(1) Traditional feature selection algorithms ignore possible correlations between different features and fail to retain the local manifold information and global structure information of high-dimensional data. To solve this problem, this thesis proposes a graph-regularized low-rank representation feature selection algorithm based on a scoring function. Mutual information is used to account for the pairwise correlation of features. Low-rank representation with graph regularization is used to preserve the manifold information and structure information of the high-dimensional data. The scoring function is used to evaluate the quality of features, and the selected feature subsets are used for clustering. The algorithm is validated on UCI data sets and on gene expression data sets, and compared with existing feature selection algorithms in terms of accuracy, normalized mutual information, and convergence. The experimental results show that the feature subsets selected by the method perform well: the highest average accuracy and normalized mutual information reach 86.7% and 45.5%, respectively.

(2) Evolutionary algorithms have strong search ability and can find the optimal feature subset in the search space. Particle swarm optimization (PSO) is widely used because of its simple rules and easy implementation, but it also has defects: it tends to stagnate and fall into local optima. Feature selection requires optimizing multiple objective functions, yet traditional unimodal multi-objective optimization algorithms can provide only a limited number of Pareto optimal solutions, so some important feature subsets are missed. To address these problems, this thesis proposes a multimodal multi-objective optimization algorithm that combines star-topology and ring-topology particle swarm optimization. Combining the star and ring topologies balances global and local search and keeps the algorithm from falling into a local optimum prematurely. Multimodal multi-objective optimization yields multiple feature subsets for decision makers to choose from, and the selected feature subsets are then used for classification. The algorithm is validated on UCI data sets and on gene expression data sets. The experimental results show that PSO combining the two topologies finds more feature subsets than plain PSO in a multimodal multi-objective setting and achieves competitive classification results on different data sets.

In summary, this thesis proposes two improved feature selection algorithms. Their rationality and effectiveness are first verified on UCI data sets, and they are then applied to tumor gene expression data sets. The results demonstrate the advantages of both algorithms in processing high-dimensional data.
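
The abstract does not say how the thesis combines the two swarm topologies, so the following is only a minimal sketch of what "star" and "ring" neighbourhoods conventionally mean in PSO, not the author's algorithm. The function names `star_best` and `ring_best`, the example fitness values, and the convention that lower fitness is better are all assumptions made for illustration.

```python
from typing import List


def star_best(fitness: List[float]) -> int:
    """Star topology: every particle sees the whole swarm.

    Returns the index of the single best particle (lowest fitness,
    an assumed convention), which all particles would follow.
    """
    return min(range(len(fitness)), key=fitness.__getitem__)


def ring_best(fitness: List[float]) -> List[int]:
    """Ring topology: each particle sees only itself and its two
    immediate neighbours on a ring (indices wrap around).

    Returns, for each particle, the index of the best particle in
    its own neighbourhood.
    """
    n = len(fitness)
    best = []
    for i in range(n):
        neighbours = [(i - 1) % n, i, (i + 1) % n]
        best.append(min(neighbours, key=fitness.__getitem__))
    return best


# Hypothetical fitness values for a swarm of six particles.
fitness = [0.9, 0.2, 0.7, 0.5, 0.8, 0.4]
print(star_best(fitness))  # one leader for the whole swarm
print(ring_best(fitness))  # one leader per neighbourhood
```

With these values the star topology points every particle at particle 1, while the ring topology keeps three distinct local leaders (particles 1, 3, and 5). This illustrates why ring neighbourhoods tend to preserve several candidate solutions at once, while the star topology drives the swarm toward a single global best.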
Keywords/Search Tags:Gene Expression Data, Feature Selection, Low-rank Representation, Multimodal Multi-objective Optimization, Particle Swarm Optimization