| With the continuing development of information technology and the steady improvement of data acquisition capabilities, people increasingly need to analyze and process high-dimensional data of many kinds, such as large-scale web data, remote sensing images, and microarray data. Such data usually cause the computational cost of machine learning algorithms to grow exponentially, a problem known as “the curse of dimensionality”. Feature selection for high-dimensional data has therefore become an important topic in the field of data mining. Feature selection reduces data from a high-dimensional space to a low-dimensional space that better reflects the essential meaning of the data objects, and at the same time improves the efficiency of data analysis and processing. This paper studies the theory and practical application of feature selection for high-dimensional data, using microarray data as the experimental data. A new feature selection algorithm based on feature similarity is proposed. First, a normalized signal-to-noise ratio algorithm is used to remove irrelevant features. The remaining features are then grouped into clusters, and clusters that contain only a few features are removed as noise features. After this removal, k clusters remain; feature redundancy is high within each cluster and low between clusters. Finally, each feature in each cluster is evaluated in turn against the evaluation criterion proposed in this paper to decide whether it should be removed. The features that remain are then combined and sorted by their individual classification ability. Experiments confirm that the algorithm is effective at removing irrelevant, noise, and redundant features. A second algorithm is obtained by analyzing the advantages and disadvantages of the similarity-based feature selection algorithm and the Top-r feature selection algorithm and combining the two so that each compensates for the other's weaknesses; it takes classification performance fully into account while maintaining high execution efficiency. First, the feature set is pruned to obtain a feature subset with few irrelevant and redundant features. Then features in the same cluster are assigned to different blocks, and features from different clusters are assigned to the same block. Finally, the feature blocks are processed with the Top-r algorithm to choose the optimal feature subset. Experiments confirm that the new algorithm selects a superior feature subset while maintaining high execution efficiency, which supports its advantages. |
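
The two filtering ideas described above (dropping irrelevant features by signal-to-noise ratio, then dropping features that are redundant with ones already kept) can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes a two-class problem and uses the commonly cited form SNR = |mean1 - mean2| / (std1 + std2), which may differ from the paper's normalized variant. It also replaces the cluster-based redundancy evaluation with a simple greedy check on absolute Pearson correlation. The function names and the thresholds `snr_min` and `sim_max` are hypothetical.

```python
from statistics import mean, pstdev


def snr_scores(X, y):
    """Per-feature signal-to-noise ratio for a two-class data set.

    X is a list of samples (rows), y the list of class labels.
    Assumed form: |mean_a - mean_b| / (std_a + std_b).
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == classes[0]]
        b = [row[j] for row, label in zip(X, y) if label == classes[1]]
        diff = abs(mean(a) - mean(b))
        denom = pstdev(a) + pstdev(b)
        if denom > 0:
            scores.append(diff / denom)
        else:
            # Zero spread in both classes: perfectly separating or constant.
            scores.append(float("inf") if diff > 0 else 0.0)
    return scores


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def select_features(X, y, snr_min=0.5, sim_max=0.9):
    """Return indices of kept features, ordered by decreasing SNR."""
    scores = snr_scores(X, y)
    # Step 1: discard features with low SNR (treated as irrelevant).
    ranked = sorted(
        (j for j, s in enumerate(scores) if s >= snr_min),
        key=lambda j: scores[j],
        reverse=True,
    )
    # Step 2: greedy redundancy filter, a stand-in for the cluster-based
    # evaluation: skip a feature that is too similar to one already kept.
    kept = []
    for j in ranked:
        col_j = [row[j] for row in X]
        if all(abs(pearson(col_j, [row[k] for row in X])) < sim_max for k in kept):
            kept.append(j)
    return kept


# Toy data: feature 1 nearly duplicates feature 0, feature 2 is irrelevant.
X = [
    [1.0, 1.1, 5.0, 2.0],
    [1.2, 1.3, 1.0, 4.0],
    [0.8, 0.7, 3.0, 3.0],
    [3.0, 3.1, 4.0, 5.0],
    [3.2, 3.3, 2.0, 4.0],
    [2.8, 2.7, 3.0, 6.0],
]
y = [0, 0, 0, 1, 1, 1]
selected = select_features(X, y)
print(selected)
```

Ranking by SNR before the redundancy check means that, of two highly similar features, the one with the stronger individual class separation is the one retained.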