Dimensionality Reduction Based On Feature Selection

Posted on: 2016-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: X P Wen
Full Text: PDF
GTID: 2347330479954424
Subject: Applied Statistics
Abstract/Summary:
Feature selection is the most important method of dimensionality reduction; its counterpart is feature generation (or extraction). Together, the two approaches cover the most commonly used dimensionality reduction techniques. Dimensionality reduction is itself a key problem in many theoretical and applied fields, such as statistics, data mining, and pattern recognition. Feature selection reduces the time complexity of data processing and the space complexity of data storage, and it also improves the accuracy, robustness, and generalization ability of the learning model. In this thesis, feature selection methods are classified and described according to the different mechanisms of supervised and unsupervised learning. Several efficient algorithms are designed based on mutual information, which is one of the most significant concepts in information theory. The main topics are as follows:
(1) In the supervised case, we use mutual information as a tool to design the Parzen Window feature selection (PWFS) and maximal relevance and minimal redundancy feature selection (MRMR) algorithms.
(2) In the unsupervised case, we design a novel feature selection algorithm that clusters the features using neighborhood mutual information as the similarity measure. This algorithm can be applied directly to datasets that mix numerical (continuous) and categorical features, without discretization or quantization.
(3) By applying neighborhood mutual information to PWFS and MRMR, we obtain new supervised algorithms that can be applied directly to mixed data.
(4) The algorithms are tested and compared on datasets downloaded from the University of California, Irvine (UCI) Machine Learning Repository. We also apply these feature selection algorithms to a real-world dataset on the economic strength of China's 31 areas, taken from the China Statistical Abstract 2013.
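To illustrate the maximal relevance and minimal redundancy idea mentioned above, the following is a minimal Python sketch under stated assumptions, not the thesis's implementation. It assumes every feature and the target are discrete, estimates mutual information from empirical counts (the thesis instead uses Parzen-window and neighborhood estimates for continuous and mixed data), and scores candidates by the common "relevance minus mean redundancy" criterion. The function names `mutual_information` and `mrmr_select`, and the toy data, are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log(p(a,b) / (p(a) p(b))), written in terms of raw counts.
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedy mRMR sketch. `features` maps a name to a list of discrete values."""
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected = []
    remaining = list(features)

    def score(name):
        # Relevance to the target minus mean redundancy with chosen features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed for illustration): the target is determined by two
# independent bits u and v; "u_copy" duplicates u and so adds no information.
u = [0, 0, 1, 1, 0, 0, 1, 1]
v = [0, 1, 0, 1, 0, 1, 0, 1]
target = [2 * a + b for a, b in zip(u, v)]
chosen = mrmr_select({"u": u, "u_copy": list(u), "v": v}, target, k=2)
print(chosen)
```

In this toy example all three features are equally relevant to the target, but once `u` is chosen the redundancy term cancels the score of `u_copy`, so the second pick is `v`. A relevance-only ranking would not distinguish the duplicate from the complementary feature.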
Keywords/Search Tags: feature selection, mutual information, maximal relevance and minimal redundancy, supervised learning, unsupervised learning, clustering