Classification is one of the most important tasks in the machine learning community. Selecting a suitable classifier for a specific classification problem is challenging because different classifiers favor different types of data sets. A feasible approach is automatic classifier selection (CS), namely recommending classifiers for a new data set based on its similarity to other data sets. The most critical stage of CS is feature extraction, since it determines how accurately the similarity between data sets is measured. However, data sets usually differ in the number of samples, dimensions, classes, and types of attributes. Existing data set features describe the similarity between data sets from various aspects, such as statistics and geometry, but they are not connected to the performance of classifiers, which may lead to unsatisfactory recommendations. Moreover, these features lack theoretical support. Therefore, this dissertation aims to extract effective data set features and to investigate the problem of CS theoretically.

This dissertation is divided into four sections. The first section introduces the research background and related work; the second and third sections present our work; the last section gives conclusions and an outlook.

The second section proposes a Euclidean-geometry-preserving feature to address the problem of feature inaccuracy. This feature combines the inner-product matrix and the class labels of a data set, which together describe the distribution of the data points and the decision boundary geometrically. We assume that the difficulty of a classification task is determined jointly by the geometric distribution of the data and the decision boundary. Computing the similarity between such features is a graph-matching problem, which is NP-hard. Exploiting the structure of the inner-product matrix, we propose a novel algorithm that is more efficient than classical graph-matching algorithms. We theoretically analyze the relationship between the similarity of data
set features and the performance of classifiers, which justifies the rationality of our method. By generalizing our feature to kernel spaces, we can measure the local geometric and nonlinear structure of data sets. Finally, we conducted experiments on both artificial data sets and data sets from real-world scenarios; our feature outperforms the compared data set features.

The third section proposes a data set feature based on classification-problem complexity, to tackle the high computational cost and inaccuracy of existing data set features. Classification complexity is a measure of how hard a classification task is. We believe that if the measured complexity is connected to the performance of classifiers, then data sets with the same complexity should share the same optimal classifier. We first propose five geometry- and statistics-based metrics that characterize the complexity of a data set from different aspects, and then unify these metrics into a data set feature. We also prove theoretically that two of the metrics are connected to upper bounds on the generalization error of certain classifiers, which ensures the effectiveness of our feature. Low computational cost is another advantage of our feature: compared with existing features, ours takes less time to compute, improving efficiency in real applications. Additionally, the feature can be extended to arbitrary kernel spaces to characterize the nonlinear structure of data. Finally, we conduct experiments on both artificial data sets and data sets from real-world scenarios; our feature outperforms the four existing data set features.
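The abstract does not specify the five complexity metrics themselves. As a minimal sketch of the kind of geometry-based complexity measure discussed here, the following computes the classical Fisher discriminant ratio (F1) of Ho and Basu, which scores how separable two classes are along the best single dimension; the function name and the restriction to binary problems are illustrative assumptions, not the dissertation's actual metrics:

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Fisher's discriminant ratio (F1): the maximum, over feature
    dimensions, of between-class scatter divided by within-class
    scatter. Large values indicate an easy (well-separated) problem."""
    classes = np.unique(y)
    assert len(classes) == 2, "this sketch assumes a binary problem"
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    var0, var1 = X0.var(axis=0), X1.var(axis=0)
    # Per-dimension ratio of squared mean difference to summed
    # within-class variance; a small epsilon guards against
    # division by zero for constant dimensions.
    ratio = (mu0 - mu1) ** 2 / (var0 + var1 + 1e-12)
    return ratio.max()
```

A vector of several such metrics, computed on each data set, could then serve as a fixed-length data set feature regardless of the number of samples or dimensions, which is the role the complexity-based feature plays in the third section.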