
Feature Selection for High-Dimensional Data Based on MIC and Its Application

Posted on: 2020-03-30  Degree: Master  Type: Thesis
Country: China  Candidate: C S Guo  Full Text: PDF
GTID: 2417330578973080  Subject: Applied Statistics
Abstract/Summary:
With the advent of the big data era, the "curse of dimensionality" posed by high-dimensional data has attracted growing attention. High-dimensional data sets typically contain hundreds of features along with a large amount of irrelevant and redundant information, and arise in natural language processing, bioengineering, medicine, finance, face recognition, and other fields. Redundant features hamper subsequent analysis: they can undermine the credibility of the final results or even lead to incorrect conclusions. Feature selection for high-dimensional data has therefore become a research focus for scholars at home and abroad and is widely applied across many fields.

This thesis uses the maximal information coefficient (MIC) for feature selection on high-dimensional data. MIC, proposed by David N. Reshef et al. of Harvard University in 2011, measures the degree of dependence between two variables and can be computed directly from observed data. Traditional feature selection criteria (such as AIC and BIC) require a model to be specified first; the feature sets selected under different models differ considerably, and the interpretability of the resulting model is poor. This thesis proves important properties of MIC theoretically: MIC-based feature selection is independent of the model chosen afterwards, so no matter which model is used in subsequent modeling, or how the selected features are used, features with genuine dependence on the response will, in theory, not be lost. This property reflects the stability of the feature selection.

A random forest model is then built to check whether the selected features are appropriate. Instead of traditional k-fold cross-validation, blocked 3×2 cross-validation is used to split the data into training and test sets, yielding six random forest models fitted on training sets and evaluated for classification performance on the corresponding test sets. The out-of-bag (OOB) error serves as the criterion for tuning the random forest, and the number of decision trees k and the number of candidate features per tree node p are tuned jointly as a parameter pair, rather than fixing one parameter and tuning the other as is traditionally done. Finally, the six groups of experimental results are averaged and compared with the model performance obtained without feature selection. Classification accuracy rises from 67% to 82.5%, and the F1-measure from 65.26% to 80.73%, fully demonstrating the effectiveness of using the maximal information coefficient to select features from high-dimensional data.
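The MIC-based filtering described above can be illustrated with a minimal sketch. The true MIC of Reshef et al. maximizes normalized mutual information over all grid partitions with at most n^0.6 cells; the simplified `mic_score` below (a hypothetical name, not from the thesis) only varies the resolution of equal-width grids, which is enough to show how a nonlinearly relevant feature is ranked above noise features:

```python
import numpy as np

def mic_score(x, y, max_bins=8):
    """Simplified MIC-style score: maximal normalized mutual information
    over equal-width nx-by-ny grids. (The true MIC also optimizes the
    placement of grid lines; this sketch only varies grid resolution.)"""
    n = len(x)
    best = 0.0
    for nx in range(2, max_bins + 1):
        for ny in range(2, max_bins + 1):
            if nx * ny > n ** 0.6:          # MIC's grid-size bound B(n) = n^0.6
                continue
            # empirical joint distribution on the grid
            pxy, _, _ = np.histogram2d(x, y, bins=(nx, ny))
            pxy /= n
            px = pxy.sum(axis=1, keepdims=True)   # marginal of x
            py = pxy.sum(axis=0, keepdims=True)   # marginal of y
            nz = pxy > 0
            mi = (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
            best = max(best, mi / np.log(min(nx, ny)))  # normalize to [0, 1]
    return best

# Synthetic example: y depends (nonlinearly) on feature 0 only.
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 5))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=n)
scores = [mic_score(X[:, j], y) for j in range(5)]
```

Because the score is computed from the observed data alone, the ranking does not depend on whatever model is fitted afterwards, which is the model-independence property the thesis emphasizes.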
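The blocked 3×2 cross-validation scheme (three independent halvings of the data, each giving two train/test runs, hence six scores to average) can also be sketched. To keep the example self-contained, a toy nearest-centroid classifier stands in for the thesis's tuned random forest; `blocked_3x2cv_accuracy` and `nearest_centroid` are illustrative names, not from the thesis:

```python
import numpy as np

def blocked_3x2cv_accuracy(X, y, fit_predict, seed=0):
    """Blocked 3x2 cross-validation: 3 random halvings of the data,
    each used twice by swapping the train and test halves, giving
    six accuracy scores that are averaged."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(3):                      # three replications (blocks)
        idx = rng.permutation(len(y))
        half = len(y) // 2
        a, b = idx[:half], idx[half:]
        for tr, te in ((a, b), (b, a)):     # 2-fold: swap train/test roles
            pred = fit_predict(X[tr], y[tr], X[te])
            scores.append(np.mean(pred == y[te]))
    return np.mean(scores)                  # average over the six runs

def nearest_centroid(Xtr, ytr, Xte):
    """Toy classifier standing in for the thesis's random forest."""
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - cents[None]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# Two well-separated Gaussian classes in 4 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.repeat([0, 1], 100)
acc = blocked_3x2cv_accuracy(X, y, nearest_centroid)
```

In the thesis itself, `fit_predict` would be a random forest whose tree count k and per-node feature count p are tuned jointly on the OOB error of each training half before evaluating on the held-out half.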
Keywords/Search Tags: High-dimensional data, Feature selection, MIC, Random forest algorithm, Blocked 3×2 cross-validation