
Research On PCA And CFS Feature Dimensionality Reduction Algorithm Based On MIC

Posted on: 2021-04-10 | Degree: Master | Type: Thesis
Country: China | Candidate: K M Xie | Full Text: PDF
GTID: 2427330629486043 | Subject: Applied statistics
Abstract/Summary:
The arrival of the big-data era and the development of information technology have produced massive amounts of data. Machine learning and deep learning have in recent years become important ways, and powerful tools, to explore such data, and feature processing and extraction are central to them. Feature engineering is an important preparatory stage of machine learning, and data features are crucial to a model's learning results. Data often contain irrelevant or redundant features and redundant information; large amounts of redundancy and noise not only reduce the accuracy of data analysis but also add considerable computation. Feature dimensionality reduction can simplify the data structure, increase the interpretability of the model, reduce its computational cost, and improve learning accuracy. It divides into two main approaches: feature extraction and feature selection. This paper aims to improve, optimize, and generalize feature dimensionality reduction algorithms so that reduction is more effective and more widely applicable. Principal Component Analysis (PCA) and Correlation-based Feature Selection (CFS) are chosen as representative feature extraction and feature selection algorithms, respectively, and the Maximal Information Coefficient (MIC) is used to improve both. The main research work of this paper is as follows:

Firstly, the covariance matrix in PCA can only measure linear relationships between variables, and PCA implicitly assumes the data follow a Gaussian distribution. To address these limits, an improved feature extraction algorithm, named YJ-MICPCA, is proposed based on the Yeo-Johnson transformation and MIC. The data are first transformed to better satisfy the Gaussian assumption in PCA, and the linear relationships in PCA are extended to nonlinear ones via MIC. The effectiveness of YJ-MICPCA is then verified from several aspects on simulated data and on public datasets from the UCI Machine Learning Repository; the results show that YJ-MICPCA is superior to traditional PCA. Finally, a comparison with other common nonlinear feature extraction algorithms shows that YJ-MICPCA also performs better.

Secondly, CFS has several limitations: the linear correlation coefficient can only measure relationships between variables in regression tasks; the symmetric uncertainty (SU) used between variables in classification tasks suffers from an overly large denominator; and mutual information is hard to compute for continuous variables, with results that depend on the discretization method. To address these limits, an improved feature selection algorithm, named MICCFS, is proposed based on MIC. The measures between variables in regression and classification are first unified under MIC, and feature subsets are searched with an evaluation function. Experiments contrasting MICCFS with CFS on regression and classification tasks, using public datasets from the UCI Machine Learning Repository, show from several aspects that MICCFS performs better. Additionally, MICCFS is compared with other commonly used feature selection algorithms on classification tasks, and it outperforms them overall.
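As a rough illustration of two ingredients described above, the Yeo-Johnson transformation behind YJ-MICPCA and the CFS-style merit-based subset search behind MICCFS, a minimal pure-Python sketch follows. This is not the thesis's implementation: the function names (`yeo_johnson`, `merit`, `forward_select`) are illustrative, and since MIC requires a dedicated estimator, a simple absolute Pearson correlation stands in as the association measure here.

```python
import math

# Sketch under assumptions: the thesis plugs MIC into the CFS merit; here
# an absolute Pearson correlation is a placeholder association measure.

def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of a single value x with parameter lam."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -(((-x + 1) ** (2 - lam) - 1) / (2 - lam))

def abs_pearson(x, y):
    """Absolute Pearson correlation (placeholder for MIC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else abs(cov / (sx * sy))

def merit(subset, features, target, assoc=abs_pearson):
    """CFS merit k*r_cf / sqrt(k + k*(k-1)*r_ff): r_cf is the mean
    feature-target association, r_ff the mean feature-feature association."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(assoc(features[i], target) for i in subset) / k
    if k == 1:
        return r_cf  # no feature-feature pairs yet
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(assoc(features[i], features[j]) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, target, assoc=abs_pearson):
    """Greedy forward search: add the feature that most improves the merit."""
    remaining = list(range(len(features)))
    chosen, best = [], 0.0
    while remaining:
        f, score = max(((f, merit(chosen + [f], features, target, assoc))
                        for f in remaining), key=lambda t: t[1])
        if score <= best:
            break
        chosen.append(f)
        remaining.remove(f)
        best = score
    return chosen, best
```

Given a target, one informative feature, a redundant copy of it, and a noisy feature, the forward search keeps only the informative one: the redundant copy raises the feature-feature correlation in the denominator without improving the merit, which is exactly the redundancy penalty that motivates CFS.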
Keywords/Search Tags: Principal Component Analysis, Correlation-based Feature Selection, Maximal Information Coefficient, YJ-MICPCA, MICCFS