This is the age of big data. Many disciplines, from quantitative biology and genomics to financial engineering and risk management, are faced with high-dimensional problems. In the face of high-dimensional data, variable selection and feature extraction are the key to knowledge discovery. Classical statistics has a long history of studying high-dimensional problems, and new machine learning methods now challenge classical statistics in high-dimensional data processing. The purpose of this paper is to compare the performance of classical variable selection methods and newly emerging machine learning methods on the variable selection problem. Among the classical variable selection methods, we chose four regularization-based methods: Lasso, Adaptive Lasso, Elastic net and SCAD. On the machine learning side, we mainly study decision tree methods.

In the first part of the paper, we give a brief but comprehensive introduction to classical statistical variable selection methods and machine learning variable selection methods. In the second part, we introduce in detail the variable selection principle, parameter selection criteria, solution algorithms and statistical properties of the Lasso, Adaptive lasso, Elastic net and SCAD methods. Regarding solution algorithms, we not only introduce the classical least angle regression for the first three methods, but also apply the proximal gradient descent algorithm, and we solve SCAD with a local quadratic approximation. We also analyze in detail the differences and connections among the four regularization-based methods.

In the third part, we introduce the decision tree method. We mainly introduce information gain, the information gain ratio, the Gini index, the DKM criterion and a distance-based criterion as variable selection criteria for decision trees, and we compare the performance of these criteria. For the first three criteria, we introduce the corresponding tree-growing methods, namely the ID3, C4.5 and CART algorithms. In addition, we apply the regularization idea of the second part to the pruning of decision trees. Finally, the advantages and disadvantages of decision trees are analyzed, and corresponding performance enhancement algorithms are proposed for classification trees and regression trees respectively.

The fourth part is the numerical simulation. The simulation used four models to generate data, and we propose a comprehensive and reasonable model evaluation criterion. Through the simulation, we found that, among the four regularization-based methods, the variables selected by Lasso and Adaptive lasso were roughly the same, but Adaptive lasso had a smaller standard deviation and mean squared error than Lasso; the Elastic net tends to select more variables; and SCAD is not only superior to the other three methods at eliminating irrelevant variables, but also has a smaller standard deviation and mean squared error than they do. The larger the sample size, the closer the variables selected by SCAD are to the true model, which also verifies its oracle property. Although the decision tree is not good at regression problems, it can still select the true variables very accurately: in the variable-importance ranking produced by the performance enhancement algorithm for decision trees, the scores of the true variables are much higher than those of the irrelevant variables.
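For reference, the four regularization penalties compared above are typically written as follows. This is only a sketch using their standard textbook definitions; the exact parameterization used in the body of the thesis may differ. All four estimators minimize a penalized least-squares criterion

\[ \hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\|y - X\beta\|_2^2 + P_\lambda(\beta), \]

with, respectively,

\[ P_\lambda^{\mathrm{Lasso}}(\beta) = \lambda\sum_j |\beta_j|, \qquad P_\lambda^{\mathrm{Adaptive}}(\beta) = \lambda\sum_j \hat{w}_j |\beta_j|,\quad \hat{w}_j = 1/|\hat{\beta}_j^{\mathrm{init}}|^{\gamma}, \]

\[ P_{\lambda_1,\lambda_2}^{\mathrm{EN}}(\beta) = \lambda_1\sum_j |\beta_j| + \lambda_2\sum_j \beta_j^2, \qquad \frac{d}{dt}P_\lambda^{\mathrm{SCAD}}(t) = \lambda\Big\{ I(t\le\lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda} I(t>\lambda) \Big\},\ t>0,\ a\approx 3.7, \]

where the SCAD penalty is applied coordinate-wise to \(|\beta_j|\).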
The fifth part is the real data analysis. In the numerical simulation part we used regression models; in the real data analysis part we use classification models. First, we introduce how the Lasso, Adaptive lasso, Elastic net and SCAD methods are used to solve classification problems by applying them to the logistic regression model. For the first real data analysis, in order to analyze the order in which variables enter the model, we selected a breast cancer classification data set with a small number of variables. We fitted the models on the training set and evaluated their classification accuracy on the validation set. For the classical statistical methods, we first present the coefficient path plot from one run together with the corresponding cross-validation (CV) error plot. We then repeated the procedure 100 times and found that the classification accuracies of Lasso, Adaptive lasso, Elastic net and SCAD on the test set were 96.5366%, 96.5877%, 96.4781% and 96.7756% respectively. The first three variables selected into the model were always 2, 3 and 6, and the last two variables added to the model were always variables 5 and 9. For the decision tree method, we first grew a tree on the training set and obtained a classification accuracy of 94.7619% on the validation set; after pruning the tree, we obtained the same result. We then grew 100 trees on the training set and applied the decision tree performance enhancement algorithm, which raised the classification accuracy on the test set to 96.1905%. We also used this algorithm to rank the variables by importance: the three most important variables, 2, 3 and 6, agree with those obtained by the classical statistical methods, but the decision tree regards variables 4 and 9 as the least relevant, which differs from the 5 and 9 obtained by the classical methods. The implementation of the second real data analysis is essentially the same as that of the first. We conclude that, based on 100 repetitions, the classification accuracies of Lasso, Adaptive Lasso, Elastic net and SCAD on the test set are 90.5807%, 91.7963%, 90.9354% and 99.8387% respectively, while the classification accuracy of the decision tree performance enhancement algorithm on the test set is 93.5484%. We also analyzed in detail the variables selected by each method.

The sixth part is the summary and outlook. In this part, the classical statistical methods and the machine learning methods are compared and summarized, and ideas for improving the work in this paper are proposed.
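As a purely illustrative sketch of the workflow described in the fifth part, the following Python code fits an L1-penalized (lasso-type) logistic regression, a single decision tree and a tree ensemble on a breast cancer data set, then compares held-out accuracy and variable-importance rankings. The thesis does not specify an implementation; scikit-learn, its built-in breast cancer data and the random forest standing in for the "performance enhancement" algorithm are assumptions made here for illustration and may differ from the data set and ensemble actually used.

    # Illustrative sketch only; the thesis does not specify an implementation.
    # scikit-learn, its built-in breast cancer data, and a random forest standing in for the
    # "performance enhancement" algorithm are assumptions made for illustration.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # L1-penalized (lasso-type) logistic regression, penalty level chosen by cross-validation
    lasso_logit = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5, max_iter=5000)
    lasso_logit.fit(X_train, y_train)

    # A single decision tree and a 100-tree ensemble grown on the training set
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Compare held-out classification accuracy, as in the real data analyses above
    for name, model in [("lasso-logit", lasso_logit), ("tree", tree), ("forest", forest)]:
        print(name, accuracy_score(y_test, model.predict(X_test)))

    # Variable-importance ranking from the ensemble, analogous to the ranking reported above
    ranking = forest.feature_importances_.argsort()[::-1]
    print("most important variables:", ranking[:3], "least important:", ranking[-2:])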