| The progress of science and technology has produced huge amounts of data, and data mining has become a prominent research topic. Data usually appear as continuous values, yet many data mining algorithms, such as decision trees and Naïve Bayes, can only handle discrete data; discretizing continuous values is therefore necessary for such algorithms to work properly. Existing discretization methods fall into several categories, such as methods based on information entropy, methods based on multiple attributes, and methods based on statistical independence; the problems they address are the number of discretized intervals and the placement of breakpoints. The Chi2 algorithm is a classic discretization method based on statistical independence, and the improvements proposed here target its shortcomings. First, a new method for selecting the degrees of freedom of the chi-square statistic is proposed, which takes into account both the number of classes that appear in adjacent intervals and the number of adjacent intervals. Second, information entropy is introduced in place of the inconsistency rate, which avoids the extra computation caused by parameter selection and better describes the inherent properties of the samples. Finally, the importance of each feature attribute is calculated and assigned to each potential breakpoint of that attribute; the breakpoints are then sorted in ascending order of average importance, and the attribute is discretized following a backward-optimization principle. The improved Chi2 algorithm is compared with other Chi2-based algorithms, using the classification accuracy of the C4.5 decision tree and the Naïve Bayes classifier on the discretized datasets as the evaluation indicator. The experimental results show that the improved Chi2 algorithm achieves better results on datasets with a large number of continuous attributes. Extending the improved Chi2 algorithm to multi-attribute discretization is the focus of future research. |
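As background for the chi-square test that Chi2-style methods apply to pairs of adjacent intervals, the sketch below computes the conventional 2 × k chi-square statistic from per-class sample counts. It is a minimal illustration only: the function name, the example counts, and the handling of zero expected counts are assumptions made for this sketch, and the paper's actual contribution (the modified choice of degrees of freedom and the entropy-based criterion) is not reproduced here.

```python
# Minimal sketch: chi-square statistic for two adjacent intervals, as used
# when Chi2-style discretization tests whether the intervals can be merged.
# Each interval is described by its per-class sample counts.

def chi2_statistic(counts_a, counts_b):
    """Chi-square statistic over a 2 x k contingency table.

    counts_a, counts_b: per-class sample counts of two adjacent intervals,
    aligned so that index j refers to the same class in both lists.
    """
    table = [counts_a, counts_b]                    # 2 x k contingency table
    row_totals = [sum(row) for row in table]        # samples per interval
    col_totals = [sum(col) for col in zip(*table)]  # samples per class
    n = sum(row_totals)                             # total samples

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                continue  # observed is also 0 here, so the cell contributes nothing
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Example: two adjacent intervals over three classes.
# Interval A holds 8/1/1 samples of classes 0/1/2, interval B holds 2/7/1,
# giving a comparatively large statistic, so the intervals would likely not be merged.
print(chi2_statistic([8, 1, 1], [2, 7, 1]))
```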