| The progress of science and technology has produced huge amounts of data, and data mining has become a prominent research topic. Data usually appear as continuous values, yet many data mining algorithms, such as decision trees and Naïve Bayes, can only handle discrete data; discretizing continuous values is therefore necessary for such algorithms to work properly. Existing discretization methods fall into several categories, such as methods based on information entropy, methods based on multiple attributes, and methods based on statistical independence; the problems they address are the number of discretized intervals and the placement of breakpoints. The Chi2 algorithm is a classic discretization method based on statistical independence, and the improvements proposed here target its shortcomings. First, a new method for selecting the degrees of freedom of the chi-square statistic is proposed, which takes into account both the number of classes that appear in adjacent intervals and the number of adjacent intervals. Second, information entropy is introduced in place of the inconsistency rate, which avoids the extra computation caused by parameter selection and better describes the inherent properties of the samples. Finally, the importance of each feature attribute is calculated and assigned to each potential breakpoint of that attribute; the breakpoints are then sorted in ascending order of average importance, and the attribute is discretized following a backward-optimization principle. The improved Chi2 algorithm is compared with other Chi2-based algorithms, using the classification accuracy of the C4.5 decision tree and the Naïve Bayes classifier on the discretized datasets as the evaluation indicator. The experimental results show that the improved Chi2 algorithm achieves better results on datasets with a large number of continuous attributes. Extending the improved Chi2 algorithm to multi-attribute discretization is the focus of future research. |
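As background for the chi-square test that Chi2-style methods apply to pairs of adjacent intervals, the sketch below computes the conventional 2 × k chi-square statistic from per-class sample counts. It is a minimal illustration only: the function name, the example counts, and the handling of zero expected counts are assumptions made for this sketch, and the paper's actual contribution (the modified choice of degrees of freedom and the entropy-based criterion) is not reproduced here.

```python
# Minimal sketch: chi-square statistic for two adjacent intervals, as used
# when Chi2-style discretization tests whether the intervals can be merged.
# Each interval is described by its per-class sample counts.

def chi2_statistic(counts_a, counts_b):
    """Chi-square statistic over a 2 x k contingency table.

    counts_a, counts_b: per-class sample counts of two adjacent intervals,
    aligned so that index j refers to the same class in both lists.
    """
    table = [counts_a, counts_b]                    # 2 x k contingency table
    row_totals = [sum(row) for row in table]        # samples per interval
    col_totals = [sum(col) for col in zip(*table)]  # samples per class
    n = sum(row_totals)                             # total samples

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                continue  # observed is also 0 here, so the cell contributes nothing
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Example: two adjacent intervals over three classes.
# Interval A holds 8/1/1 samples of classes 0/1/2, interval B holds 2/7/1,
# giving a comparatively large statistic, so the intervals would likely not be merged.
print(chi2_statistic([8, 1, 1], [2, 7, 1]))
```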