
Analysis Of Classification Error From The Dimension Of Data, The Imbalance Of Data And The Overlap Of The Data Density Distribution

Posted on: 2017-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 2310330503990902
Subject: Applied Statistics
Abstract/Summary:
This paper discusses classification error from three aspects: dimension reduction of data, imbalance of data, and overlap of data. Of these, the overlap of data is the most important, and a new index measuring the overlap of the data density distribution is given, based on the conditional density distribution of the data.

On the issue of dimension reduction, a decision tree algorithm is used to select the important variables from high-dimensional data, and a more accurate and effective classification algorithm is then applied to the reduced data. In the paper, Classification and Regression Tree (CART) is used to select the important variables, and Support Vector Machine (SVM) is used for classification. The first step gives a good dimension reduction result because the procedure of the decision tree algorithm is transparent.

Imbalance of data affects classification, and the issue has been studied by many researchers. In the paper, the differences in how imbalanced data is handled are summarized using two representative kinds of algorithm. More importantly, a new solution is put forward based on research on missing data, and the connection between the two problems can be proved theoretically.

Many papers discuss the effects of data imbalance and of overlap in the data density distribution, and they find that overlap has the greater effect on classification results. This finding is supported by the data experiments in the paper. Many indexes for measuring overlap have been proposed, such as the Fisher Discriminant Index and the Volume of Overlap Region (F2). In the paper, a new index is introduced based on the conditional density distribution of the data. Compared with other indexes, the new index has advantages in its principle of construction and in its numerical range. Most importantly, it better explains the connection between data overlap and classification error, so it is a good choice for measuring the overlap of the data density distribution.
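For orientation, the two existing overlap indexes named in the abstract can be sketched in a few lines of Python. This is a minimal illustration based on the commonly used two-class definitions of these measures, not the thesis's own code, and it does not implement the new conditional-density index, which the abstract does not define. The function names and the toy data are assumptions made for the example; clamping negative overlap to zero in F2 is one common variant.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Fisher discriminant ratio: the largest per-feature value of
    (mean_a - mean_b)^2 / (var_a + var_b). Larger means less overlap."""
    best = 0.0
    # zip(*rows) turns a list of samples into a sequence of feature columns.
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


def overlap_volume(class_a, class_b):
    """Volume of Overlap Region (F2): product over features of the
    overlapping range divided by the total range. Smaller means less overlap."""
    volume = 1.0
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        overlap = min(max(col_a), max(col_b)) - max(min(col_a), min(col_b))
        total = max(max(col_a), max(col_b)) - min(min(col_a), min(col_b))
        if total == 0:
            continue  # feature is constant in both classes: factor of 1
        volume *= max(0.0, overlap) / total  # assumed variant: clamp at zero
    return volume


# Hypothetical toy data: two samples per class, two features each.
separated_a = [(0.0, 0.0), (1.0, 1.0)]
separated_b = [(3.0, 3.0), (4.0, 4.0)]
mixed_a = [(0.0, 0.0), (2.0, 2.0)]
mixed_b = [(1.0, 1.0), (3.0, 3.0)]

if __name__ == "__main__":
    print(fisher_ratio(separated_a, separated_b), overlap_volume(separated_a, separated_b))
    print(fisher_ratio(mixed_a, mixed_b), overlap_volume(mixed_a, mixed_b))
```

The two measures run in opposite directions (a high Fisher ratio and a low F2 both indicate well-separated classes), and the Fisher ratio is unbounded above while F2 lies in [0, 1]. That difference in numerical range is the kind of property the abstract refers to when comparing indexes.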
Keywords/Search Tags:error of classification, dimension of data, imbalance of data, overlap of data