
Research On Several Classification Problems

Posted on: 2016-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Lin
Full Text: PDF
GTID: 2180330473965234
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This thesis focuses on comparing linear and nonlinear methods for classification problems.

For the linear methods, I compare OLS (ordinary least squares), LDA (linear discriminant analysis) and logistic regression, as well as OLS and LDA applied to data whose dimension has been reduced by PCA and LDA. OLS is one of the most fundamental and commonly used linear models; beyond regression, it can also handle classification. The only difference is that the response variable becomes an indicator matrix rather than a single vector as in regression: rows of the indicator matrix represent observations, columns represent classes, and within each row a one in a column means the observation belongs to that class while a zero means it does not. For linear classification problems OLS usually gives good results; however, it suffers from a masking problem, especially when the classes are arranged in parallel in the feature space, in which case it can completely ignore the class in the middle. LDA, like OLS, is well suited to linear classification boundaries, and it is preferable because it avoids the masking problem that OLS has. Logistic regression, originally designed for two-class problems, uses the probability ratio to convert the 0-1 response variable into a continuous one and then solves the classification problem; here I extend it to multi-class classification, and because of the characteristics of the model it consistently classifies well.

For the nonlinear methods, this thesis focuses on SVM (support vector machine), decision trees, bagging (bootstrap aggregating) and random forests. By setting the kernel argument to 'linear', 'polynomial' or 'radial', an SVM can adapt to linear, polynomial and radial classification boundaries, and these options make it a strong classification method. A single decision tree, because of its structure, tends to have high variance and low accuracy, especially when the classification boundary is linear. Bagging, by averaging a large number of trees, largely resolves the high variance and low accuracy of a single tree. However, if one variable is especially informative for the classification problem, most of the trees that bagging builds may place that variable at the top node; the trees then become correlated, which reduces the efficiency of the bagging procedure. Lastly, a random forest forces a random selection of candidate variables for the different trees, so it does not suffer from this correlation problem.

Finally, by analyzing the real ISOLET data set, I compare all of the methods above and identify the best ones for this specific data set.
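As a rough illustration of the linear methods discussed above, the following sketch (not taken from the thesis) fits OLS on an indicator matrix, LDA, and multinomial logistic regression, and compares their test accuracy. It assumes scikit-learn and uses a synthetic three-class data set as a stand-in for ISOLET; all variable names are illustrative.

# Sketch: indicator-matrix OLS vs. LDA vs. multinomial logistic regression.
# Assumes scikit-learn; synthetic 3-class data stands in for ISOLET.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# OLS on an indicator matrix: one 0/1 column per class, regress each column on X,
# then assign each test point to the class with the largest fitted value.
Y_ind = np.eye(3)[y_tr]                      # rows = observations, columns = classes
ols = LinearRegression().fit(X_tr, Y_ind)
ols_pred = ols.predict(X_te).argmax(axis=1)

lda_pred = LinearDiscriminantAnalysis().fit(X_tr, y_tr).predict(X_te)
log_pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)

for name, pred in [("OLS", ols_pred), ("LDA", lda_pred), ("logistic", log_pred)]:
    print(name, "accuracy:", (pred == y_te).mean())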
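In the same spirit, here is a minimal sketch of the nonlinear methods, again assuming scikit-learn and synthetic data in place of ISOLET. The kernel names mirror the 'linear'/'polynomial'/'radial' options mentioned above (scikit-learn calls them 'linear', 'poly' and 'rbf'); the comparison is illustrative, not the thesis's actual experiment.

# Sketch: SVM kernels, a single decision tree, bagging, and a random forest,
# compared on synthetic 3-class data (a stand-in for ISOLET).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (polynomial)": SVC(kernel="poly", degree=3),
    "SVM (radial)": SVC(kernel="rbf"),
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging averages many trees grown on bootstrap samples (all features
    # considered at every split), which lowers the variance of a single tree.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                                 random_state=0),
    # A random forest additionally restricts each split to a random subset of
    # features, which decorrelates the trees.
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy {acc:.3f}")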
Keywords/Search Tags: classification, LDA, logistic regression, FDA, SVM, decision tree, bagging, random forest