
Application Research on Text Categorization Based on Support Vector Machine

Posted on: 2015-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wu
Full Text: PDF
GTID: 2308330473453250
Subject: Computer system architecture
Abstract/Summary:
As society becomes increasingly information-driven, information is being produced at an exponential rate, especially on the Internet, which further aggravates information overload. How to extract valid data automatically and efficiently is therefore an important research area. Text categorization is a branch of this area; its purpose is to assign given texts to predefined categories for further processing. It encompasses many methods and has been widely applied, which makes it an attractive subject of study.

Methods for text categorization fall into three main directions: word matching, knowledge engineering, and statistical learning. The support vector machine (SVM) belongs to statistical learning and rests on a solid theoretical foundation. The SVM requires no domain expertise, is easy to migrate to new tasks, handles high-dimensional data and small sample sizes efficiently, and offers good generalization performance. It now works well in many areas, such as text categorization and image recognition.

The main purpose of this thesis is therefore to study the theory and methods of SVM-based text categorization, together with all the steps required for classification: sample selection, text encoding, Chinese word segmentation, feature extraction, text vectorization, and system design and implementation. Some improvements are made in the implementation to achieve better performance. The theory of SVM is introduced along with the basics of computing the classifier's parameters, the extension of the theory to multi-class problems, and faster computation with sequential minimal optimization (SMO). To apply SVM successfully to text categorization, the evaluation function, the chi-square test, and TF-IDF are also studied.
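As a rough illustration of how chi-square feature selection and TF-IDF weighting fit together, the following minimal sketch scores terms with the standard chi-square statistic and then weights the selected terms by TF-IDF. It is not the thesis's implementation: the simplified chi-square variant used in the thesis is not specified here, and the function names (`chi_square`, `select_features`, `tfidf_vector`), the toy corpus, and the tie-breaking rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic for one (term, category) pair.

    Built from the 2x2 contingency table of term presence vs. category
    membership. This is the textbook form, assumed here for illustration.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    vocab = sorted({t for tokens in docs for t in tokens})
    cats = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    # Ties are broken alphabetically (an arbitrary choice for reproducibility).
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]


def tfidf_vector(tokens, features, docs):
    """TF-IDF weights of one document, restricted to the selected features."""
    n = len(docs)
    counts = Counter(tokens)
    vec = []
    for t in features:
        df = sum(1 for d in docs if t in d)          # document frequency
        idf = math.log(n / df) if df else 0.0
        tf = counts[t] / len(tokens) if tokens else 0.0
        vec.append(tf * idf)
    return vec


# Hypothetical, already-segmented toy corpus.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "price", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=3)
vectors = [tfidf_vector(d, features, docs) for d in docs]
print(features)  # ['goal', 'market', 'price']
```

The resulting vectors are the kind of input an SVM classifier would be trained on; restricting TF-IDF to the chi-square-selected terms keeps the vectors short.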
Besides the implementation of the text categorization system, the main contributions of this thesis are as follows:
· Combining the simplified chi-square test with TF-IDF to improve the efficiency of quantification.
· Building on the decision-making methods used for multi-class problems, presenting a different decision-making method for samples with non-unique decisions and analyzing its pros and cons. The method classifies such a text and illustrates the combination of classifiers involved in the decision; it is tested experimentally.
· Implementing web-based text extraction for generalization testing and providing a user interface for the text categorization task.
· Running cross-validations and studying the relationship between penalty parameters for a particular category. The cross-validations yielded sufficient data to identify the most suitable kernel and its related parameters.
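To make the notion of a non-unique decision concrete, here is a minimal sketch of one-vs-one voting over pairwise classifier outputs. It flags the case where several categories tie for the most votes and then breaks the tie by accumulated decision margin. The tie-breaking rule, the function name `ovo_decide`, and the example scores are assumptions for illustration only; the decision-making method proposed in the thesis is not described in this abstract and may differ.

```python
from collections import defaultdict


def ovo_decide(pairwise_scores):
    """Combine one-vs-one classifier outputs into a single decision.

    pairwise_scores maps (class_a, class_b) to a decision value, where a
    non-negative value favors class_a and a negative value favors class_b.
    Returns (chosen_class, is_unique). is_unique is False when several
    classes tie on votes, i.e. the sample has a non-unique decision.
    """
    votes = defaultdict(int)
    margin = defaultdict(float)
    for (a, b), score in pairwise_scores.items():
        votes[a] += 0                      # make sure both classes are listed
        votes[b] += 0
        winner = a if score >= 0 else b
        votes[winner] += 1
        margin[winner] += abs(score)       # confidence accumulated by the winner

    top = max(votes.values())
    tied = sorted(c for c, v in votes.items() if v == top)
    if len(tied) == 1:
        return tied[0], True
    # Assumed tie-break: prefer the tied class with the largest total margin.
    return max(tied, key=lambda c: margin[c]), False


# Unique decision: A wins two of the three pairwise contests.
unique = ovo_decide({("A", "B"): 0.8, ("B", "C"): 0.3, ("A", "C"): 0.2})

# Non-unique decision: a cyclic outcome gives A, B and C one vote each.
cyclic = ovo_decide({("A", "B"): 0.8, ("B", "C"): 0.3, ("A", "C"): -0.5})

print(unique)  # ('A', True)
print(cyclic)  # ('A', False)
```

Returning the uniqueness flag alongside the label lets a caller treat tied samples separately, for example by logging which pairwise classifiers disagreed.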
Keywords/Search Tags: Text categorization, SVM, Kernel, Penalization parameter, Non-unique decision