Application Research On Text Categorization Based On Support Vector Machine

Posted on:2015-08-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wu

GTID:2308330473453250

Subject:Computer system architecture

Abstract/Summary:

As society becomes increasingly information-driven, information is being produced at an exponential rate, especially on the Internet, which further aggravates information overload. How to extract valid data automatically and efficiently is therefore an important research area. Text categorization is a branch of this area; its purpose is to assign given texts to certain categories for further processing. Its many available methods and its wide use make it an attractive topic.

There are three main approaches to text categorization: word matching, knowledge engineering, and statistical learning. The support vector machine (SVM) belongs to statistical learning and rests on a solid theoretical foundation. The SVM requires no domain expertise, transfers easily to new tasks, handles high-dimensional data and small sample sizes efficiently, and offers better generalization performance. It now works well in many areas, such as text categorization and image recognition.

The main purpose of this thesis is therefore to study the theory and methods of SVM-based text categorization, together with all the steps involved in classification: sample selection, text encoding, Chinese word segmentation, feature extraction, text vectorization, and system design and implementation. Some improvements are made in the implementation to achieve better performance. The theory of SVM is introduced, covering the basics of computing the classifier's parameters, the application of the theory to multi-class problems, and faster computation with SMO. To apply SVM to text categorization successfully, the thesis also studies evaluation functions, the chi-square test, and TF-IDF.
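
As a minimal sketch of how chi-square feature selection and TF-IDF weighting can work together, the Python below scores terms with the standard 2x2 chi-square statistic, keeps the top-scoring terms, and weights only those terms with TF-IDF. The simplified chi-square variant used in the thesis is not described in this abstract, so the standard statistic is an assumption here, and the function names (`chi_square`, `select_features`, `tfidf_vectors`) are hypothetical.

```python
import math
from collections import Counter


def chi_square(docs, labels, category):
    """Score each term with the 2x2 chi-square statistic against one category."""
    n = len(docs)
    in_cat = sum(1 for y in labels if y == category)
    df_all = Counter()  # number of documents containing the term
    df_cat = Counter()  # number of documents in `category` containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == category:
                df_cat[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_cat[term]      # in category, contains term
        b = df - a            # other categories, contains term
        c = in_cat - a        # in category, lacks term
        d = n - in_cat - b    # other categories, lacks term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories."""
    best = {}
    for category in set(labels):
        for term, score in chi_square(docs, labels, category).items():
            best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


def tfidf_vectors(docs, vocabulary):
    """Build sparse TF-IDF vectors restricted to the selected vocabulary."""
    vocabulary = set(vocabulary)
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens) if t in vocabulary)
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in idf)
        total = len(tokens) or 1
        vectors.append({t: (count / total) * idf[t] for t, count in tf.items()})
    return vectors
```

Selecting features first keeps the TF-IDF vectors short, which is one plausible reading of "improving the efficiency of quantification"; the resulting sparse vectors could then be passed to an SVM trainer.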
Besides the implementation of the text categorization system, the main contributions of this thesis are as follows:
· Combining the simplified chi-square test with TF-IDF to improve the efficiency of quantification.
· Based on the decision-making methods used for multi-class problems, presenting a different decision-making method for samples with non-unique decisions and analyzing its pros and cons. The method classifies a text and illustrates its combinations of classifiers under the multi-class decision-making methods, and it is tested experimentally.
· Implementing web-based text extraction for generalization testing and providing a user interface for text categorization tasks.
· Running cross-validation and studying the relationship of penalty parameters for a particular category. The cross-validations yielded adequate data for finding the most suitable kernel and related parameters.
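
To make the idea of a "non-unique decision" concrete, here is a small Python sketch of one-vs-one voting in which several categories can tie for the most votes. The abstract does not describe the thesis's own decision method, so the tie-break used here (the largest summed decision margin among the tied categories) is only an assumed illustration, and `ovo_decide` is a hypothetical name.

```python
def ovo_decide(pairwise_scores):
    """Combine one-vs-one classifier outputs into a single decision.

    pairwise_scores maps (cat_a, cat_b) to a signed score:
    a positive score votes for cat_a, any other score votes for cat_b.
    Returns (chosen_category, tied_categories).
    """
    votes = {}
    margin = {}
    for (a, b), score in pairwise_scores.items():
        for cat in (a, b):
            votes.setdefault(cat, 0)
            margin.setdefault(cat, 0.0)
        winner = a if score > 0 else b
        votes[winner] += 1
        # Accumulate how strongly each category was favoured overall.
        margin[a] += score
        margin[b] -= score

    top = max(votes.values())
    tied = sorted(cat for cat, v in votes.items() if v == top)
    if len(tied) == 1:
        return tied[0], tied  # unique decision: plain majority vote
    # Non-unique decision: assumed tie-break by summed margin.
    return max(tied, key=lambda cat: margin[cat]), tied


# Three categories whose pairwise classifiers vote in a cycle.
cyclic = {("A", "B"): 0.9, ("B", "C"): 0.2, ("A", "C"): -0.4}
label, tied = ovo_decide(cyclic)
```

Returning the tied categories alongside the chosen label keeps the ambiguity visible, so a caller can report which combination of classifiers led to the decision.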

Keywords/Search Tags:

Text categorization, SVM, Kernel, Penalization parameter, Non-unique decision

Related items

1	Text Categorization Research Based On Support Vector Machine
2	Research On Chinese Text Categorization Based On Support Vector Machine
3	Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set
4	Research On Web Chinese Text Automatic Categorization Based On Rs-svm
5	Research On Web Chinese Text Automatic Categorization Based On RS-SVM
6	The Studies On Chinese Text Categorization Based On PSO And SVM
7	Research On Food Complaint Document Classification Based On Semantic Kernel Function
8	The Fusion Learning On Technical Text Categorization Based On Decision Tree And SVM
9	Research On Support Vector Machines Classification Algorithm In Text Categorization
10	A Study On Text Categorization Based On Machine Learning