Study On Support Vector Machines Classification Methods And Their Application In Text Categorization

Posted on: 2007-06-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Zhao    Full Text: PDF
GTID: 1119360182960769    Subject: Management Science and Engineering
Abstract/Summary:
Support vector machines (SVMs), a machine learning method based on statistical learning theory, have attracted growing attention and become a major topic in machine learning because they handle practical problems such as nonlinearity, high dimensionality and local minima well. Text categorization is a key technique in content-based automatic information management. Text vectors are high dimensional and extremely sparse, and contain many relevant features. SVMs are particularly well suited to text categorization and have great potential in it, because they are insensitive to the number of relevant features and to sparse data, and they have advantages on high-dimensional problems. However, applying SVMs to text categorization still raises many challenges. For example, text categorization involves a large number of classes and training examples as well as considerable noise, and SVMs have drawbacks in this setting such as slow training and slow classification. This dissertation focuses on the drawbacks of SVMs in practical applications, including text categorization. The main work is as follows:

1. SVMs were originally designed for binary classification, and how to extend them effectively to multi-class classification is still an open research issue. First, several existing multi-class SVM methods are compared and analyzed. Second, a semi-fuzzy kernel clustering (SFKC) algorithm is presented and, according to the characteristics of tree-structured SVMs, a tree-structured multi-class SVM classification method based on the SFKC algorithm is proposed. The method uses the SFKC algorithm to mine information on the overlap between classes, to design the tree structure, and to overcome the misclassification problem of tree-structured SVMs. Experimental results indicate that the method achieves higher precision and faster training than other multi-class SVM methods, and improves the performance of SVMs on multi-class classification.

2. An extended fuzzy SVM is presented to address two problems of standard SVMs: they are very sensitive to noise, and they are undesirably biased towards the class with more samples in the training set. In addition, a combined SVM classification method is proposed that unites the strengths of the extended fuzzy SVM and proximal SVMs. The method uses proximal SVMs to rapidly eliminate non-support vectors, reduce the number of training examples, select model parameters, and calculate the weights of the training examples; it then trains the extended fuzzy SVM with the reduced training set, the model parameters and the example weights. Experimental results indicate that this method can effectively calculate the weights of training examples, reduce training time, and eliminate the harmful effects of outliers and skewed training sets on SVMs.

3. Usually, the more support vectors an SVM has, the slower it classifies. How to reduce the support vector set and increase the classification speed of SVMs is therefore an important research topic. Several existing methods of reducing the support vector set are analyzed. Then a reduction method based on virtual examples and support vector regression (SVR) is proposed, which improves on the method proposed by Osuna et al. according to the characteristics of the support vector set and of SVR. The method eliminates redundant support vectors and creates virtual bound support vectors based on virtual examples. It can effectively reduce a support vector set that contains many redundant support vectors and few bound support vectors, which Osuna's method fails to do. Experimental results indicate that, compared with Osuna's method, the proposed method reduces the number of support vectors to a greater degree and increases the classification speed of SVMs with almost no decrease in accuracy.

4. Building on the advantages of SVMs in text categorization and applying SVMs to text feature extraction, a word clustering method based on SVMs is proposed. The method uses SVMs to evaluate the contribution of each word to classification, and combines different words that have similar contributions into a single text feature. Experimental results indicate that the method loses almost no classification information, dramatically decreases the dimensionality of text vectors and the number of relevant features, and improves the precision and recall of text categorization.
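The word clustering idea in item 4 can be illustrated with a small sketch. The abstract does not say how a word's contribution is measured or how similar contributions are grouped, so everything here is an assumption for illustration only: each word's contribution is taken to be its signed weight in an already trained linear SVM, words are grouped by equal-width binning of those weights, and the names `cluster_words_by_weight`, `merge_features` and the example weights are hypothetical.

```python
from collections import defaultdict


def cluster_words_by_weight(weights, n_bins):
    """Group words whose signed weights fall into the same equal-width bin.

    `weights` maps word -> weight; the weights are ASSUMED to come from a
    trained linear SVM (this is not specified in the abstract).
    """
    lo, hi = min(weights.values()), max(weights.values())
    # Fall back to width 1.0 when all weights are equal (zero range).
    width = (hi - lo) / n_bins or 1.0
    clusters = defaultdict(list)
    for word, w in weights.items():
        # Clamp so the maximum weight lands in the last bin.
        idx = min(int((w - lo) / width), n_bins - 1)
        clusters[idx].append(word)
    return dict(clusters)


def merge_features(doc_counts, clusters):
    """Replace per-word counts with one summed count per word cluster."""
    return {
        idx: sum(doc_counts.get(word, 0) for word in words)
        for idx, words in clusters.items()
    }


# Hypothetical weights for a two-class problem (illustrative values only).
example_weights = {
    "goal": 1.9, "match": 1.7, "team": 1.6,
    "stock": -1.8, "market": -1.6,
    "the": 0.05, "and": -0.02,
}
example_clusters = cluster_words_by_weight(example_weights, n_bins=3)
example_doc = {"goal": 2, "team": 1, "market": 1, "the": 5}
example_vector = merge_features(example_doc, example_clusters)
```

In this toy example seven word features collapse into three cluster features, which is the kind of dimensionality reduction the abstract describes; the actual clustering criterion used in the dissertation may differ.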
Keywords/Search Tags: Support Vector Machines, Text Categorization, Multi-Class Classification, Semi-Fuzzy Kernel Clustering, Support Vector Regression, Virtual Examples, Feature Extraction, Word Clustering