Research On SVM-based Highly Imbalanced Classification And Its Application In Telecommunications

Posted on: 2011-09-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Qu
Full Text: PDF
GTID: 1118330332478378
Subject: Control Science and Engineering
Abstract/Summary:
As a learning machine grounded in statistical learning theory, the SVM has strong generalization ability. Imbalanced classification is an important research area in data mining and machine learning. Data sets in practical business applications often exhibit characteristics such as high imbalance, severe class overlap, noise, high dimensionality, and highly imbalanced multi-class structure, all of which strongly affect classification performance. Motivated by practical telecom business problems, this thesis aims to overcome the shortcomings of current SVM-based approaches to imbalanced classification and to develop more effective solutions, making the SVM more applicable to highly imbalanced business applications. The efficiency and advantages of the proposed approaches are illustrated by simulation case studies on both benchmark data sets and practical telecom data sets. Building on the research results for the key SVM technologies applied to imbalanced classification, two systems are designed and developed in the thesis: the Business Intelligence System for Telecommunications of Preventing Arrears of public customers and the Intelligent Remindering Worksheet Processing System for Telecommunications. The successful application of the two systems in different companies verified that they can effectively reduce the losses caused by arrears. Finally, based on experience from practical data mining projects, a novel DM Methodology for Telecom Industry (DMM-TI) is proposed, which can provide useful suggestions and serve as a guideline for future projects in the telecom industry. The main research results of this thesis are as follows.

I. For the classification of data sets that are simultaneously highly imbalanced and class-overlapping, a novel SVM-HIO (SVM modeling for Highly Imbalanced and Overlapping classification) algorithm is proposed. The algorithm uses a classification hyperplane excursion strategy to identify the non-overlapping samples in a given feature space and to train a meta-model. Using a kernel shifting strategy, it can learn in several kernel spaces and thereby identify more non-overlapping samples in different kernel spaces. Finally, SVM-HIO builds a nonlinear model by combining the meta-models, replacing the original SVM algorithm based on a linear model. The resulting nonlinear model can correctly identify the non-overlapping majority samples. Because the data are highly imbalanced, the remaining overlapping samples can be regarded as minority samples. As a result, all minority samples can be correctly predicted while the classification error of the majority class is minimized.

II. To address the poor classification and recognition of the minority class in highly imbalanced classification problems, a novel criterion based on class separability and degree of imbalance is developed for constructing the structure of binary trees. Based on this criterion, a new MCI-SVM (Multi-Classification based highly Imbalanced SVM) algorithm is proposed. In multi-class modeling, the importance of the minority class is considered first, and it is combined with the separability between classes to establish the structure of the binary tree. The proposed MCI-SVM algorithm can identify important clusters in multi-class problems and place modeling nodes between majority and minority clusters. More importantly, by employing a cost-sensitive learning strategy, it effectively avoids the degradation in classification performance caused by the imbalance. The tree-structured MCI-SVM algorithm has N-1 layers in its training and testing structure, which gives it relatively small error accumulation and better generalization. It reduces the generalization error of the minority class and improves the recognition of the minority class while maintaining overall classification performance.

III. For large-scale, highly imbalanced data sets, models cannot be obtained within a practical time; in addition, the high imbalance leads to poor classification performance and poor recognition of the minority class. Based on approximate MEB (Minimum Enclosing Ball) theory, a novel algorithm called LCI-SVM (Large scale Classification based highly Imbalanced SVM) is proposed to deal with this problem. Inspired by the search for the core set of the MEB in a high-dimensional space, LCI-SVM transforms the original SVM optimization problem into the problem of finding a minimum enclosing ball in the high-dimensional space, which makes the training time independent of the dimension and size of the sample set. Consequently, the SVM model can be obtained efficiently on large-scale data sets. Using a heuristic iteration strategy, LCI-SVM moves the classification hyperplane toward the majority class, which leads to better generalization for the minority class. A τ-approximation optimal classification hyperplane is proposed to deal with the fitting problem. The classification results of each iteration are evaluated by a knowledge-based stopping criterion, which also determines the optimal position of the τ-approximation classification hyperplane. Together, these strategies improve the performance of LCI-SVM on large-scale imbalanced data classification, especially in the recognition of the minority class.

IV. Based on the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, together with experience from practical data mining projects, a novel DM Methodology for Telecom Industry (DMM-TI) is proposed, which can provide useful suggestions and serve as a guideline for future data mining projects in the telecom industry.

V. To deal with the telecom arrears problem, two systems are designed and developed: the Business Intelligence System for Telecommunications of Preventing Arrears of public customers and the Intelligent Remindering Worksheet Processing System for Telecommunications. The structure and functional framework of the systems are introduced, and their detailed implementation procedures are given in the thesis. Evaluation of the systems in use showed that they can help telecom companies effectively reduce the losses caused by arrears.
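The approximate-MEB idea behind the third result (LCI-SVM) can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes the simple Badoiu-Clarkson iteration, works in the input space with Euclidean distance rather than in a kernel feature space, and the function name approx_meb is hypothetical. What it shows is that the number of iterations and the size of the core set depend only on the tolerance eps, not on the number or dimension of the points (each iteration still scans all points once).

```python
import math


def approx_meb(points, eps=0.1):
    """(1 + eps)-approximate minimum enclosing ball (Badoiu-Clarkson iteration).

    Assumption: plain Euclidean input space; a kernelized variant would
    replace math.dist with kernel-induced distances.
    """
    center = list(points[0])
    core_set = [points[0]]
    iterations = math.ceil(1.0 / eps ** 2)  # depends on eps only
    for i in range(1, iterations + 1):
        # The farthest point from the current center joins the core set.
        farthest = max(points, key=lambda p: math.dist(center, p))
        if farthest not in core_set:
            core_set.append(farthest)
        # Move the center a shrinking step toward the farthest point.
        step = 1.0 / (i + 1)
        center = [c + step * (f - c) for c, f in zip(center, farthest)]
    radius = max(math.dist(center, p) for p in points)
    return center, radius, core_set


if __name__ == "__main__":
    demo = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.2)]
    c, r, core = approx_meb(demo, eps=0.05)
    print("center:", c, "radius:", r, "core set size:", len(core))
```

The returned ball always encloses every point, and its radius is at most (1 + eps) times the optimal radius; only the core-set points ever influence the center.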
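Moving the classification hyperplane toward the majority class, as described for LCI-SVM, can be pictured as adding an offset to the decision value of an already trained classifier. The sketch below is a generic illustration under stated assumptions, not the thesis's heuristic iteration or its τ-approximation criterion: it assumes decision scores f(x) are already available, labels the minority class +1 and the majority class -1, and selects the offset by G-mean. The names g_mean and best_offset are hypothetical.

```python
import math


def g_mean(scores, labels, offset):
    """Geometric mean of minority recall and majority recall
    when predicting +1 (minority) if score + offset >= 0."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        predicted_minority = (s + offset) >= 0
        if y == 1:
            if predicted_minority:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_minority:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def best_offset(scores, labels):
    """Try the offsets at which some prediction flips and keep the best.

    A positive offset shifts the boundary toward the majority class,
    so more samples are predicted as minority. Offset 0.0 is listed
    first, so the unshifted hyperplane wins ties.
    """
    candidates = [0.0] + [-s for s in scores]
    return max(candidates, key=lambda d: g_mean(scores, labels, d))


if __name__ == "__main__":
    scores = [-2.0, -1.5, -1.2, -1.0, -0.9, -0.8, -0.3, 0.4]
    labels = [-1, -1, -1, -1, -1, -1, 1, 1]
    d = best_offset(scores, labels)
    print("offset:", d, "G-mean:", g_mean(scores, labels, d))
```

G-mean is used here only because it penalizes ignoring either class; any imbalance-aware criterion could take its place.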
Keywords/Search Tags: Data Mining, Telecommunications, Highly Imbalanced Classification, Large Scale Datasets, Multi-classification, SVM, Kernel Method, MEB