| With the global spread of the Internet, the world has entered a high-speed information age. Information on the Internet is increasing sharply, and people can conveniently browse and share a vast amount of network resources. At the same time, however, negative and unhealthy content is growing rapidly, which affects national stability and unity. By identifying Web content, classifying Web pages, and filtering URLs, user behavior on the Internet can be regulated and a harmonious, clean network environment can be created. As research and applications have deepened, Web classification has become an important research direction in data mining. This paper mainly studies Web classification algorithms and improves the SVM algorithm, which is applied to a telecom project based on the Security Internet Gateway (SIG) and Unified Threat Management (UTM). The specific contents are as follows: (1) The Web classification model is studied. The whole process of the model is examined by analyzing data sources, pre-processing HTML, segmenting words, and extracting and training feature words. (2) The decision tree, K-nearest neighbor (KNN), and Naive Bayes classification algorithms are studied. The binary tree algorithm, which is typical of decision trees, the Naive Bayes algorithm, which is based on a probabilistic model, and the KNN algorithm, which is widely applied to small text samples, are introduced. (3) The paper focuses on the SVM algorithm, which is based on statistical learning theory and is suited to high-dimensional spaces. Considering the wide range of Web information and the fact that SVM multi-classification algorithms have recently been widely verified, SVM multi-classification algorithms are compared and incremental learning algorithms are discussed. (4) For classifier training, the kernel function of the SVM multi-classifier is modified on the strong basis of statistical theory, and the optimal classifier is ultimately obtained. 
Because the actual classification process is an incremental learning process, a single SVM algorithm can cause re-classification or empty-classification problems. We therefore improved the traditional SVM algorithm by combining it with the highly efficient KNN algorithm to filter URLs. Experiments show that the improved SVM algorithm enhances both precision and recall, effectively filtering unhealthy URLs and cleaning Web content to achieve a "green Internet."... |
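
The abstract does not say how the SVM and KNN classifiers are combined, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes a one-vs-rest arrangement in which each class has its own binary decision function, and it reads "empty classification" as no classifier accepting a sample and "re-classification" as several accepting it; in both cases the decision falls back to KNN. The names `linear_decision`, `hybrid_predict`, and `knn_predict`, the class labels, and all weights and sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(x, samples, labels, k=3, candidates=None):
    """Majority vote among the k nearest labeled samples (Euclidean distance).

    If `candidates` is given, only samples with those labels may vote.
    """
    pool = [(math.dist(x, s), y) for s, y in zip(samples, labels)
            if candidates is None or y in candidates]
    if not pool:  # no sample carries a candidate label: use every sample
        pool = [(math.dist(x, s), y) for s, y in zip(samples, labels)]
    pool.sort(key=lambda pair: pair[0])
    votes = Counter(y for _, y in pool[:k])
    return votes.most_common(1)[0][0]


def linear_decision(weights, bias):
    """Hypothetical stand-in for a trained binary SVM decision function."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias


def hybrid_predict(x, binary_svms, samples, labels, k=3):
    """One-vs-rest vote with a KNN fallback for ambiguous samples."""
    accepted = [c for c, f in binary_svms.items() if f(x) > 0]
    if len(accepted) == 1:
        return accepted[0]  # exactly one classifier claims the sample
    # No classifier accepted (assumed "empty classification") or several
    # did (assumed "re-classification"): let KNN decide. When several
    # accepted, only those classes are allowed to vote.
    candidates = set(accepted) if accepted else None
    return knn_predict(x, samples, labels, k, candidates)


# Illustrative two-feature data; every value here is made up.
SVMS = {
    "news": linear_decision([1.0, 0.0], -0.5),
    "games": linear_decision([0.0, 1.0], -0.5),
}
SAMPLES = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0), (0.2, 0.9)]
LABELS = ["news", "news", "games", "games"]

if __name__ == "__main__":
    for point in [(0.9, 0.1), (0.1, 0.4), (0.9, 0.6)]:
        print(point, hybrid_predict(point, SVMS, SAMPLES, LABELS))
```

With this toy data, the first point is claimed by a single classifier, the second by none, and the third by both, so the last two are resolved by the KNN vote. In practice the decision functions would come from trained SVMs and the samples from feature vectors extracted from Web pages.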