
Research And Implementation Of A System Of Webpages Classification Based On User Behavior Analysis

Posted on: 2012-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: M N Huang
Full Text: PDF
GTID: 2178330335459838
Subject: Communication and Information System
Abstract/Summary:
In recent years, with the rapid development of the Internet, a vast amount of text information, carried mainly by web pages, has appeared on the network. Network data grows exponentially, and finding useful information becomes harder and harder. Search engines, which work in a passive mode, cannot satisfy the demands of users. How to meet the needs of Internet users in an active mode has become a challenging topic for the latest network service systems. Taking user behavior analysis and personalized service as its premise, and aiming to study and improve the key technologies of web page classification, this thesis implements a text classification system suitable for web pages. The key technologies are as follows.

First, Chinese word segmentation. We improve on the original word segmentation method and propose a new method, suitable for web pages, that combines statistical segmentation with maximum matching. The method can identify newly coined Chinese words on the network and can merge single words that frequently appear together in web pages. It therefore avoids omitting new words that are important for classification, and by merging single words it reduces the dimension of the feature vector space, which lowers the computational complexity.

Second, feature extraction and weighting. After studying and testing the conventional feature extraction and weighting algorithms, we propose an improved method suitable for web pages, based on the CHI statistic, which is highly rated by experts. The improved CHI feature extraction algorithm and the TF-IDF-CHI weighting algorithm take greater account of the structure of web pages.
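To make the feature extraction step concrete, the following is a minimal sketch of the standard CHI (chi-square) statistic that the improved method starts from. It does not include the thesis's web-page-structure improvements, which the abstract does not specify. The function names `chi_square` and `chi_scores` and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Term or category is degenerate (e.g. term occurs in every document).
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_scores(docs, labels, category):
    """Score every term against `category`.

    docs: list of token lists (already segmented); labels: parallel list.
    Returns a dict mapping term -> chi-square score.
    """
    in_cat = sum(1 for y in labels if y == category)
    out_cat = len(labels) - in_cat
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (df_in if y == category else df_out).update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_square(a, b, in_cat - a, out_cat - b)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [["football", "match"], ["football", "goal"],
        ["stock", "market"], ["stock", "match"]]
labels = ["sports", "sports", "finance", "finance"]
scores = chi_scores(docs, labels, "sports")
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Terms with the highest scores would be kept as features. A term such as "match", which occurs equally often inside and outside the category, scores zero and would be discarded.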
Experimental results show that both preprocessing algorithms, the improved word segmentation and the improved feature extraction and weighting, increase the accuracy of text classification.

Based on the improved algorithms, we implemented a web page classification module and, at the same time, designed and built a complete user behavior analysis system. The system consists of three modules: the Data Collection & Filter Module, the Web Pages Classification Module and the Result Update Module. Their functions are as follows.

First, the Data Collection & Filter Module collects the original data. Information about users' web behavior is carried in the headers of HTTP packets, so HTTP packets must be analyzed and parsed to obtain it. This module describes the procedure for analyzing HTTP packets.

Second, the Web Pages Classification Module is the main object of this research. It is built on the improved word segmentation algorithm, the improved preprocessing methods, and the KNN or SVM classification algorithm, both of which perform well in classification tasks. The module maps a web page to a particular category.

Third, the Result Update Module summarizes and updates the classification results of the web pages that each Internet user has accessed. Through a direct connection with the personalized service system, the results of user behavior analysis can be sent to and applied in a personalized ad feedback service system.

The System of Webpages Classification Based on User Behavior Analysis presented in this thesis can be applied to both online and offline classification. The experimental results show that the improved preprocessing algorithms improve classification accuracy. The Result Update Module also obtains good results, which clearly reflect users' interests and provide a reference model for research on personalized service systems.
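As an illustration of the kind of processing the Data Collection & Filter Module performs, the sketch below extracts the visited URL and a few behavior-related fields from the head of a raw HTTP request. It is not the thesis's implementation: the function name `parse_http_request_head`, the choice of returned fields, and the sample request are assumptions, and packet capture and TCP reassembly are left out.

```python
def parse_http_request_head(raw: bytes):
    """Parse the request line and headers of a raw HTTP/1.x request.

    Returns a dict with the method, the reconstructed URL and selected
    headers, or None if the data does not look like an HTTP request.
    """
    # The head ends at the first blank line; any body after it is ignored.
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    method, target, _version = parts

    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            # Header names are case-insensitive, so normalize to lower case.
            headers[name.strip().lower()] = value.strip()

    # Requests sent through a proxy may carry an absolute URL already;
    # otherwise combine the Host header with the request target.
    if target.startswith("http://"):
        url = target
    else:
        url = "http://" + headers.get("host", "") + target

    return {
        "method": method,
        "url": url,
        "user_agent": headers.get("user-agent"),
        "referer": headers.get("referer"),
    }


# Hypothetical sample request, for illustration only.
sample = (b"GET /news/index.html HTTP/1.1\r\n"
          b"Host: example.com\r\n"
          b"User-Agent: demo-browser\r\n"
          b"\r\n")
record = parse_http_request_head(sample)
```

The resulting URL is what a downstream step would fetch and hand to the classification module. Data that does not begin with a valid request line is filtered out by the `None` return value.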
Keywords/Search Tags: User Behavior Analysis, Automatic Classification of Web Pages, Chinese word segmentation, CHI, SVM