Research On The Technique Of Extracting Web Page Informational Content Based On Node Type Annotation

Posted on: 2017-04-27  Degree: Master  Type: Thesis
Country: China  Candidate: F L Xie  Full Text: PDF
GTID: 2308330485985646  Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the number of web pages has grown explosively. A web page usually contains not only the informational content that users want to read but also irrelevant material that distracts them, such as navigation bars, recommended links, banner ads, and copyright notices; this material is usually referred to as page noise. Noise causes many problems for web page information retrieval and also degrades tasks such as web page classification, clustering, knowledge mining, topic detection, personalized recommendation, and data mining. When page noise is present, a retrieval system cannot be expected to return good search results. Eliminating page noise, or extracting the main content from the page, is therefore fundamental and significant work for web information retrieval.

In the field of web information extraction, methods can be divided into three categories according to how they handle noise. The first is template matching, which relies on the pages of a site sharing the same template: the method identifies the common template, removes it, and takes the remainder as the main content. The second is machine learning, which is suited to large data sets: manually annotated web pages are used to train a classifier, which then separates a page's topical content from its non-topical information. The third is the heuristic method, which first builds a set of rules from specific visual or structural features of web pages and then uses those rules to identify noise.

Because heuristic methods usually process pages more efficiently than the other kinds of methods, and in view of the deficiencies of the VIPS (VIsion-based Page Segmentation) algorithm, this paper proposes a Node Type Annotation (NTA) information extraction algorithm based on the DOM structure of web pages.
First, based on the forms of noise found in web pages, we define four node types (Text Node, Link Node, Image Node, and Ignore Node) and a Degree of Coherence (DoC) that reflects how consistent a node's content is. To determine each node's type and coherence, the content characteristics of every node are computed, and the node is annotated with node type and DoC properties. During the information extraction stage, a threshold and the node text density are used to identify node types, and the desired text nodes are selected and merged into the main content. Images and anchors that would otherwise be missing from the extracted content receive specific treatment. The proposed method makes up for the deficiencies of VIPS and runs efficiently. It is also more general because it does not rely on particular tags.

Finally, a web page content extraction tool, Web Clipper, is developed based on the proposed NTA algorithm. Over 100 informational pages from seven major portal websites are selected for testing, and Web Clipper is compared with three similar content extraction tools: YNote, Evernote, and Readability. Preliminary results show that the proposed method reaches an average recall of 98.15% and an average precision of 92.41%. On the F1 measure, Web Clipper scores 95.1%, which is 0.3% higher than Evernote and 5.01% higher than YNote. The results indicate that the proposed method is effective and practical to some extent.
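The annotate-then-extract idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation. The abstract does not give the DoC formula, the thresholds, or the text density measure, so everything concrete here is an assumption: DoC is stood in for by the share of a node's text that lies outside links, and the names `TreeBuilder`, `annotate`, `extract`, `main_content`, `doc_threshold`, and `min_text` are hypothetical. Text density is left out, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"script", "style", "noscript"}      # assumed noise-only tags
VOID_TAGS = {"br", "hr", "input", "meta", "link"}   # tags without a closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.parts = []      # raw text of the whole subtree, in document order
        self.text_len = 0    # visible characters in the subtree
        self.link_len = 0    # visible characters that sit inside <a>
        self.own_len = 0     # visible characters directly under this node
        self.images = 0      # <img> elements in the subtree
        self.type, self.doc = None, 0.0

    def self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree; assumes properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#root")
        self.in_anchor = self.in_ignored = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            if not self.in_ignored:
                for node in self.current.self_and_ancestors():
                    node.images += 1
            return
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.in_anchor += 1
        if tag in IGNORED_TAGS:
            self.in_ignored += 1

    def handle_endtag(self, tag):
        if tag != self.current.tag:
            return                       # void or stray end tag
        if tag == "a":
            self.in_anchor -= 1
        if tag in IGNORED_TAGS:
            self.in_ignored -= 1
        self.current = self.current.parent

    def handle_data(self, data):
        if self.in_ignored:
            return
        visible = len(data.strip())
        self.current.own_len += visible
        for node in self.current.self_and_ancestors():
            node.parts.append(data)
            node.text_len += visible
            if self.in_anchor:
                node.link_len += visible


def annotate(node, doc_threshold=0.5, min_text=20):
    """Attach an assumed DoC value and one of four node types to every node."""
    node.doc = 1 - node.link_len / node.text_len if node.text_len else 0.0
    if node.tag in IGNORED_TAGS or (node.text_len == 0 and node.images == 0):
        node.type = "ignore"
    elif node.images and node.text_len < min_text:
        node.type = "image"
    elif node.doc < doc_threshold:
        node.type = "link"
    else:
        node.type = "text"
    for child in node.children:
        annotate(child, doc_threshold, min_text)


def extract(node):
    """Collect text nodes that hold text directly; inline links stay inside them."""
    if node.type == "ignore":
        return []
    if node.type == "text" and node.own_len > 0:
        return [" ".join("".join(node.parts).split())]
    return [text for child in node.children for text in extract(child)]


def main_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    return extract(builder.root)


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/sports">Sports</a></div>
<div id="article">
<h1>Headline of the story</h1>
<p>The first paragraph has enough words to count as content and holds <a href="/x">one link</a>.</p>
<p>The second paragraph continues the story.</p>
</div>
<script>var tracking = 1;</script>
<div id="footer"><a href="/c">Copyright notice</a></div>
</body></html>"""

for block in main_content(SAMPLE):
    print(block)
```

In this sketch the navigation and footer blocks end up typed as link nodes and the script as an ignore node, so only the headline and the two paragraphs are returned. The inline link inside the first paragraph is kept because the paragraph is emitted as a whole; that is one simple way to avoid losing anchors, and not necessarily the treatment used in the thesis.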
Keywords/Search Tags: DOM, Node Type Annotation, informational content extraction