Design and Implementation of Web Page Crawling for a Vertical Search Engine

Posted on: 2010-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Chen
GTID: 2178360275986578
Subject: Computer technology
Abstract/Summary:
Internet technology is advancing rapidly and the volume of information on the network is growing quickly, so the share of that information covered by search engines is trending downward overall. At the same time, users demand ever higher quality from search results: their requirements have shifted from quantity to quality. Against this background, how to find more accurate and more valuable information quickly within a growing volume of network information has become a challenging and active research problem in the search engine field. In contrast to general-purpose search engines, which cover all fields and topics, a vertical search engine usually targets a single discipline or field and serves the specific needs of that field. It can therefore extract information from the network in greater depth and with greater accuracy, which makes search more targeted and yields higher precision and recall. By narrowing the search scope, a vertical search engine can quickly find more accurate and more valuable information, but it also places higher demands than a general-purpose search engine on the depth of web crawling and on the accuracy of the extracted data.

Based on the web page crawling requirements of a vertical search engine, and taking into account current web page authoring technology and the characteristics of web browser technology, this thesis formulates the overall system architecture and the design of each module.
For information extraction, the thesis studies an extraction method based on the IE kernel. The method builds on the IE kernel's DOM tree model: taking as input, through the DOM, the characteristic information of a tree node's content and its HTML tags, the program automatically generates a regular expression. The generated regular expression is then combined with the node's index in the whole DOM tree to locate the node containing the target content, and the result is extracted from that node.

For web crawling, the thesis studies a method of crawling web pages based on the IE kernel. The method lets the machine browse web pages automatically and, by having the machine fill in forms and simulate user clicks on the page, crawls dynamic pages and hidden web data held in the website's database.

Applying these methods, a platform that assists in creating web crawlers is constructed. Using the configuration data produced by the platform, the web page collection system does not depend on the structure of any particular website, is general-purpose, and has low implementation and maintenance costs.
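The extraction idea described above (a regular expression generated from a sample node, combined with that node's index in the DOM tree) can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis works on the IE kernel's DOM, while the sketch assumes Python's standard `html.parser` as a stand-in, and every name in it (`Node`, `TreeBuilder`, `learn_rule`, `extract`, the sample pages) is hypothetical. It also assumes that pages from the same site share one template, so that a rule learned on one page applies to another.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag; the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A minimal DOM-like node: tag name, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree from reasonably well-formed HTML (assumption)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    """Depth-first traversal of the tree."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def index_path(node):
    """Child indices leading from the root down to `node`."""
    path = []
    while node.parent is not None:
        path.append(node.parent.children.index(node))
        node = node.parent
    return path[::-1]


def learn_rule(sample_html, target):
    """Find the first node whose text contains `target` and derive a rule:
    (index path of the node, regex built from the text around the target)."""
    for node in iter_nodes(parse_html(sample_html)):
        text = node.text.strip()
        if target in text:
            prefix, _, suffix = text.partition(target)
            pattern = "^" + re.escape(prefix) + "(.+?)" + re.escape(suffix) + "$"
            return index_path(node), pattern
    return None


def extract(html, rule):
    """Apply a learned rule to another page; return the captured value or None."""
    path, pattern = rule
    node = parse_html(html)
    for i in path:
        if i >= len(node.children):
            return None  # page structure differs from the sample
        node = node.children[i]
    match = re.match(pattern, node.text.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "<html><body><div><h1>Widget</h1><p>Price: 25 USD</p></div></body></html>"
    other = "<html><body><div><h1>Gadget</h1><p>Price: 140 USD</p></div></body></html>"
    rule = learn_rule(sample, "25")
    print(rule[0], extract(other, rule))
```

In this sketch the index path narrows the search to a single node, and the regular expression then isolates the value inside that node's text. The rule is plain data (a list of indices and a pattern string), which is roughly what a configuration file produced by a crawler-authoring platform might store, though the thesis does not specify its configuration format.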
Keywords/Search Tags: Web Crawler, dynamic page crawling, IE kernel, Web Information Extraction, DOM