Design And Implementation Of Web Page Crawling For A Vertical Search Engine

Posted on:2010-02-19

Degree:Master

Type:Thesis

Country:China

Candidate:Z Chen

Full Text:PDF

GTID:2178360275986578

Subject:Computer technology

Abstract/Summary:

Internet technology is advancing rapidly and the volume of information on the web is growing quickly, so the share of that information covered by search engines is trending downward overall. At the same time, users expect ever higher quality from search results: their requirements have shifted from quantity to quality. In this context, how to quickly find more accurate and more valuable information within the growing volume of web information has become a challenging and actively studied problem in search engine research. Compared with general-purpose search engines, which cover all fields and topics, a vertical search engine usually serves only one discipline or field and meets the specific needs of that field. It can therefore extract information from the web in greater depth and with greater accuracy, which makes search more targeted and improves accuracy and recall. By narrowing the search scope, a vertical search engine can quickly find more accurate and more valuable information, but it also places higher demands than a general-purpose search engine on crawl depth and on the accuracy of the extracted data.
This thesis studies how to design and implement web page crawling that meets the needs of a vertical search engine. Based on the requirements for building a vertical search engine, and taking into account current web page authoring technology and the characteristics of web browser technology, it defines the overall system architecture and the design of each module.
For information extraction, the thesis studies an extraction method based on the IE browser kernel. The method works on the DOM tree model exposed by the IE kernel: taking as input the characteristic content of the tree nodes and their HTML tags, the program automatically generates a regular expression. By combining the generated regular expression with the index of the node within the whole DOM tree, it locates the node that contains the target content and extracts the result from that node.
For web crawling, the thesis studies a crawling method that is also based on the IE kernel. The method lets a program browse web pages automatically and, by filling in forms and simulating user clicks on the page, crawl dynamic pages and hidden web data held in a website's database.
Applying these methods, the thesis builds a platform that assists in creating web crawlers. Using the configuration data produced by the platform, the resulting web page collection system does not depend on the structure of any particular website, is general-purpose, and has low implementation and maintenance costs.
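The extraction idea described above (a node index in the DOM tree plus an automatically generated regular expression) can be illustrated with a minimal sketch. This is not the thesis's implementation: it swaps the IE kernel's DOM for a small tree built with Python's standard html.parser, assumes well-formed HTML without void elements, and uses hypothetical names (Node, TreeBuilder, find_path, build_pattern, extract) and made-up sample pages.

```python
import re
from html.parser import HTMLParser


class Node:
    """A minimal DOM-like node: tag name, parent, children, own text."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes every start tag has a matching end tag."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_path(node, sample, path=()):
    """Return the child-index path to the deepest node whose text contains sample."""
    for i, child in enumerate(node.children):
        found = find_path(child, sample, path + (i,))
        if found is not None:
            return found
    return path if sample in node.text else None


def build_pattern(text, sample):
    """Generate a regex from one example: literal context around a capture group."""
    before, _, after = text.partition(sample)
    group = r"([\d.,]+)" if re.fullmatch(r"[\d.,]+", sample) else r"(.+?)"
    return re.compile(
        re.escape(before.strip()) + r"\s*" + group + r"\s*" + re.escape(after.strip()) + "$"
    )


def extract(html, path, pattern):
    """Walk the index path in a new page, then apply the regex to that node's text."""
    node = parse(html)
    for i in path:
        if i >= len(node.children):
            return None  # page structure differs from the training page
        node = node.children[i]
    match = pattern.search(node.text.strip())
    return match.group(1) if match else None


# Training page: the operator marks "19.99" as the value to extract.
training_page = "<html><body><div><h1>Widget A</h1><p>Price: 19.99 USD</p></div></body></html>"
root = parse(training_page)
path = find_path(root, "19.99")

node = root
for i in path:
    node = node.children[i]
pattern = build_pattern(node.text, "19.99")

# A different page with the same layout reuses the stored path and pattern.
other_page = "<html><body><div><h1>Widget B</h1><p>Price: 1,204.50 USD</p></div></body></html>"
result = extract(other_page, path, pattern)
```

In this sketch the stored pair (path, pattern) plays the role of the configuration data: the index path narrows the search to one node, and the regular expression isolates the value inside that node's text.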

Keywords/Search Tags:

Web Crawler, dynamic page crawling, IE kernel, Web Information Extraction, DOM

Related items

1	Design And Implementation Of Web Crawler For Given Page
2	Research On Customized Web Information Crawling And Pushing Techniques
3	Web Information Crawling Applied In Fabric Textile Public Service Platform
4	The Research On Key Techniques For Page Segmentation Based Forum Crawler
5	Design And Implementation Of A Directional Information Extraction Model For Dynamic Web Pages
6	Design And Implement Of Distributed Commodity Information Web Crawler System
7	Research On Web Page Classification And Information Collection
8	Based On Templated Web Crawler Technology Of Web Page Information Extraction
9	Research On Web Crawling Strategies
10	Research On Topical Crawler Combining Web Page Content And Hyperlink