
Incremental Crawler About Forest Products Trade Web Information

Posted on: 2015-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Tian
Full Text: PDF
GTID: 2309330431459484
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, information is used everywhere and is conveniently available at any time. As the amount of information grows explosively, timely and accurate access to it becomes very important. To make effective use of Web information, Web pages must be downloaded from different Web sites, and the key information must then be saved to a local database after information extraction and information fusion. Within this process, the Web crawler is responsible for fetching the Web pages to local storage and is therefore the basis of the whole process. However, Web information is huge in volume, widely distributed and frequently changing, so obtaining enough usable data with limited time and resources is a great challenge for the traditional crawler. To solve this problem, the incremental crawler has become a hot research field in recent years.

First, this thesis analyzes the structure and characteristics of Web sites that publish forest products trade information, and then builds a template-based Web page crawler for that information. It also analyzes the characteristics of the noise in these Web pages and designs a denoising method based on LCS. Based on the assumption that Web page updates follow a Poisson distribution, the thesis proposes an algorithm that dynamically calculates the probability that a Web page has been updated, and uses this algorithm to design a method for crawling Web pages incrementally. Finally, three representative forest products trade Web sites are selected as experimental subjects for the system. The experimental results show that the system can accurately crawl forest products trade Web information and keep that information fresh while using less time and fewer resources.
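The abstract does not give the details of the update-probability algorithm, so the following is only a minimal sketch of the general idea under stated assumptions. If changes to a page follow a Poisson process with rate λ, the probability that the page has changed at least once in the t days since the last crawl is 1 - e^(-λt). The function names, the simple rate estimate (observed changes divided by observation time) and the 0.5 threshold are all hypothetical choices for illustration, not the thesis's method.

```python
import math


def estimate_rate(changes_observed: int, days_observed: float) -> float:
    """Estimate the change rate (changes per day) from crawl history.

    Assumption: a plain average; the thesis may use a different estimator.
    """
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    return changes_observed / days_observed


def update_probability(rate: float, days_since_crawl: float) -> float:
    """P(at least one change in the interval) under a Poisson process."""
    return 1.0 - math.exp(-rate * days_since_crawl)


def pages_to_recrawl(history: dict, threshold: float = 0.5) -> list:
    """Return URLs whose update probability reaches the threshold.

    history maps url -> (changes_observed, days_observed, days_since_crawl).
    The result is ordered from most to least likely to have changed.
    """
    due = []
    for url, (changes, days_observed, days_since) in history.items():
        rate = estimate_rate(changes, days_observed)
        prob = update_probability(rate, days_since)
        if prob >= threshold:
            due.append((prob, url))
    return [url for prob, url in sorted(due, reverse=True)]


if __name__ == "__main__":
    # Hypothetical crawl history for two made-up pages.
    history = {
        "http://example.com/prices": (10, 10.0, 2.0),   # changes about daily
        "http://example.com/about": (1, 100.0, 1.0),    # changes rarely
    }
    print(pages_to_recrawl(history))
```

Skipping pages with a low update probability is where an incremental crawler of this kind would save time and bandwidth; re-estimating the rate after each crawl is one way the probability could be kept "dynamic".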
Keywords/Search Tags: Forest Products Trade Information, Incremental Crawler, LCS, Poisson Distribution, Templates