
The Design And Research Of Topic Web Crawler In Vertical Search Engine

Posted on: 2017-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: L T Luo
GTID: 2308330485969626
Subject: Computer technology
Abstract/Summary:
In an era of information globalization, traditional general-purpose search engines can no longer meet people's needs for professional and personalized information, so building vertical search engines for specific domains has become urgent. Because the topic web crawler plays an important role in a topic search engine, its design directly affects the service quality of the search engine. Traditional topic web crawlers judge the relevance of a candidate URL to the topic mainly by analyzing the entire content of web pages. However, web pages now cover more and more topics, so this kind of whole-page analysis can distort the estimated relevance between the candidate URL and the topic. In recent years, research on topic crawlers has focused on two aspects: topic relevance algorithms and crawler search strategies. This paper proposes a hybrid crawler search strategy that is superior to the traditional topic web crawler. The main research work includes:

(1) Building on a study of the technologies related to topic crawlers, this paper reviews existing research results, which lays the theoretical foundation for the new crawling strategy it proposes.

(2) The hierarchical structure of a tree is applied to Bloom filter deduplication, and a multilayer Bloom filter (MLBF), based on the traditional Bloom filter, is proposed to remove duplicate URLs. Each layer of the Bloom filter consists of k independent hash functions and an m-bit array. A URL is treated as a set of segments separated by "/", which transforms URL deduplication into a path problem on a tree. The experimental results show that the improved multilayer Bloom filter has a lower false positive rate and improves crawling efficiency.

(3) Drawing on existing research, this paper puts forward a hybrid crawling strategy that combines web content evaluation with web link evaluation. In content evaluation, a Naive Bayes classifier that takes web content and anchor text as input is used to analyze the relevance of the candidate URL to the topic. In link evaluation, the HITS algorithm is used to identify Authority and Hub pages. The crawling strategy for a whole cycle is thus divided into two rounds, which improves the relevance of the crawled web pages to the topic.

(4) The Dewey Decimal Classification and link structure analysis are used to predict whether a URL is related to the topic. The relevance to the topic of the anchor text, the text surrounding the anchor, reverse pages, and reverse links is considered together, which avoids the "theme drift" phenomenon.

(5) After evaluating precision and recall in simulation and comparing the proposed crawling strategy with other algorithms, we conclude that the hybrid crawling strategy has clear advantages in the quality of web crawling.
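Item (2) above describes the multilayer Bloom filter only in outline, so the following is a minimal Python sketch of one possible reading, not the thesis's implementation. The abstract does not say how URL segments map to layers. The sketch assumes that layer i stores the path prefix ending at segment i, that the final key carries a "$" end marker, and that segments deeper than the last layer are folded into it. The class names, the salted SHA-256 hashing, and the default sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the hash index; the thesis does not name its hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


class MultiLayerBloomFilter:
    """One Bloom filter per URL depth; a URL is a root-to-leaf path."""

    def __init__(self, layers: int = 4, m: int = 1 << 16, k: int = 4):
        self.layers = [BloomFilter(m, k) for _ in range(layers)]

    def _keys(self, url: str):
        # Drop the scheme, then split host and path on "/".
        parts = url.split("://", 1)[-1].strip("/").split("/")
        n = len(self.layers)
        if len(parts) > n:
            # Fold segments deeper than the last layer into one segment.
            parts = parts[: n - 1] + ["/".join(parts[n - 1:])]
        keys = ["/".join(parts[: i + 1]) for i in range(len(parts))]
        keys[-1] += "$"  # end marker: distinguishes a URL from its prefixes
        return keys

    def seen(self, url: str) -> bool:
        # all() stops at the first layer that rejects its key, so a URL on
        # an unseen host is rejected after probing only layer 0.
        return all(key in layer for layer, key in zip(self.layers, self._keys(url)))

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True if it was not seen before."""
        if self.seen(url):
            return False
        for layer, key in zip(self.layers, self._keys(url)):
            layer.add(key)
        return True
```

A crawler frontier would call `add_if_new(url)` before enqueueing a link and skip the link when it returns `False`. As with any Bloom filter, a `False` result can occasionally be a false positive, so a small fraction of new URLs may be skipped.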
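The link evaluation step in item (3) relies on HITS, which can be sketched in Python as a short power iteration. This is the textbook form of the algorithm on a small in-memory link graph, not the thesis's crawler code. The function name `hits`, the dictionary-based graph format, and the fixed iteration count are illustrative assumptions.

```python
import math


def hits(graph: dict, iterations: int = 50):
    """Return (authority, hub) scores for a graph given as
    {page: [pages it links to]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link to the page.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]

        # Hub update: sum the authority scores of the pages it links to.
        hub = {u: sum(auth[v] for v in graph.get(u, ())) for u in nodes}

        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm

    return auth, hub
```

In a crawler, pages with high authority scores are candidates for content, and pages with high hub scores are candidates for link expansion. How these scores are combined with the Naive Bayes relevance score is not specified in the abstract.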
Keywords/Search Tags: Topic search, HITS, Naive Bayes, Hybrid crawling strategy