
A Study of an Improved Best-First Search Algorithm for Focused Crawlers

Posted on: 2016-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: F M Ding
Full Text: PDF
GTID: 2308330479984801
Subject: Computer system architecture
Abstract/Summary:
The rapid development of the Internet has produced massive information resources, and whether users can effectively find the resources that interest them depends largely on the performance of search engines. Faced with users' personalized needs, general-purpose search engines struggle to return satisfactory results. To overcome these limitations, research into intelligent, domain-specific search engines has become a trend, and vertical search engines have emerged as a result. In a vertical search engine, the focused crawler is like the human heart, playing a fundamental and key role. Given a target topic specified by the user, the focused crawler searches the Web intelligently and extracts topic-relevant pages quickly and accurately to meet the user's needs. Research on effective focused crawlers is therefore of great significance for improving the performance of vertical search engines.

This thesis studies the following three aspects.

First, it introduces the basic principles of web crawlers, analyzes the working process of a focused crawler, and discusses methods for describing the crawler's topic. Web page preprocessing techniques are analyzed in detail, including HTML tag handling, extraction of the page title and body content, extraction of anchor text, and Chinese word segmentation, laying a foundation for the subsequent topic relevance calculation.

Second, it analyzes the Best-First algorithm based on page content evaluation. The weight calculation of the vector space model considers only the frequency of feature words and ignores where they appear in the page. To address this shortcoming, the thesis proposes taking advantage of the emphasis that HTML tags place on text and using a tag-weighted term frequency to calculate the weights, improving the accuracy of topic relevance judgments. In addition, the greedy nature of the Best-First algorithm is discussed.
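The tag-weighted term weighting described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual method: the tag weights in `TAG_WEIGHTS`, the ASCII-only tokenizer (standing in for Chinese word segmentation), and the use of cosine similarity as the relevance measure are all hypothetical choices, since the abstract gives no concrete values or formulas.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the abstract does not state which tags are used or their values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags that carry a weight
        self.weights = Counter()  # term -> tag-weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among the open weighted tags, if any.
        w = max((TAG_WEIGHTS[t] for t in self.stack), default=DEFAULT_WEIGHT)
        # Simplified tokenizer; a real system would use a word segmenter here.
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += w


def page_vector(html):
    """Returns the tag-weighted term vector of an HTML page."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With plain term frequency, a page that mentions a topic word once in its title and a page that mentions it once in body text would score the same; with the weighting sketched here, the title occurrence counts for more, which is the kind of position sensitivity the abstract describes.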
Because the Best-First algorithm has difficulty finding globally optimal solutions, this thesis makes a small improvement to the focused crawler's search strategy: the crawler not only follows links with high topic similarity but also considers links with substantial long-term value, so that it can approach a globally optimal solution to some extent.

Finally, the thesis designs and implements a simple focused crawler system based on the above theoretical analysis. The experimental results show that the improved algorithm is effective and achieves higher precision and recall than both the Breadth-First algorithm and the traditional Best-First algorithm.
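One way to picture a search strategy that weighs long-term value alongside immediate topic similarity is a priority-queue frontier whose score blends the two. The sketch below is an assumption-laden illustration, not the thesis's algorithm: the linear blend, the parameter `alpha`, and the callables `fetch_links`, `relevance`, and `future_value` are all hypothetical, because the abstract does not say how long-term value is estimated or combined.

```python
import heapq
import itertools


def crawl(seeds, fetch_links, relevance, future_value, alpha=0.7, max_pages=100):
    """Best-first crawl with a blended link priority (hypothetical scoring).

    fetch_links(url)  -> iterable of out-link URLs found on the page
    relevance(url)    -> immediate topic similarity in [0, 1]
    future_value(url) -> estimate in [0, 1] that the link leads to relevant pages
    alpha             -> share of the score given to immediate relevance;
                         alpha=1.0 reduces to plain greedy Best-First
    """
    order = itertools.count()  # tie-breaker: equal scores pop in insertion order
    frontier = []              # min-heap of (-score, insertion index, url)
    seen = set(seeds)
    for url in seeds:
        heapq.heappush(frontier, (-1.0, next(order), url))

    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)  # highest-scoring link first
        visited.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = alpha * relevance(link) + (1 - alpha) * future_value(link)
            heapq.heappush(frontier, (-score, next(order), link))
    return visited
```

In a toy graph where a weakly relevant hub page links to several highly relevant pages, a purely greedy crawl with a small page budget may spend its budget on moderately relevant dead ends, while a blended score can reach the hub's targets sooner. Whether this helps in practice depends on how well the long-term value can be estimated.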
Keywords/Search Tags:Best-First algorithm, topic relevance, HTML tags, precision ratio, recall ratio