With the rapid development of science and technology and the accelerating growth of new knowledge and skills, the data resources available on the network are increasing geometrically. When users try to find specific resources in the massive volume of data on the Internet, traditional search engines are increasingly unable to cope. Accurately acquiring the required information has therefore become a research hotspot in the search industry, and the specialization and precision of vertical search engines greatly improve the retrieval of relevant information. As the core of a vertical search engine, the topic web crawler is mainly responsible for collecting web pages related to a given topic, and its performance directly affects the service quality of the search engine. At present, research on topic web crawlers focuses mainly on two aspects: search strategy and topic relevance calculation. This paper studies how to improve the performance of a web crawler from these two aspects. The specific work is as follows:

(1) Analysis and research of the crawler search strategy. First, the advantages and disadvantages of the link-based search strategy in the HITS algorithm are analyzed, and an improved algorithm is proposed to address the HITS algorithm's tendency to neglect new web pages in favor of old ones and its susceptibility to topic drift. When judging the importance of a web page, the improved algorithm introduces a function of time and number of comments, as well as a weight function based on the page's in-link and out-link relationships.

(2) Analysis of relevance algorithms based on the traditional vector space model (VSM). In the traditional vector space model, feature words are matched mechanically against the words in the text, and their weights depend only on term frequency and inverse document frequency. In this paper, a TF-IDF algorithm based on an improved vector space model assigns different weights to feature words according to the position in the text at which they appear. At the same time, to resolve the conflict between the number of feature words and their semantic relationships, a subject dictionary, a synonym dictionary and an inclusive dictionary are created, and each feature word is weighted according to the dictionary to which it belongs.

Finally, a new crawling method is obtained by combining the improved HITS algorithm with the VSM similarity judgment. The improved topic crawler algorithm is tested on web pages covering different topics. The experimental results show that it effectively improves the accuracy with which topic-relevant web pages are collected.
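To make the two building blocks concrete, the following Python sketch shows the baseline HITS iteration and a position-weighted TF-IDF similarity. It is only an illustration under stated assumptions, not the improved algorithm of this paper: the abstract does not give the time/comment function, the in-link/out-link weight function, or the dictionary weights, so none of them are modeled here. The position names and values in `POSITION_WEIGHTS`, the smoothed IDF formula, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state actual values.
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def hits(links, iterations=50):
    """Baseline HITS. `links` maps each page to the pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # L2-normalize both score sets so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def tfidf_vector(fields, doc_freq, n_docs):
    """Position-weighted TF-IDF. `fields` maps a position name to its tokens."""
    tf = Counter()
    for position, tokens in fields.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            tf[token] += weight  # an occurrence counts more in a heavier position
    # Smoothed IDF (an assumed variant) so unseen terms do not divide by zero.
    return {
        t: f * (math.log((1 + n_docs) / (1 + doc_freq.get(t, 0))) + 1.0)
        for t, f in tf.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In a crawler along these lines, `cosine(tfidf_vector(page, df, n), tfidf_vector(topic, df, n))` could serve as the topic relevance of a fetched page, while the authority scores from `hits` could help order the queue of links still to be visited. How the two scores are actually combined is not specified in the abstract.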