| With the advent of the big data era and the continuing explosion of information, general search engines help people obtain information on the Internet, but the technology they involve is complex and their software and hardware requirements are high. Moreover, the value density of the results they return often fails to meet the needs of specific fields. Vertical search engines have emerged to solve this problem. Their core technology is the focused crawler, which recognizes and filters web page content and captures as much topic-relevant information as possible, thus helping people obtain information more effectively. Focused crawlers remain an active research topic in three areas: extracting topic information from web pages, identifying the topic relevance of web pages, and implementing distributed systems. The main contributions of this thesis are as follows: 1. An improved web page keyword extraction algorithm based on TextRank is proposed. The TextRank algorithm has two main problems. First, the initial weights of all candidate keywords are set to the same value during keyword extraction. Second, weight transfer relies solely on distributing weight among adjacent nodes. To address these two problems, this thesis proposes M_TextRank, a web page keyword extraction algorithm that improves TextRank in two respects: the initial weights of candidate keywords and the transition probabilities. The algorithm assigns initial weights to candidate keyword nodes based on a combined analysis of word frequency and position, and iteratively transfers candidate keyword weights by integrating coverage factors, location factors, and word frequency factors, thus improving keyword extraction on news-type web page text. 2. A web page topic relevance discrimination algorithm based on the Word2Vec model is proposed. Determining topic relevance by exact word matching under the Boolean model loses a great deal of topic-related semantic information. Furthermore, one-hot word representations easily lead to the "curse of dimensionality" and "high sparsity" problems. To address these two problems, this thesis proposes a calculation method based on the Word2Vec model that uses cosine distance to judge the relevance between the keyword vectors extracted from the web page under evaluation and the topic keyword vectors set by the user. In addition, an indexing method based on product quantization and line quantization is used to accelerate word vector matching, thus meeting the requirements of large-scale topic similarity comparison over word vectors. 3. A distributed focused crawler system based on the Scrapy framework is proposed. This thesis designs the crawler system at three levels (physical architecture, system architecture, and functional modules) and designs and implements the scheduler, downloader middleware, crawler, topic discriminator, URL extractor, data pipeline, request deduplication, and data storage modules. Finally, the running performance of the system is tested and verified through experiments. In summary, this thesis studies and improves key technologies such as web page topic information extraction and relevance identification, extends the Scrapy crawler framework with distributed technology, and designs a distributed focused crawler. The experiments show that the proposed methods are generally effective and have some practical application value. |
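
The relevance check described in contribution 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy three-dimensional vectors, the "best match per page keyword, then average" aggregation, and the 0.6 threshold are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model, and the exhaustive comparison would be replaced by the quantization-based index the abstract mentions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_relevance(page_vectors, topic_vectors):
    """Average, over page keywords, of the best similarity to any topic keyword.

    This aggregation rule is an assumption; the abstract only states that
    relevance is judged from cosine distance between the two vector sets.
    """
    if not page_vectors or not topic_vectors:
        return 0.0
    best = [
        max(cosine_similarity(p, t) for t in topic_vectors)
        for p in page_vectors
    ]
    return sum(best) / len(best)


def is_relevant(page_vectors, topic_vectors, threshold=0.6):
    """Hypothetical decision rule: keep the page if its score reaches the threshold."""
    return topic_relevance(page_vectors, topic_vectors) >= threshold


# Hand-made toy vectors standing in for Word2Vec embeddings (hypothetical values).
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.2, 0.1],
    "league": [0.7, 0.3, 0.0],
    "recipe": [0.0, 0.1, 0.9],
    "oven": [0.1, 0.0, 0.8],
}

topic = [TOY_VECTORS["football"]]
sports_page = [TOY_VECTORS["soccer"], TOY_VECTORS["league"]]
cooking_page = [TOY_VECTORS["recipe"], TOY_VECTORS["oven"]]

print(is_relevant(sports_page, topic))   # sports keywords lie close to the topic
print(is_relevant(cooking_page, topic))  # cooking keywords lie far from the topic
```

Unlike exact Boolean matching, this comparison can rate "soccer" as close to the topic keyword "football" even though the two strings never match, which is the semantic information the abstract says is otherwise lost.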