Design And Implementation Of Distributed Focused Crawler System For Text Data

Posted on:2024-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:T Li

Full Text:PDF

GTID:2568307079971499

Subject:Electronic information

Abstract/Summary:

With the advent of the big data era and the continuing information explosion, general search engines help people obtain information on the Internet, but the technology involved is complex and the software and hardware requirements are high. The value density of the results returned by general search engines rarely meets the needs of specific fields. Vertical search engines have emerged to solve this problem. Their core technology is the focused crawler, which recognizes and filters web page content and captures as much topic-relevant information as possible, thus helping people obtain information more effectively. Focused crawlers remain a research hotspot in three areas: extracting web page topic information, identifying web page topic relevance, and distributing the crawler system. The main research work of this thesis is as follows:

1. An improved webpage keyword extraction algorithm based on TextRank is proposed. The TextRank algorithm has two main problems. First, the initial weights of all candidate keywords are set to the same value during keyword extraction. Second, weight is transferred only by distributing it between adjacent nodes. To address these two problems, this thesis proposes M_TextRank, a webpage keyword extraction algorithm that improves TextRank in two respects: the initial weight of candidate keywords and the transition probability. The algorithm assigns initial weights to candidate keyword nodes based on a combined analysis of word frequency and position, and iteratively transfers candidate keyword weights by integrating coverage factors, location factors, and word frequency factors, thus improving keyword extraction on news-type web page texts.

2. A webpage topic relevance discrimination algorithm based on the Word2Vec model is proposed. Determining topic relevance by exact word matching under the Boolean model loses a lot of topic-related semantic information. Furthermore, one-hot representation of words easily leads to the "curse of dimensionality" and "high sparsity" problems. To address these two problems, this thesis proposes a calculation method based on the Word2Vec model, which uses cosine distance to judge the relevance between the keyword vectors extracted from the evaluated webpage and the topic keyword vectors set by the user. In addition, an index method based on product quantization and line quantization is used to accelerate word vector matching, thus meeting the requirements of large-scale topic similarity comparison of word vectors.

3. A distributed focused crawler system based on the Scrapy framework is proposed. This thesis designs the crawler system as a whole from three aspects: physical architecture, system architecture, and functional modules. It designs and implements the scheduler, downloader middleware, crawler, topic discriminator, URL extractor, data pipeline, request deduplication, and data storage modules. Finally, the running effect of the system is tested and verified through experiments.

This thesis studies and improves key technologies such as webpage topic information extraction and relevance identification, extends the Scrapy crawler framework with distributed technology, and designs a distributed focused crawler. The experiments show that the research is generally effective and has some application value.
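
The keyword extraction idea in point 1 can be illustrated with a small Python sketch. It is not the thesis's M_TextRank: the abstract does not give the formulas, so the weighting used here (relative word frequency, doubled for words that appear in the title) and the names `weighted_textrank`, `title_tokens`, and `window` are assumptions made for illustration. The sketch only shows the two changes the abstract names: non-uniform initial weights, and weight transfer that is biased toward higher-weight neighbors.

```python
from collections import Counter, defaultdict


def weighted_textrank(tokens, title_tokens=(), window=3, damping=0.85, iters=50):
    """Rank candidate keywords using non-uniform initial weights.

    Assumed weighting (not from the thesis): relative frequency,
    doubled when the word also occurs in the title.
    """
    freq = Counter(tokens)
    words = list(freq)
    title = set(title_tokens)

    # Initial weight from frequency and position, normalised to sum to 1.
    raw = {w: freq[w] / len(tokens) * (2.0 if w in title else 1.0) for w in words}
    total = sum(raw.values())
    prior = {w: raw[w] / total for w in words}

    # Undirected co-occurrence graph built with a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)

    score = dict(prior)
    for _ in range(iters):
        new = {}
        for w in words:
            incoming = 0.0
            for v in neighbors[w]:
                # v hands out its score in proportion to each neighbor's prior,
                # instead of splitting it evenly as unweighted TextRank does.
                norm = sum(prior[u] for u in neighbors[v])
                incoming += score[v] * prior[w] / norm
            new[w] = (1 - damping) * prior[w] + damping * incoming
        score = new
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    body = ["crawler", "topic", "crawler", "page", "topic", "crawler", "link"]
    for word, value in weighted_textrank(body, title_tokens=["crawler"]):
        print(f"{word}: {value:.3f}")
```

In this toy input, "crawler" is both the most frequent word and a title word, so it receives the largest initial weight and the largest share of its neighbors' scores. A real pipeline would first tokenize the page and filter candidates by part of speech, which the sketch leaves out.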
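
The relevance check in point 2 can be sketched in the same way. The abstract states only that page keyword vectors and topic keyword vectors are compared by cosine distance. The aggregation below (for each topic keyword, take its best match among the page keywords, then average) and the names `topic_relevance` and `TOY_VECTORS` are assumptions. The three-dimensional vectors are invented toy values; in practice they would come from a trained Word2Vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_relevance(page_keywords, topic_keywords, vectors):
    """Average, over topic keywords, of the best cosine match among page keywords."""
    page_vecs = [vectors[w] for w in page_keywords if w in vectors]
    topic_vecs = [vectors[w] for w in topic_keywords if w in vectors]
    if not page_vecs or not topic_vecs:
        return 0.0  # nothing comparable, treat the page as irrelevant
    best = [max(cosine(t, p) for p in page_vecs) for t in topic_vecs]
    return sum(best) / len(best)


# Toy embeddings, invented for illustration only.
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.2, 0.1],
    "goal": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}

if __name__ == "__main__":
    topic = ["football"]
    print(topic_relevance(["soccer", "goal"], topic, TOY_VECTORS))
    print(topic_relevance(["stock", "market"], topic, TOY_VECTORS))
```

Exact matching under the Boolean model would score both pages zero against the topic "football", because neither contains that word. With dense vectors, the sports page scores far higher than the finance page. The brute-force `max` over all page vectors is what an approximate index, such as the quantization-based one mentioned in the abstract, would replace at scale.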

Keywords/Search Tags:

Focused Crawler, Keyword Extraction, Similarity Detection, Distributed

Related items

1	Research And Application On The Key Technology Of Focused Crawler
2	Research On Topic Focused Web Crawler And Related Technologies
3	Research And Implement Of Distributed Focused Crawler
4	Design And Implementation Of Topic-focused Crawler For Education News
5	Research On Search Strategy And Key Techniques Of Focused Crawler
6	The Design And Implementation Of The Topic-focused Web Crawler System
7	Research And Implementation Of Focused Crawler Based On Distributed Strategy
8	Focused Crawler Based On Domain Ontology And Similarity Concept Context Graph
9	Research And Implementation Of Focused Crawler Based On URL Patterns
10	Research And Implementation Of On Semi-automatic Ontology Construction Base On WordNet And Focused Crawler