Optimization And Implement Of The Topic Web Crawler Correlation Algorithms

Posted on:2020-06-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y Gao

Full Text:PDF

GTID:2428330578980185

Subject:Control Engineering

Abstract/Summary:

With the rapid development of science and technology and the accelerating pace of new knowledge and skills, the data resources on the Internet are growing geometrically. Traditional search engines are increasingly unable to cope when users try to retrieve the data they need from this mass of information. Accurate acquisition of the required information has therefore become a research hotspot in the search industry, and vertical search engines, being specialized and precise, greatly improve the retrieval of relevant information. As the core of a vertical search engine, the topic web crawler is responsible for collecting web pages related to a given topic, and its performance directly affects the service quality of the search engine. Current research on topic web crawlers focuses on two aspects: search strategy and topic relevance calculation. This thesis studies how to improve crawler performance from both aspects. The specific work is as follows:

(1) Analysis of the crawler search strategy. The advantages and disadvantages of the link-based search strategy in the HITS algorithm are analyzed first, and an improved algorithm is proposed to address the tendency of HITS to neglect new web pages, favor old web pages, and drift off topic. When judging the importance of a web page, the improved algorithm introduces a function of time and number of comments, as well as a weight function based on the page's in-link and out-link relationships.

(2) Analysis of relevance algorithms based on the traditional vector space model (VSM). In the traditional vector space model, feature words are matched mechanically against the words in the text, and their weights depend only on term frequency and inverse document frequency. This thesis uses a TF-IDF algorithm based on an improved vector space model that assigns feature words different weights according to their position in the text. In addition, to resolve the contradiction between the number of feature words and their semantic relationships, a subject dictionary, a synonym dictionary, and an inclusive dictionary are created, and weights are assigned according to the dictionary to which each feature word belongs.

Finally, a new crawling method is obtained by combining the improved HITS algorithm with VSM similarity judgment. The improved topic crawler algorithm is tested on web pages from different topics. The experimental results show that it effectively improves the accuracy with which relevant web pages are collected.
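The abstract names its two building blocks (HITS link analysis and position-weighted TF-IDF similarity in a vector space model) but does not give formulas. The following Python sketch is only an illustration of how such pieces might fit together, not the thesis's method. It uses the standard, unmodified HITS iteration, because the time, comment, and link-weight functions are not specified in the abstract. The position weights in `POSITION_WEIGHTS`, the smoothed IDF, the linear blend in `crawl_priority`, and all function and parameter names are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical position weights. The abstract says feature words are weighted
# by where they appear in the text but gives no values; these are assumptions.
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def tf_idf_vector(page, doc_freq, n_docs):
    """Build a position-weighted TF-IDF vector for one page.

    page: dict mapping a position name ("title", "body", ...) to a token list.
    doc_freq: dict mapping a term to the number of documents containing it.
    n_docs: total number of documents in the reference collection.
    """
    counts = Counter()
    for position, tokens in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            counts[token] += weight  # occurrences in heavier positions count more
    total = sum(counts.values())
    vector = {}
    for term, count in counts.items():
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = (count / total) * idf
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hits(links, iterations=50):
    """Standard (unmodified) HITS. links maps a page to the pages it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


def crawl_priority(authority, similarity, alpha=0.5):
    """Hypothetical linear blend of link score and content score."""
    return alpha * authority + (1.0 - alpha) * similarity
```

In a crawler built along these lines, each candidate URL would be scored with `crawl_priority` and the frontier ordered by that score. How the thesis actually combines the two signals is not stated in the abstract.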

Keywords/Search Tags:

topic web crawler, HITS Algorithm, VSM, relevance calculation

Related items

1	Research On The Topic Crawler Algorithm Based On Vector Space Model
2	Research On Topic Crawler Of Combining Content With Link Structure
3	Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search
4	Investigation On Web Crawler Technology Based On Hadoop Platform
5	The Research Of Topic Crawler Search Strategy Based On Genetic Algorithm
6	Research On The Key Technology Of Focused Crawler
7	Design And Implementation Of Multithreading Web Crawler Oriented Topic
8	Design And Implementation Of Vertical Search Engine Based On Improved HITS Algorithm
9	Topic Crawler Based On Improved VIPS Algorithm And Improved Grey Wolf Optimization Algorithm
10	Design And Implementation Of University Topic Crawler Based On BP Network