The Application And Research Of Chinese Word Segmentation And Web Deduplication In News Vertical Search Engine

Posted on:2015-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:X S Li

Full Text:PDF

GTID:2298330422485396

Subject:Signal and Information Processing

Abstract/Summary:

With the rise of the Internet, the volume of web information has grown almost exponentially. Search engines are among the most important tools for obtaining web information. Traditional search engines collect web information on a large scale with web spiders, but the information they collect contains complete or partial duplicates. A lot of the information that traditional search engines gather is not needed by users and increases the burden of searching the Internet. Vertical search engines, on the other hand, can retrieve the web information that users care about most. Compared with traditional search engines, vertical search engines crawl only one specific content area.

This paper first describes the working principles of the vertical search engine and discusses several of its key technologies. These include web spiders, Chinese word segmentation, web page preprocessing, web page deduplication, and web indexing and retrieval. We describe in detail six functional modules involving these technologies and provide a concrete implementation.

This paper designs a multithreaded web spider that can efficiently crawl content from the Internet. The spider uses a Bloom filter to filter out URLs that have already been processed. We use the open-source Lucene library to build the web index. This paper studies Chinese word segmentation and implements an algorithm based on bidirectional maximum matching and statistical analysis of two kinds of disambiguation rules. The experimental results show that the algorithm substantially improves word disambiguation and segmentation accuracy. In addition, this paper studies web page deduplication and proposes an algorithm based on web content length and web topic content. This algorithm outperforms the traditional web page deduplication algorithms based on web topic content and on MinHash.
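To illustrate the bidirectional maximum matching step named above, the following is a minimal Python sketch, not the thesis's implementation. The function names, the toy dictionary `SAMPLE_VOCAB`, and the tie-breaking heuristic (fewer words, then fewer single-character words, then the backward result) are assumptions chosen for illustration; the abstract does not spell out its statistical disambiguation rules, so they are not reproduced here.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def count_single_chars(words):
    return sum(1 for w in words if len(w) == 1)


def bidirectional_max_match(text, vocab):
    """Run both scans and pick one result with a simple heuristic (an assumption of this sketch)."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same word count: prefer fewer single-character words, backward result on a tie.
    return fwd if count_single_chars(fwd) < count_single_chars(bwd) else bwd


# Hypothetical toy dictionary, used only for this example.
SAMPLE_VOCAB = {"研究", "研究生", "生命", "的", "起源"}
```

With this toy dictionary, `bidirectional_max_match("研究生命的起源", SAMPLE_VOCAB)` yields `["研究", "生命", "的", "起源"]`: the forward scan greedily takes "研究生" and leaves "命" stranded as a single character, so the heuristic selects the backward result.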
Based on the algorithms described above, we have built a news vertical search engine. The final test results show that this search engine largely implements the algorithms described and meets our expectations.
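One component of the engine summarized above is the spider's Bloom filter for already-processed URLs. The Python sketch below shows how such a filter might work. The class name `BloomFilter`, the helper `filter_new_urls`, the bit-array size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions of this sketch; the abstract does not state which parameters or hash functions the thesis uses.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + k * h2) % self.num_bits for k in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_new_urls(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording each one as it is accepted."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A spider could call `filter_new_urls(extracted_links, seen)` after parsing each page and enqueue only the returned URLs. Because a Bloom filter can report false positives, a small fraction of unseen URLs may be skipped; this is the usual trade-off for constant memory per crawl. In a multithreaded spider, access to the filter would also need a lock, which this sketch omits.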

Keywords/Search Tags:

News Search Engine, Web Spider, Web Content Extraction, Chinese Word Segmentation, Web Deduplication

Related items

1	Research On Some Key Technologies Of Enterprise Search Engine
2	The Vertical Search Engine Research And Design
3	Professional Search Engine Research And Design
4	Financial News Feed System
5	Design And Implementation Of News-Collecting System
6	On The Research And Development Of A Video Search Engine For Chinese Web
7	Design And Implementation Of A Spider For Topic-Specific Search Engine
8	Research And Achievement Of The Search Strategic For The Topic Search Engine Spider
9	Research Of Vertical Search Engine Based On Web
10	The Design And Application Of Chinese Intelligent Search Engine