The Application And Research Of Chinese Word Segmentation And Web Deduplication In News Vertical Search Engine
Posted on: 2015-01-03 | Degree: Master | Type: Thesis | Country: China | Candidate: X S Li | GTID: 2298330422485396 | Subject: Signal and Information Processing

Abstract/Summary:

With the rise of the Internet, the amount of web information has grown almost exponentially. The search engine is one of the most important tools for retrieving web information. Traditional search engines collect web information on a large scale with web spiders, but the information they collect contains complete or partial duplicates. A lot of the information that traditional search engines collect is not needed by users and increases the burden of searching the Internet. Vertical search engines, on the other hand, can collect the web information that users care about most. Compared with traditional search engines, vertical search engines crawl only one specific content area.

This paper first describes the working principles of the vertical search engine and discusses several of its key technologies: web spider, Chinese word segmentation, web preprocessing, web deduplication, web indexing, and retrieval. We describe in detail six functional modules involving these technologies and give a concrete implementation.

This paper designs a multithreaded web spider that can efficiently crawl Internet content. The spider uses a Bloom filter to filter out URLs that have already been processed. We use the open-source library Lucene to build the web index. This paper studies Chinese word segmentation and implements an algorithm based on bidirectional maximum matching and statistical analysis of two kinds of disambiguation rules. The experimental results show that the algorithm brings a large improvement in word disambiguation and correct word segmentation.
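As an illustration of bidirectional maximum matching, the following is a minimal Python sketch and not the thesis's implementation. The function names, the `DEMO_VOCAB` dictionary, the `max_len` parameter, and the tie-breaking rule (prefer fewer words, then fewer single-character words, then the backward result) are all assumptions made for this example. The abstract does not specify the thesis's statistical disambiguation rules, so they are not reproduced here.

```python
# Hypothetical toy dictionary; a real segmenter would load a large lexicon.
DEMO_VOCAB = {"研究", "研究生", "生命", "命", "的", "起源"}


def forward_max_match(text, vocab, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, vocab, max_len=4):
    """Run both scans and choose one result with simple heuristics."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if fwd == bwd:
        return fwd

    # Assumed heuristic (not from the thesis): fewer words is better,
    # then fewer single-character words; ties go to the backward scan.
    def score(words):
        return (len(words), sum(1 for w in words if len(w) == 1))

    return fwd if score(fwd) < score(bwd) else bwd


if __name__ == "__main__":
    print(bidirectional_max_match("研究生命的起源", DEMO_VOCAB))
```

On the sample sentence the two scans disagree: the forward scan greedily takes "研究生" first, while the backward scan keeps "生命" intact. The heuristic picks the backward result because it contains fewer single-character words.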
In addition, this paper studies web page deduplication and proposes an algorithm based on web content length and web topic content. This algorithm outperforms the traditional web page deduplication algorithm based on web topic content and the MinHash algorithm. Based on the algorithms mentioned above, we have built a news vertical search engine. The final test results show that this search engine has basically implemented the algorithms described and meets our expectations.

Keywords/Search Tags: News Search Engine, Web Spider, Web Content Extraction, Chinese Word Segmentation, Web Deduplication
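The abstract names MinHash as one of the baselines for web page deduplication. The sketch below shows, under stated assumptions, how a MinHash comparison can be combined with a cheap content-length check. It is not the algorithm proposed in the thesis, whose details the abstract does not give. The function names, the character-shingle size `k`, the number of hash functions, and both thresholds are hypothetical choices for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-character shingles, ignoring whitespace."""
    text = "".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each hash function, keep the minimum hash over all shingles."""
    sh = shingles(text, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_near_duplicate(a, b, length_ratio=0.8, threshold=0.8):
    """Assumed two-step check: length pre-filter, then MinHash similarity."""
    len_a, len_b = len(a), len(b)
    # Pages whose lengths differ too much are treated as distinct
    # without computing signatures (thresholds are illustrative).
    if min(len_a, len_b) < length_ratio * max(len_a, len_b):
        return False
    sim = estimated_similarity(minhash_signature(a), minhash_signature(b))
    return sim >= threshold


if __name__ == "__main__":
    page = "news story about the search engine project"
    print(is_near_duplicate(page, page))
    print(is_near_duplicate(page, "short"))
```

The length check runs first because it costs almost nothing, so the more expensive signature computation is only paid for page pairs of comparable size.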