
Research On Near-duplicate Detection Algorithm

Posted on: 2010-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: X G Mao
Full Text: PDF
GTID: 2178360275451807
Subject: Computer application technology
Abstract/Summary:
The rapid popularization and development of the Internet has confronted people with a sea of information, and it has become essential to extract the truly important information from it. The search engine (here mainly the full-text retrieval system) is the tool that provides this function. However, the results returned by search engines contain a large number of duplicated web pages, which mainly come from reproduction among websites. These repetitive pages occupy network bandwidth and waste storage resources; users do not want to see piles of results with identical or nearly identical content, and the truly useful results are often drowned in this redundant information and cannot easily be discovered. Effectively removing duplicate web pages improves retrieval accuracy and saves users time and effort, while the search system itself saves network bandwidth, reduces storage cost and improves the overall quality of the search engine.

This thesis studies the problem of removing duplicated web pages for search engines. First, it briefly introduces the working principle of search engines, their development status, existing deficiencies and development trends, as well as the research background and significance of this work. Second, it analyzes current international and domestic research in this field, giving a detailed account of the origin and history of near-duplicate detection algorithms, their classification and representative algorithms, including the efficiency, advantages, disadvantages and improvements of each. Particular attention is paid to two classic algorithms, shingling and simhash, on whose ideas many later algorithms are based; Google in fact uses simhash for near-duplicate detection. Simhash is practically useful for identifying near-duplicates among web documents in a multi-billion-page repository. It is a fingerprinting technique with the property that fingerprints of near-duplicate documents differ in only a small number of bit positions. Simhash is also a dimensionality-reduction technique: it maps high-dimensional vectors to small fingerprints. Applied to web pages, it works as follows: a page is first converted into a set of features, each tagged with a weight; the features are computed using standard IR techniques such as tokenization, case folding, stop-word removal, stemming and phrase detection. The set of weighted features constitutes a high-dimensional vector, with one dimension per unique feature over all documents taken together. Simhash transforms such a high-dimensional vector into an f-bit fingerprint where f is small, say 64.

Finally, the thesis introduces in detail CLucene, an open-source project that is used in many important systems and has been widely praised, and shows how to build a search engine with CLucene for indexing and searching. CLucene provides a rich set of API functions through which a search engine can be built with little effort. The main classes, data structures and system architecture are described, along with how indexing, searching and text analysis are implemented.
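To make the fingerprinting step concrete, the following is a minimal C++ sketch of the simhash construction described above. The function names and the use of std::hash as the per-feature 64-bit hash are illustrative assumptions, not the implementation used in the thesis or by Google.

```cpp
#include <bitset>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Compute a 64-bit simhash fingerprint from a set of weighted features.
// For every bit position we add the feature's weight if the hashed bit is 1
// and subtract it if the bit is 0; the sign of each accumulated sum gives
// one bit of the final fingerprint.
uint64_t simhash(const std::vector<std::pair<std::string, double>>& features) {
    double v[64] = {0.0};
    std::hash<std::string> hasher;  // stand-in for any 64-bit feature hash
    for (const auto& [feature, weight] : features) {
        const uint64_t h = hasher(feature);
        for (int i = 0; i < 64; ++i)
            v[i] += (h >> i & 1ULL) ? weight : -weight;  // per-bit weighted vote
    }
    uint64_t fingerprint = 0;
    for (int i = 0; i < 64; ++i)
        if (v[i] > 0) fingerprint |= (1ULL << i);
    return fingerprint;
}

// Near-duplicate pages yield fingerprints that differ in only a few bit
// positions, so the Hamming distance between fingerprints is the test.
int hamming_distance(uint64_t a, uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

Two pages would then be reported as near-duplicates when the Hamming distance between their fingerprints falls below a small threshold, for example a few bits out of 64.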
Keywords/Search Tags: Search Engine, Near-duplicate Detection, Shingling, Simhash