
Research On Near-duplicate Detection Algorithm

Posted on: 2010-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: X G Mao
Full Text: PDF
GTID: 2178360275451807
Subject: Computer application technology
Abstract/Summary:
The rapid popularization and development of the Internet has confronted people with a sea of information, and it has become essential to extract the truly important information from it. The search engine (here mainly the full-text retrieval system) is the tool that provides this function. However, the results returned by search engines contain a large number of duplicated web pages, which mainly come from reproduction among websites. These repetitive pages occupy network bandwidth and waste storage resources; users do not want to see piles of results with identical or nearly identical content, and the truly useful results are often drowned in this redundant information and cannot easily be discovered. Effectively removing duplicate web pages improves retrieval accuracy and saves users time and effort, while the search system itself saves network bandwidth, reduces storage cost and improves the overall quality of the search engine.

This thesis studies the problem of removing duplicated web pages for search engines. First, it briefly introduces the working principle of search engines, their development status, existing deficiencies and development trends, as well as the research background and significance of this work. Second, it analyzes current international and domestic research in this field, giving a detailed account of the origin and history of near-duplicate detection algorithms, their classification and representative algorithms, including the efficiency, advantages, disadvantages and improvements of each. Particular attention is paid to two classic algorithms, shingling and simhash, on whose ideas many later algorithms are based; Google in fact uses simhash for near-duplicate detection. Simhash is practically useful for identifying near-duplicates among web documents in a multi-billion-page repository. It is a fingerprinting technique with the property that fingerprints of near-duplicate documents differ in only a small number of bit positions. Simhash is also a dimensionality-reduction technique: it maps high-dimensional vectors to small fingerprints. Applied to web pages, it works as follows: a page is first converted into a set of features, each tagged with a weight; the features are computed using standard IR techniques such as tokenization, case folding, stop-word removal, stemming and phrase detection. The set of weighted features constitutes a high-dimensional vector, with one dimension per unique feature over all documents taken together. Simhash transforms such a high-dimensional vector into an f-bit fingerprint where f is small, say 64.

Finally, the thesis introduces in detail CLucene, an open-source project that is used in many important systems and has been widely praised, and shows how to build a search engine with CLucene for indexing and searching. CLucene provides a rich set of API functions through which a search engine can be built with little effort. The main classes, data structures and system architecture are described, along with how indexing, searching and text analysis are implemented.
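To make the fingerprinting step concrete, the following is a minimal C++ sketch of the simhash construction described above. The function names and the use of std::hash as the per-feature 64-bit hash are illustrative assumptions, not the implementation used in the thesis or by Google.

```cpp
#include <bitset>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Compute a 64-bit simhash fingerprint from a set of weighted features.
// For every bit position we add the feature's weight if the hashed bit is 1
// and subtract it if the bit is 0; the sign of each accumulated sum gives
// one bit of the final fingerprint.
uint64_t simhash(const std::vector<std::pair<std::string, double>>& features) {
    double v[64] = {0.0};
    std::hash<std::string> hasher;  // stand-in for any 64-bit feature hash
    for (const auto& [feature, weight] : features) {
        const uint64_t h = hasher(feature);
        for (int i = 0; i < 64; ++i)
            v[i] += (h >> i & 1ULL) ? weight : -weight;  // per-bit weighted vote
    }
    uint64_t fingerprint = 0;
    for (int i = 0; i < 64; ++i)
        if (v[i] > 0) fingerprint |= (1ULL << i);
    return fingerprint;
}

// Near-duplicate pages yield fingerprints that differ in only a few bit
// positions, so the Hamming distance between fingerprints is the test.
int hamming_distance(uint64_t a, uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```

Two pages would then be reported as near-duplicates when the Hamming distance between their fingerprints falls below a small threshold, for example a few bits out of 64.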
Keywords/Search Tags: Search Engine, Near-duplicate Detection, Shingling, Simhash