A Fast Text Elimination Algorithm Based On Simhash

Posted on: 2015-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2268330428497860
Subject: Software engineering
Abstract/Summary:
At present, the World Wide Web contains a large number of pages with identical or nearly identical content. Filtering these pages out lets a search engine improve retrieval efficiency and reduce storage costs. Because so much web content is redundant, a web crawler needs to identify duplicate pages quickly and accurately among the massive number of pages it crawls and downloads.

A traditional hash algorithm assigns the original page content a value that behaves like a random number. If two hash values are equal, the original pages can be taken as identical under certain conditions; if they differ, the pages are not identical. In short, a traditional hash algorithm can only determine whether two pages have the same content. It cannot tell whether two pages are similar, so identifying similar pages with it is difficult. Beyond detecting identical content, a detection method should also capture the degree of difference between pages.

The classic method for comparing web content similarity is the vector cosine function. Its idea is to build a vector of word frequencies for each page and then compute the cosine of the angle between the vectors of two pages. Because a page contains many feature words, the resulting vector space has a high dimension, so computing the cosine becomes too costly and can exceed the expected time and space budget.
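The contrast between exact hashing and the cosine measure can be sketched in Python as follows. This is only an illustration, not the thesis's implementation: the function names, whitespace tokenization, and the choice of SHA-256 as the exact hash are all assumptions.

```python
import hashlib
import math
from collections import Counter


def exact_duplicate(text_a: str, text_b: str) -> bool:
    """Traditional hashing: equal digests mean identical content, nothing more."""
    digest_a = hashlib.sha256(text_a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(text_b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-frequency vectors of two texts."""
    # Assumption: lowercase whitespace tokens stand in for "feature words".
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[word] * freq_b[word] for word in shared)

    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Each distinct word is one dimension of the vector, so the cost of every pairwise comparison grows with the vocabulary of the pages. This is the overhead the abstract refers to.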
To avoid the cost of the cosine computation, the Simhash algorithm was developed. Its main idea is dimensionality reduction: the high-dimensional feature vector is mapped to a compact "fingerprint", and the fingerprints of pages are compared to determine whether their content is duplicated. Simhash can compare the similarity of many pages, and according to the literature it can perform duplicate checking over massive web content; the literature also proposes a training algorithm for it. First, the pages that the crawler has crawled and downloaded go through basic processing, such as stemming and stop-word removal, to obtain a feature set for the web content. Second, a hash function is applied to each feature word to obtain its hash value, and the hash values are compared to determine whether the content of the pages is the same.

The main purpose of this thesis is to remove duplicate pages from the web. The method is a fast removal algorithm based on Simhash: it computes the fingerprint of each page's content features and uses the Hamming distance between the fingerprints of two pages to decide whether the pages are similar. If they are, the method removes the duplicate quickly, which ultimately improves retrieval efficiency and lowers storage overhead.

Building on earlier work, this thesis proposes a fast Simhash-based algorithm for removing similar web page content. To verify its validity, the algorithm is compared with the Shingle algorithm. The proposed method is more accurate than the Shingle algorithm, has a clear advantage in running speed, and can support fast removal of web page text at massive scale. Experimental results show that the proposed method performs well. On this basis, the thesis gives specific research directions and research content.
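The fingerprint and Hamming distance steps described above can be sketched in Python as follows. This is a minimal sketch of the generic Simhash scheme and not the thesis's own algorithm. The 64-bit fingerprint width, the use of truncated SHA-256 as the per-word hash, word frequency as the feature weight, and the distance threshold of 3 are all assumptions made for illustration. Stemming is omitted, and stop words are supplied by the caller.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed fingerprint width


def feature_hash(word: str) -> int:
    """Hash one feature word to a 64-bit integer (truncated SHA-256, an assumption)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str, stop_words: frozenset = frozenset()) -> int:
    """Map a text to a single fingerprint that stays close for similar texts."""
    # Feature set: lowercase tokens minus stop words, weighted by frequency.
    weights = Counter(w for w in text.lower().split() if w not in stop_words)

    # Each word votes +weight on the bits its hash sets and -weight on the rest.
    totals = [0] * FINGERPRINT_BITS
    for word, weight in weights.items():
        hashed = feature_hash(word)
        for bit in range(FINGERPRINT_BITS):
            if (hashed >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight

    # A fingerprint bit is 1 where the positive votes won.
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Treat two texts as similar when their fingerprints differ in few bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Comparing two pages then costs one XOR and a bit count on fixed-size integers, regardless of how many feature words the pages contain. The threshold `max_distance` would need tuning on real data.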
Keywords/Search Tags: Simhash, Text Elimination, Feature Vectors, Hamming Distance, Hash