
Research And Implementation On Removing Duplicated Web Pages Of Search Engine System

Posted on: 2008-12-12    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Fan    Full Text: PDF
GTID: 2178360242956076    Subject: Computer application technology
Abstract/Summary:
The rapid popularization and development of the Internet confronts people with a sea of information, and extracting the truly important information from it has become essential. The search engine (here mainly the full-text retrieval system) is the tool that provides this function. However, the retrieval results returned by a search engine contain a large number of duplicated web pages, which mainly arise from reproduction among websites. These repetitive pages not only occupy network bandwidth but also waste storage resources. Users do not want to see a pile of search results with identical or nearly identical content, and truly useful results are often drowned in this redundant information and cannot easily be found. Effective removal of duplicated web pages improves retrieval accuracy and saves users time and effort, while the search system itself saves considerable storage and works more efficiently.

This paper studies the problem of removing duplicated web pages for a search engine. At present, effective methods for removing duplicated web pages are still few, and most of them are implemented at the server end, meaning that duplicates are eliminated while web pages are being collected. The commonly used methods are based on identical URLs, on clustering, on feature codes, and on signatures. In the clustering-based method, a text is represented as a vector in the vector space model and is then clustered or classified; computing the angles between vectors has high computational complexity and takes up considerable processing time. Methods based on extracting feature codes from web pages can remove duplicated pages more effectively, but they still have difficulty resisting the noise introduced when pages are reproduced.

Differing from previous work, this paper divides the task of removing duplicated web pages into two parts: the server end and the client end. Based on a review of a large number of duplicated web pages, the paper subdivides them into pages with identical content and similar pages, and removal of these two kinds of duplicates is carried out separately at the server end and the client end.

We propose a new feature-code based method for removing duplicated web pages at the server end. Web texts are identified by primary codes and auxiliary codes in order to make the best use of a page's structural features: the primary codes mark the paragraph-structure information, while the auxiliary codes mark the text information of the page. The primary codes are clustered first, and the auxiliary codes are then matched. Experiments show that the algorithm is effective.
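The abstract gives no code, so the following is only a minimal Java sketch of the two-level feature-code idea, under assumptions of our own: the primary code is taken to be the sequence of paragraph lengths (a stand-in for paragraph-structure information) and the auxiliary code the first and last character of each paragraph (a stand-in for text information). The class FeatureCodeSketch and its method names are hypothetical, not from the thesis.

import java.util.*;

/**
 * Minimal sketch of a two-level feature-code comparison.
 * Assumed encodings, not the thesis's actual ones: primary code = paragraph
 * lengths; auxiliary code = first and last character of each paragraph.
 */
public class FeatureCodeSketch {

    /** Primary code: marks the paragraph-structure information. */
    static List<Integer> primaryCode(String[] paragraphs) {
        List<Integer> code = new ArrayList<>();
        for (String p : paragraphs) {
            code.add(p.trim().length());
        }
        return code;
    }

    /** Auxiliary code: marks the text information of the page. */
    static String auxiliaryCode(String[] paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (String p : paragraphs) {
            String t = p.trim();
            if (!t.isEmpty()) {
                sb.append(t.charAt(0)).append(t.charAt(t.length() - 1));
            }
        }
        return sb.toString();
    }

    /** Two pages are treated as duplicates when both codes agree. */
    static boolean isDuplicate(String[] pageA, String[] pageB) {
        return primaryCode(pageA).equals(primaryCode(pageB))
                && auxiliaryCode(pageA).equals(auxiliaryCode(pageB));
    }

    public static void main(String[] args) {
        String[] a = {"Search engines index the web.", "Duplicates waste storage."};
        String[] b = {"Search engines index the web.", "Duplicates waste storage."};
        System.out.println(isDuplicate(a, b));   // true
    }
}

In a full system the primary codes would first be grouped (clustering), and auxiliary codes would be matched only within each group, which keeps the more expensive text comparison local to structurally similar pages.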
To provide intelligent, personalized, and customizable functions for search systems, an intelligent agent model at the client end is proposed. After analyzing the characteristics of duplicated news web pages, a new method based on matching the context of keywords is proposed for the client end. The algorithm exploits the high rate of repetition and reproduction among news websites. The user's search keywords generally characterize the user's intention, so matching the keyword contexts can be used to determine whether two web pages are essentially the same. The algorithm uses fuzzy matching to resist web page noise, and adjustable fuzzy factors and overlap factors are introduced to meet different requirements; satisfactory results are obtained.

To verify the algorithms and compare their results, a prototype search engine system was built on Windows in the Java language using the Lucene toolkit. Experimental results show that the two methods proposed in this paper achieve higher recall and precision, and lower miss-deletion and false-deletion rates. After further improvement, these algorithms are expected to find practical application. Finally, directions for further research are given.
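The client-end keyword-context matching can likewise be sketched in Java, again under assumptions rather than as the thesis's actual algorithm: the context is taken to be a fixed character window around each occurrence of a query keyword, the fuzzy factor bounds the tolerated character mismatch between two contexts, and the overlap factor is the fraction of contexts that must match before two result pages are treated as duplicates. The class name, window size, and factor values are illustrative only.

import java.util.*;

/** Sketch: decide whether two result pages duplicate each other by comparing
 *  the text surrounding the user's search keywords. WINDOW, FUZZY_FACTOR and
 *  OVERLAP_FACTOR are illustrative, adjustable parameters. */
public class KeywordContextMatch {

    static final int WINDOW = 20;              // characters kept on each side of a keyword
    static final double FUZZY_FACTOR = 0.2;    // tolerated character-mismatch ratio
    static final double OVERLAP_FACTOR = 0.8;  // required ratio of matching contexts

    /** Collect the context window around every occurrence of a keyword. */
    static List<String> contexts(String text, String keyword) {
        List<String> out = new ArrayList<>();
        int i = text.indexOf(keyword);
        while (i >= 0) {
            int from = Math.max(0, i - WINDOW);
            int to = Math.min(text.length(), i + keyword.length() + WINDOW);
            out.add(text.substring(from, to));
            i = text.indexOf(keyword, i + 1);
        }
        return out;
    }

    /** Fuzzy comparison: two contexts match if few enough characters differ. */
    static boolean fuzzyEquals(String a, String b) {
        int len = Math.min(a.length(), b.length());
        int diff = Math.abs(a.length() - b.length());
        for (int i = 0; i < len; i++) {
            if (a.charAt(i) != b.charAt(i)) diff++;
        }
        return diff <= FUZZY_FACTOR * Math.max(a.length(), b.length());
    }

    /** Pages are duplicates if enough keyword contexts match fuzzily. */
    static boolean isDuplicate(String pageA, String pageB, List<String> keywords) {
        int total = 0, matched = 0;
        for (String kw : keywords) {
            for (String ca : contexts(pageA, kw)) {
                total++;
                for (String cb : contexts(pageB, kw)) {
                    if (fuzzyEquals(ca, cb)) { matched++; break; }
                }
            }
        }
        return total > 0 && matched >= OVERLAP_FACTOR * total;
    }

    public static void main(String[] args) {
        String a = "Lucene is a search library. Lucene powers many engines.";
        String b = "Lucene is a search library. Lucene powers many engines!";
        System.out.println(isDuplicate(a, b, List.of("Lucene")));   // true: pages differ only in noise
    }
}

Using the query keywords as anchors keeps the comparison cheap at the client end, since only short context windows, rather than full page texts, need to be matched.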
Keywords/Search Tags: search engine, removal of duplicated web pages, client agent, Lucene