| With the explosive growth of information on the Web, the Internet has become an important source of information. Besides its main content, a web page typically contains navigation bars, advertisements, copyright notices, questionnaires, and related links. This material is usually not what the reader is looking for and is referred to as "web noise". When people retrieve information through network retrieval software such as search engines, they want the results to be closely related to their search conditions (keywords, etc.) and to contain little or no web noise. Identifying and eliminating web noise has therefore become an important research topic in network information retrieval in recent years. This paper first introduces the concepts and architecture of web pages, then discusses and analyzes existing methods for identifying and eliminating web noise, and on that basis proposes a new method. The basic idea is to generate a DOM tree from the content of the page, apply a set of rules to the information in the DOM tree to identify candidate noise, and build a representation model of the suspicious web noise. During information retrieval, the information in this model is used in a vector space model (VSM) similarity calculation, and the similarity results determine which suspicious noise is finally removed from the page. 
This paper analyzes the specific web noise identification method, the process of forming the suspicious web noise representation model, the specific algorithm, the similarity calculation, and the selection of thresholds. The author implements the proposed noise identification and elimination method on a Heritrix + Lucene framework, designs a simulation environment on that basis, and carries out simulation experiments on real web pages. The experiments show that the proposed method for identifying and eliminating web noise is feasible and effective and that, compared with other similar methods, it improves both accuracy and efficiency. |
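
The confirmation step described above, in which suspicious blocks are compared with the search conditions by VSM similarity and removed when they score below a threshold, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`term_vector`, `cosine_similarity`, `confirm_noise`), the raw term-frequency weighting, the simple English tokenizer, and the threshold value of 0.1 are all assumptions made for illustration. The abstract does not state which weighting or threshold the author used.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumed weighting) from word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def confirm_noise(query, suspicious_blocks, threshold=0.1):
    """Split suspicious blocks into (kept, removed) by similarity to the query.

    The threshold is a hypothetical value chosen for this example only.
    """
    query_vec = term_vector(query)
    kept, removed = [], []
    for block in suspicious_blocks:
        score = cosine_similarity(query_vec, term_vector(block))
        (kept if score >= threshold else removed).append(block)
    return kept, removed


if __name__ == "__main__":
    blocks = [
        "Methods for web noise elimination in pages",
        "Copyright 2010 all rights reserved",
    ]
    kept, removed = confirm_noise("web noise elimination", blocks)
    print("kept:", kept)
    print("removed:", removed)
```

In this sketch a block marked as suspicious is kept only if it shares enough vocabulary with the query. A real system built on Lucene would more likely rely on the library's own scoring than on a hand-written cosine measure.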