| With the rapid growth of Internet technology, the Internet has become the main source of information for most people, and the volume of information online is growing explosively. Applications such as data mining and information retrieval help users obtain useful information quickly and accurately from these vast sources. A webpage often contains many kinds of information, however, and much of it is unrelated to the page's topic. For example, a webpage usually contains noise such as navigation bars and advertising links. This noise not only degrades the user's reading experience but can also cause the page's topic to drift, reducing the accuracy and speed of data mining. Research into efficient and practical noise filtering techniques is therefore significant both for data mining and web information retrieval and for improving the user's reading experience. Because webpages have many distinctive characteristics, filtering noise from web information is more complicated than traditional text information extraction, and it poses new research challenges. To address this problem, this paper proposes a new algorithm for identifying and filtering webpage noise. First, we analyze the current mainstream webpage purification methods. Second, we analyze the characteristics of noise in news pages from representative major web portals and, on this basis, propose a noise recognition algorithm based on the visual properties of the page and the content rules of noise. Third, to guarantee filtering accuracy without affecting the user's reading, we propose an algorithm based on WVP_DOM tree structural similarity that filters webpage noise without deforming the page layout. Finally, we test the effectiveness and generality of the method on a Web proxy system. Experiments show that the method removes noise from webpages more effectively while keeping the page free of deformation, and that it generalizes well. |