| With the explosive development of Internet technology, the amount of data carried by web pages is growing day by day. The data on an ordinary web page generally consists of two parts: content blocks and noise blocks. Noise blocks typically include navigation menus, advertisement bars, pop-up ads, copyright links, and so on. Noise data accounts for almost half of a web page, and this proportion is still growing. The sustained growth of noise data not only makes relevant information harder to obtain but also prevents users from reaching useful information quickly and efficiently, so removing this unwanted noise from web pages quickly and effectively is particularly important. Several approaches to web page denoising exist, including methods based on web page templates, on visual information, and on the DOM tree. This paper denoises topic web pages based on the DOM tree structure. In previous DOM-tree-based work, most researchers divide DOM nodes into types according to a set of rules and then identify noise nodes by node type. However, classifying DOM nodes by a single factor can misclassify nodes, which directly affects the subsequent noise removal. In addition, an analysis of second-level news pages from several major domestic portals shows that topic web pages have the following characteristics: a prominent and obvious theme, a large amount of plain text, and relatively few links and images. To address the shortcomings of previous research, and taking into account the structural, textual, and tag characteristics of topic web pages, this paper constructs an improved DOM tree model on top of the traditional DOM tree. Based on this improved model, a web page noise removal method is 
introduced for topic web pages. The main contents are as follows: (1) HTML tags are divided into topic tags and non-topic tags according to their relevance to the theme and the granularity of the nodes. To capture the degree of semantic association between tags and the theme, the link characteristic value, the node text length, the number of text nodes, and the number of pictures, the custom attributes tagDeg, linkVal, textLen, textNum, and picNum are added to each node when the DOM tree is built. (2) An improved DOM tree model is proposed. First, the HTML document is parsed into a DOM tree; the tree is then traversed and the custom attributes are added to each node. When nodes are merged, the tagDeg value of the new node is the sum of the tagDeg values of all merged nodes, and linkVal is calculated in the same way. Finally, an improved DOM tree model that contains only topic block nodes is constructed. (3) A web page denoising method based on the improved DOM tree model is presented. The method consists of three main steps: cleaning the web page, constructing the improved DOM tree, and removing noise from the improved DOM tree. In the last step, the custom attribute values of each node are analyzed against set thresholds to identify and remove noise nodes, thereby denoising the page. Experiments on an answer set built from many web pages show that the proposed method can effectively eliminate most of the noise in topic web pages, so it has good practical value and application prospects. |
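
The abstract does not give the attribute formulas or threshold values, so the following is only a minimal sketch of the general idea: annotate each DOM node with the five custom attributes, then prune block nodes by threshold. Everything beyond the attribute names is an assumption made for illustration. Here `tagDeg` counts topic tags in a subtree, `linkVal` is the share of a node's text that sits inside links, and `TOPIC_TAGS`, `BLOCK_TAGS`, `max_link`, `min_text`, and the pruning rule are hypothetical. The cleaning and node-merging steps are omitted.

```python
from html.parser import HTMLParser

# All three tag sets are assumptions for this sketch, not the paper's lists.
TOPIC_TAGS = {"p", "h1", "h2", "h3", "article"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "nav", "footer", "header", "aside"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        # Custom attributes named in the abstract.
        self.tagDeg = self.textLen = self.textNum = self.picNum = 0
        self.linkVal = 0.0


class TreeBuilder(HTMLParser):
    """Builds a simple node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.cur = node

    def handle_endtag(self, tag):
        # Walk up to the matching open tag; ignore stray closing tags.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.text = (self.cur.text + " " + data.strip()).strip()


def annotate(node):
    """Fill the attributes bottom-up; returns link-text length of the subtree."""
    node.tagDeg = 1 if node.tag in TOPIC_TAGS else 0
    node.textLen = len(node.text)
    node.textNum = 1 if node.text else 0
    node.picNum = 1 if node.tag == "img" else 0
    link_len = 0
    for child in node.children:
        link_len += annotate(child)
        node.tagDeg += child.tagDeg
        node.textLen += child.textLen
        node.textNum += child.textNum
        node.picNum += child.picNum
    if node.tag == "a":
        link_len = node.textLen  # all text under an anchor is link text
    node.linkVal = link_len / node.textLen if node.textLen else 0.0
    return link_len


def is_noise(node, max_link, min_text):
    # Assumed rule: link-dominated blocks, or short blocks without topic tags.
    return node.linkVal > max_link or (node.textLen < min_text and node.tagDeg == 0)


def denoise(node, max_link, min_text):
    node.children = [c for c in node.children
                     if not (c.tag in BLOCK_TAGS and is_noise(c, max_link, min_text))]
    for child in node.children:
        denoise(child, max_link, min_text)


def extract_text(node):
    parts = [node.text] + [extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def remove_noise(html, max_link=0.5, min_text=20):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    denoise(builder.root, max_link, min_text)
    return extract_text(builder.root)
```

Calling `remove_noise(page_html)` returns the text of the blocks that survive pruning. In this sketch only block-level containers are judged, so that short inline elements inside a content block are not discarded on their own. The threshold defaults are placeholders and would need tuning on real pages.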