| The Internet contains a large number of webpages generated from templates, and webpages generated from the same template tend to share certain similarities. Using webpage clustering to measure the similarity between webpages and to organize similar webpages effectively helps in formulating more targeted information extraction strategies and improves the efficiency of information acquisition. In addition, webpage clustering is often used to identify and prevent malicious websites. This paper studies webpage clustering methods, and designs and implements a massive webpage labeling analysis system whose core is a webpage clustering method based on structural similarity, providing technical support for research on large-scale similar webpages. The core of a webpage clustering method based on structural similarity is the calculation of DOM structural similarity between webpages. Existing clustering methods based on structural similarity have several shortcomings, such as insufficient consideration of DOM structural features and difficulty in balancing computational efficiency against clustering effect. Therefore, this paper proposes a new DOM structural similarity measure, the hierarchical distribution distance, and designs a webpage clustering method based on it. The hierarchical distribution distance not only covers the property, distribution, and quantity features of DOM tree nodes, but also computes hierarchical features in a more concise form. It also measures the similarity of DOM structures from an overall perspective. As a result, the clustering effect is preserved while computational efficiency improves. The experimental results show that computing webpage structural similarity with the hierarchical distribution distance has lower computational complexity, and that the webpage clustering method based on the
hierarchical distribution distance better balances clustering effect and execution efficiency. Taking the clustering method based on the hierarchical distribution distance as its core, this paper designs and implements a webpage labeling analysis system. First, this paper carries out the requirements analysis and outline design of the system. The main functional goal of the system is to automatically crawl and process webpage data and to provide visual query and display functions. The system is divided into three layers: the data acquisition and analysis layer, the data processing layer, and the data display layer, which handle webpage data collection, processing, and display, respectively. Second, during system implementation, webpage text, structure, and image data are collected by multi-threaded tasks. Then, topic prediction based on word frequency and the clustering method are used to process the webpage data, and the prediction and clustering results serve as webpage labels. Next, the system implements query and display of webpage results in the form of an icon file list, and designs a method for visualizing webpage structural features that converts complex structural features into more intuitive feature images. Finally, this paper tests each functional module of the system. The test results show that each module correctly performs its designed function. With 930,000 webpages and about 610 GB of webpage image data, the query response time is about 0.312 seconds and the result loading time is about 0.293 seconds. These results indicate that the system can label and analyze massive numbers of webpages and provide effective support for research on large-scale similar webpages. |