| With the rapid development of information technology, the volume of web data on the Internet has grown explosively, and the web has become one of the main channels through which people obtain information. However, the large number of near-duplicate (approximately mirrored) web pages on the Internet has become the biggest obstacle to finding useful information quickly. To address this problem, researchers have proposed a variety of near-duplicate web page deduplication algorithms, but the resistance of these algorithms to web page noise is unsatisfactory. In particular, for highly time-sensitive news pages, these algorithms often produce misjudgments, so their stability is low. To solve these problems, this paper proposes two improved algorithms based on Simhash. The first is a Simhash-based near-duplicate web page deduplication algorithm using long-sentence extraction, which addresses the sensitivity of existing algorithms to noise. Commonly used deduplication algorithms include a feature extraction step; when the page text contains noise, noise vocabulary enters the extracted feature set, which lowers the accuracy and recall of the algorithm. An analysis of web page noise shows that noise text is generally short. By extracting the long sentences of the page text and restricting the segmentation of feature words to them, the algorithm effectively avoids the noise in the page and reduces its adverse effect. The second is a Simhash web page deduplication algorithm based on a special weight ratio, which addresses the frequent misjudgment of highly time-sensitive news pages during deduplication. Simhash assigns weights to feature words by simple word-frequency statistics, and news pages of the same category often have similar body text that differs only in details such as time and place. As a result, the extracted feature words and their weights are similar, which ultimately leads to misjudgment. The special-weight-ratio algorithm takes core vocabulary into account: it assigns an extra weight ratio to the core words of a news item and increases their influence on the text fingerprint, so that two pages whose core words differ substantially can be distinguished. Finally, the two algorithms proposed in this paper are applied to the web page deduplication module of the enterprise dynamic information system of FTA, and practical use has shown them to be sound and effective. |
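
The two ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and the abstract gives none of these details: the names `long_sentences`, `weighted_terms`, `core_words`, `core_boost`, and `min_words` are hypothetical, MD5 is an arbitrary choice of token hash, the 64-bit fingerprint width is assumed, and whitespace word splitting stands in for whatever word segmentation the paper uses (Chinese text would need a real segmenter).

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def long_sentences(text: str, min_words: int = 8) -> list:
    """Keep sentences with at least min_words words; shorter ones count as noise."""
    sentences = re.split(r"[.!?。！？\n]+", text)
    return [s.strip() for s in sentences if len(s.split()) >= min_words]


def weighted_terms(text: str, core_words=frozenset(),
                   core_boost: float = 3.0, min_words: int = 8) -> dict:
    """Word-frequency weights over long sentences, boosted for core words."""
    counts = Counter()
    for sentence in long_sentences(text, min_words):
        counts.update(w.lower() for w in re.findall(r"\w+", sentence))
    return {w: c * (core_boost if w in core_words else 1.0)
            for w, c in counts.items()}


def simhash(weights: dict) -> int:
    """Standard Simhash: weighted per-bit vote over the token hashes."""
    totals = [0.0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In this sketch, two pages would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a small threshold, which is a tuning choice. Short fragments such as navigation links never reach the fingerprint, and passing the time and place words of a news item as `core_words` gives them more votes per bit, so pages that differ only in those words are more likely to end up with distant fingerprints.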