| Phishing website detection has become a contest between phishing attacks and phishing detection. Because phishing attack techniques are constantly being upgraded, detection research needs to seek a new angle. In terms of feature calculation, existing detection methods compute the similarity between a phishing website and a single suspected target webpage, which makes the criteria for judging phishing websites too simple. In terms of webpage feature extraction, existing methods do not guarantee independence between a webpage and its features, and they focus only on the webpage itself. These two weaknesses leave phishing detection vulnerable to reverse detection, which reduces detection efficiency and accuracy. Therefore, anti-detection research on phishing website detection is carried out from two perspectives: reducing the correlation between webpages and their features based on human visual behavior, and making feature calculation more complex. The main contents are as follows: (1) A phishing website detection algorithm based on improved TCD feature-space conversion is proposed. The Texton Correlation Descriptor (TCD), which can express the external features of a webpage, is improved to better suit the characteristics of phishing detection. First, the underlying texture feature extraction method in the TCD is improved. Second, a method for selecting the neighborhood based on Euclidean distance and a position-weighted double cross window is proposed to improve the feature correlation statistics. Then, based on the spatial relationship, the set of imaged webpages is mapped into a new feature space, in which the correlation between webpages and webpage features is separated to achieve the anti-detection purpose. Finally, the TCD operator is further improved by using the similarity relationships among a large number of imaged webpages. Experiments show
that the improved TCD operator achieves satisfactory stability and accuracy when applied to phishing website identification. (2) A phishing website detection algorithm based on a structured document model is proposed. Drawing on human visual behavior and on the relationship between a webpage's internal code features and its layout, a document based on the main visual area of the webpage (DMVA) is used to detect phishing websites. First, the merge algorithm between child nodes (MABC) is used to generate the visual segmentation of the webpage and a hierarchical DOM tree. Second, the user's visual behavior and the structure of the hierarchical DOM tree are used to extract the main visual area of the webpage. Third, the text information in the hierarchical main visual area is obtained and used to construct the DMVA, which reconstructs the webpage and reduces the correlation between the webpage and its features. Finally, a related-website set is proposed, and phishing sites are detected by calculating the similarity between the DMVA of the website under test and the DMVAs of the webpages in the related-website set. Experiments show that the DMVA-based phishing website detection algorithm achieves better detection accuracy. (3) A phishing detection model based on improved TCD image retrieval and classification is proposed. It combines the TCD operator's ability to express the external features of a webpage with the DMVA document's coverage of the webpage's internal features. First, the webpage is imaged. Second, a visually layered TCD operator containing visual information is constructed based on the DMVA model. Then, a four-layer TCD-PLSA probabilistic latent semantic model is constructed to classify the webpage. Finally, webpage retrieval and feature conversion are performed within the corresponding class, and the similarity between webpages is calculated to determine whether the website under test is a phishing website. The off-line
training part of the TCD-PLSA model involves large-scale data and is designed to run in parallel using MapReduce. Experiments show that the phishing detection model based on improved TCD image retrieval and classification has good stability and accuracy. |
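
The final step of algorithm (2), comparing the DMVA of the website under test with the DMVAs of the pages in the related-website set, can be illustrated with a small sketch. The abstract does not name the similarity measure, so this sketch substitutes plain cosine similarity over term counts as a stand-in. The function names (`term_counts`, `cosine_similarity`, `best_match`) and the sample texts are hypothetical, and each DMVA is assumed to have already been reduced to a plain text string.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(suspect_text, related_texts):
    """Return (name, score) of the related page most similar to the suspect.

    related_texts maps a page name to its DMVA text (assumed format).
    Returns (None, 0.0) when the related set is empty.
    """
    if not related_texts:
        return None, 0.0
    suspect = term_counts(suspect_text)
    scores = {name: cosine_similarity(suspect, term_counts(text))
              for name, text in related_texts.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical DMVA texts for two pages in a related-website set.
related = {
    "bank-login": "sign in to your bank account user name password",
    "news-home": "latest headlines weather sports politics",
}
name, score = best_match("bank account sign in password required", related)
print(name, round(score, 3))
```

Comparing against every page in the set, instead of against one suspected target, reflects the abstract's point that single-target similarity makes the judging criteria too simple. How the resulting score is turned into a phishing decision (for example, a threshold) is not specified in the source and is left open here.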