| Since the beginning of the 21st century, the rapid development of the Internet has brought us into the era of big data. While people enjoy improved living standards, they also face a series of new problems, especially information security issues, so network security cannot be ignored. Phishing is a typical fraud method used to deceive netizens for profit, and it causes serious losses to their property. Effectively curbing phishing websites is therefore an important guarantee of network security. Generally, phishing attackers use fake URLs to lure netizens to phishing websites and then commit fraud, so accurately and efficiently identifying phishing URLs has become a research hotspot and an important issue in network information security. Scholars at home and abroad have studied defenses against phishing websites in increasing detail, but this work still needs to be improved and deepened. The emergence of deep learning has played an extremely important role in phishing website identification, greatly improving detection efficiency and accuracy. However, a deep learning model is like a black box: given an input, it returns a decision. Although the results can be impressive, no one knows exactly what the decision is based on or whether it is reliable. Because the internal operation of the model is unknown, the further development and application of deep learning are hindered, so research on its interpretability is urgently needed and has become a hot and difficult topic. Phishing URLs generally survive for only a short time and vary in many ways. Manually extracted URL (Uniform Resource Locator) features often depend on human prior knowledge and may not be effective in distinguishing phishing URLs, so detection methods built on them are neither very effective nor efficient. Therefore, this article uses a
detection method that directly learns the URL character sequence without manually extracting features. The specifics are as follows. First, web crawler technology was used to crawl 5,000 phishing URLs from the blacklist database on the https://openphish.com website, and 5,000 normal URLs were crawled by searching a search engine for the brands corresponding to the phishing URLs. The 10,000 positive and negative labeled URL samples were then converted into a two-dimensional matrix using the ASCII code table, and a neural network embedding layer was used to construct the word vectors. Finally, the data were fed to several recurrent neural network models for training and comparison. The bidirectional Gated Recurrent Unit (BiGRU) neural network was found to learn serialized features and long-term dependencies and to capture the implicit dependencies between URL character sequences; when used for phishing website identification, it can greatly improve the accuracy and recall of detection. In addition, to study the interpretability of the model and find the basis for the BiGRU neural network's classifications, this article first applies SHAP (SHapley Additive exPlanations), proposed by Lundberg and Lee in the 2017 paper "A unified approach to interpreting model predictions", to the BiGRU neural network model. Then LIME (Local Interpretable Model-Agnostic Explanations), an interpretation method proposed at KDD 2016, a top data mining conference, is used to study the interpretability of the same model, and the two interpretation methods are compared. The following conclusions are obtained through research and analysis: 1. In terms of classification performance, the bidirectional gated recurrent unit neural network achieves higher accuracy than the other recurrent neural networks when used to identify phishing websites, reaching more than 98%. 2. Regarding the basis of
model classification: after learning from a large number of URLs, the neural network bases its judgment mainly on the features of individual characters or strings. 3. Regarding the characteristics of this basis: these strings differ in length and may contain one another, but each has some effect on determining whether a URL is normal or phishing, and each is assigned a different feature contribution. 4. Comparing the performance of the two interpretation methods: in general, SHAP takes the correlation between features into account and has a wider scope of application, but in the setting of this article, where no features are extracted manually, LIME gives better interpretation results than SHAP. |
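
The encoding step mentioned above (converting URL strings into a two-dimensional matrix through the ASCII code table) might look roughly like the following. This is only a minimal sketch under stated assumptions, not the authors' implementation: the fixed length `MAX_LEN`, the padding value `PAD`, the function names `encode_url` and `encode_batch`, and the decisions to truncate long URLs and to map non-ASCII characters to the padding value are all hypothetical choices made for illustration.

```python
from typing import List

MAX_LEN = 200  # assumed fixed sequence length; not specified in the source
PAD = 0        # assumed padding value (ASCII NUL does not occur in URLs)


def encode_url(url: str, max_len: int = MAX_LEN) -> List[int]:
    """Map one URL to a fixed-length list of ASCII codes.

    Characters beyond max_len are dropped; characters outside the
    7-bit ASCII range are replaced by PAD (an assumption of this sketch).
    """
    codes = [ord(ch) if ord(ch) < 128 else PAD for ch in url[:max_len]]
    # Right-pad short URLs so every row has the same length.
    return codes + [PAD] * (max_len - len(codes))


def encode_batch(urls: List[str], max_len: int = MAX_LEN) -> List[List[int]]:
    """Stack encoded URLs into a 2-D matrix: one row per URL."""
    return [encode_url(u, max_len) for u in urls]


# Usage: two sample URLs become a 2 x 16 integer matrix.
matrix = encode_batch(["http://a.com", "https://example.org/login"], max_len=16)
```

Each row corresponds to one URL and each column to a character position, so a matrix of this shape could be passed to an embedding layer with a vocabulary of 128 codes before the recurrent layers.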