| With the rapid development of the Internet industry, attacks delivered through malicious web pages are increasingly common and seriously threaten cyberspace security. Deep-learning-based malicious web page detection methods can detect obfuscated URL links, but URL text differs from regular text: it is very noisy, which degrades the classification performance of neural network models. To improve neural network classification of URL links, and to extend detection to short links, IP addresses, and other URL links without lexical features, this thesis proposes a malicious URL link detection algorithm based on hybrid embedding and a malicious web page detection algorithm based on web page text classification. Based on these two algorithms, a malicious web page detection system is designed and implemented. The main research work of this thesis is summarized as follows: (1) A malicious URL link detection algorithm based on hybrid embedding is proposed. To address the irregularity of URL text, a data preprocessing method is proposed; to address the large number of OOV (out-of-vocabulary) words produced by URL text segmentation, a malicious URL link detection algorithm based on hybrid embedding is proposed. The method uses a highway network to combine character-level embeddings and word embeddings, uses a convolutional neural network to extract text features from the URL link, and then classifies with the softmax function. The algorithm is compared with other commonly used text classification algorithms to demonstrate its effectiveness. (2) A malicious web page detection algorithm based on web page text classification is proposed. Because URL links without lexical features, such as short links and IP addresses, cannot be detected by malicious URL link detection algorithms, a malicious web page detection algorithm based on web page text classification is proposed. This algorithm extracts the text information 
from the web page and then classifies it with a neural network model. Because web page text is scattered and incoherent, the algorithm uses a CNN-Attention-BiLSTM neural network to extract features from the full text and thereby classify the web page text. The algorithm is compared with other commonly used neural network models to demonstrate its effectiveness. (3) A malicious web page detection system is designed and implemented. To demonstrate the effectiveness of the two algorithms proposed in this thesis, a malicious web page detection system is designed and implemented that lets users detect malicious web pages in real time. The system consists of a browser plug-in, a home page, and a backend. When a user visits a malicious web page, the plug-in warns the user immediately, and the administrator can carry out system configuration tasks on the home page. According to the experimental results, the hybrid-embedding malicious URL link detection algorithm achieves a recognition rate of 98.9% for malicious URL links, and the malicious web page detection algorithm based on web page text classification achieves a recognition rate of 96.8% for malicious web pages. The malicious web page detection system built on these two algorithms can efficiently and accurately detect an increasingly diverse range of malicious web pages and can effectively protect the security of users’ information and property. In short, the two malicious web page detection algorithms proposed in this thesis are effective, and the detection system built on them is effective and practical. |
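
The hybrid embedding step in item (1) can be illustrated with a small sketch. It is not the thesis implementation: the tokenizer, the toy dimension `DIM`, the random (untrained) weights, the averaging of character vectors, and the choice to concatenate the word and character vectors before a single highway layer are all assumptions made for illustration, and the class name `HybridEmbedder` is hypothetical. The point it shows is that an out-of-vocabulary token still receives a distinctive vector, because its character-level part survives even when the word-level part falls back to a shared unknown-word vector.

```python
import math
import random
import re

DIM = 4  # toy embedding size (assumption; real models use far larger values)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (illustrative preprocessing)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def random_vector(rng, dim):
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, w_h, b_h, w_t, b_t):
    """Highway layer: y = t * relu(W_h x + b_h) + (1 - t) * x, t = sigmoid(W_t x + b_t)."""
    h = [max(0.0, a + b) for a, b in zip(matvec(w_h, x), b_h)]
    t = [sigmoid(a + b) for a, b in zip(matvec(w_t, x), b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


class HybridEmbedder:
    """Hypothetical helper: word + character embeddings merged by one highway layer."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)
        self.word = {w: random_vector(rng, DIM) for w in vocab}
        self.unk = random_vector(rng, DIM)  # shared vector for OOV words
        self.char = {c: random_vector(rng, DIM) for c in ALPHABET}
        n = 2 * DIM  # size of the concatenated [word ; char] vector
        self.w_h = [random_vector(rng, n) for _ in range(n)]
        self.w_t = [random_vector(rng, n) for _ in range(n)]
        self.b_h = [0.0] * n
        self.b_t = [-1.0] * n  # negative gate bias favours carrying the input through

    def embed_token(self, token):
        word_vec = self.word.get(token, self.unk)
        char_vecs = [self.char[c] for c in token]
        # Average the character vectors (a stand-in for a learned character encoder).
        char_vec = [sum(col) / len(char_vecs) for col in zip(*char_vecs)]
        return highway(word_vec + char_vec, self.w_h, self.b_h, self.w_t, self.b_t)

    def embed_url(self, url):
        return [self.embed_token(t) for t in tokenize(url)]


if __name__ == "__main__":
    embedder = HybridEmbedder(vocab=["http", "login", "example", "com", "verify", "id"])
    vectors = embedder.embed_url("http://paypa1-login.example.com/verify?id=7")
    print(len(vectors), "tokens,", len(vectors[0]), "dimensions each")
```

In the pipeline described above, the resulting sequence of token vectors would then be passed to a convolutional network and a softmax classifier; those stages are omitted here because they require trained weights.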