Dark Web Classification Based On Image And Text Fusion Features

Posted on: 2022-10-26 | Degree: Master | Type: Thesis
Country: China | Candidate: M Z Li | Full Text: PDF
GTID: 2518306563476144 | Subject: Cyberspace security
Abstract/Summary:
The dark web is a hidden network that, unlike the surface web (i.e., the Internet), can only be accessed by special means. It was originally designed to protect the privacy of users' communications. However, because of its strong anonymity, the dark web has become a breeding ground for serious illegal and criminal activities, so accurately classifying illegal activities on the dark web is of practical significance. Current research on dark web content classification generally treats web page classification as a text classification task, which cannot make reasonable use of the rich structural features and image-text relationships in web pages. In addition, sparse text features, mixed category keywords, and unevenly distributed illegal activities make traditional web page text classification algorithms a poor fit. In response to these problems, this thesis takes the anonymous network Tor as its research object and proposes a dark web content classification method based on the fusion of images and text, tailored to the distinctive features of dark web pages. The main work and contributions of this thesis are as follows:

(1) Public dark web datasets keep text and image resources separate, and most of the websites they cover can no longer be visited. To address this, this thesis proposes an improved content crawling framework for the dark web that can capture the most recently published hidden services, annotate each picture with its source web page, and restore the complete structure of locally downloaded HTML files. The improved crawler better meets the needs of this thesis in analyzing web page structure features and in building a dark web dataset in which images and text correspond.

(2) To address the sparse text, mixed category keywords, and uneven distribution of illegal activities on the dark web, this thesis analyzes the characteristics of a large number of illegal dark web pages and finds that their pictures are representative. Based on the complementarity of web text and visual features, a classification method based on the fusion of images and text is proposed. For text data, three types of structural tags with a higher probability of containing feature words are extracted, and a text graph neural network model that incorporates the structural features of dark web pages is proposed. For image data, transfer learning is used to cope with the insufficient amount of illegal image data on the dark web. Based on the positional relationship between images and text, a filtering method for noise images in a web page is designed; it reduces the weight of irrelevant pictures so that they do not affect the classification results in the fusion decision-making stage. Finally, multiple strategies from the back-end fusion method are combined in the decision-making stage to improve the fault tolerance of the model.

(3) In the implementation phase, experiments are carried out with the text graph neural network and the noise-filtered image classifier proposed in this thesis, which achieve 90.1% and 93.8%, respectively. In the text classification comparison experiment, the graph neural network model, which considers the semantic association and structural information of the context, achieves an accuracy 4% higher than that of the traditional machine learning algorithm. The ensemble classifier obtained by fusing multiple features is compared experimentally with three dark web content classification methods that perform well in this field. The method in this thesis is more effective, reaching a classification accuracy of 99.7%.

The proposed method does not rely on external knowledge and places no high demands on the user's domain expertise. It runs using only the image and text resources of the dark web itself, which makes it portable to the identification of new illegal web pages. The fused model can achieve high-precision classification results for large-scale dark web data.
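
The decision-stage (back-end) fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say which fusion strategies are combined or how image weights are computed, so the weighted-average rule, the `text_weight` parameter, the relevance weights, the function names, and the class labels below are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Probs = Dict[str, float]  # class label -> probability


def weighted_image_vote(image_preds: List[Tuple[Probs, float]]) -> Probs:
    """Average per-image class probabilities using relevance weights.

    Each item is (class probabilities, weight). A low weight stands in for
    an image flagged as probable noise (hypothetical weighting scheme).
    """
    total = sum(weight for _, weight in image_preds)
    if total == 0:
        return {}
    classes = {label for probs, _ in image_preds for label in probs}
    return {
        label: sum(probs.get(label, 0.0) * weight for probs, weight in image_preds) / total
        for label in classes
    }


def late_fusion(text_probs: Probs, image_probs: Probs, text_weight: float = 0.5) -> Probs:
    """Combine text and image predictions by a weighted average (assumed strategy)."""
    if not image_probs:
        # Page has no usable images: fall back to the text classifier alone.
        return dict(text_probs)
    classes = set(text_probs) | set(image_probs)
    return {
        label: text_weight * text_probs.get(label, 0.0)
        + (1.0 - text_weight) * image_probs.get(label, 0.0)
        for label in classes
    }


def classify(text_probs: Probs, image_preds: List[Tuple[Probs, float]],
             text_weight: float = 0.5) -> str:
    """Return the label with the highest fused score."""
    fused = late_fusion(text_probs, weighted_image_vote(image_preds), text_weight)
    return max(fused, key=fused.get)
```

In this sketch, an image judged irrelevant contributes little to the page-level image vote because its weight is small, and a page without usable images falls back to the text prediction, which is one simple way a fused model could tolerate a failure in either modality.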
Keywords/Search Tags: Dark web, Fusion features, Webpage classification, Illegal activities, Tor