Research On Key Technologies For Tor Darknet Content Classification

Posted on: 2024-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Zhang
GTID: 2558307067973369
Subject: Computer technology
Abstract/Summary:
With the continuous development of network technology, content on the Internet has become increasingly diverse and large in scale. Meanwhile, cyberspace security problems have been increasing, especially on anonymous networks such as the Tor (The Onion Router) dark web, where illegal and malicious activities have become easier to carry out. Criminals often use jargon to conceal illegal acts, which makes cybercrime more difficult to regulate. To secure cyberspace and address information overload, an efficient and accurate content classification technology is needed. However, classifying web pages and other content on the Tor dark web is very challenging. The main problems are the inefficiency of dark web data collection, the immaturity of Chinese jargon recognition, and the low performance of classification on large amounts of data. The main work of this thesis is therefore as follows:

(1) To address inefficient data collection on the dark web, this thesis improves Tor network access by reducing the number of nodes that a Tor circuit passes through, which increases access speed. To avoid acquiring duplicate data, cuckoo filters are used to filter the collected data. On this basis, a distributed crawler is designed and implemented with the Scrapy framework, and the acquired data is stored in Elasticsearch. Experiments verify that the improved crawler system significantly speeds up dark web data acquisition while avoiding duplicates, saving 39.75% of the time compared with ordinary crawlers.

(2) For the Chinese jargon recognition task, this thesis proposes a recognition method based on the SCM (Semantics Comparison Model), with data preprocessing that takes into account characteristics of Chinese text such as part of speech and proper nouns. The jargon recognizer combines different features, shows significant advantages in Chinese jargon recognition, and achieves an accuracy of 87.66% in the experiments.

(3) For the text classification problem, this thesis proposes an information extraction method based on the LDA (Latent Dirichlet Allocation) topic model and Text-CNN (Text Convolutional Neural Network), aimed at research and practical work in the field of cybercrime. The method significantly reduces noisy data and improves model accuracy and execution efficiency. Experiments show that the scheme saves more than 90% of overhead and raises accuracy to 91.35%, which improves further to 94.88% after jargon is added.

The results of this thesis contribute to the field of darknet content classification and provide feasible solutions and practical insights for research and work in related fields.
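The abstract does not describe how the cuckoo filter is implemented, so the following is only a minimal sketch of the general technique applied to URL deduplication in a crawler. The names `CuckooFilter` and `should_crawl`, the 16-bit fingerprint, the bucket size of 4, and the example `.onion` URLs are all illustrative assumptions, not details taken from the thesis.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter; all parameters are assumptions, not the thesis's."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range
        # and makes the mapping between the two candidate buckets symmetric.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 16-bit fingerprint in the range 1..65535 (assumed size).
        return self._hash(b"fp:" + item.encode("utf-8")) % 65535 + 1

    def _index(self, item: str) -> int:
        return self._hash(item.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on
        # the current bucket index and the stored fingerprint.
        return (index ^ self._hash(fp.to_bytes(4, "big"))) % self.num_buckets

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Filter is too full; the last evicted fingerprint is dropped,
        # so a real system would resize or rebuild before reaching this point.
        return False


def should_crawl(seen: CuckooFilter, url: str) -> bool:
    """Return True the first time a URL is seen, False for probable duplicates."""
    if seen.contains(url):
        return False
    seen.add(url)
    return True


seen = CuckooFilter()
urls = [
    "http://example.onion/a",
    "http://example.onion/b",
    "http://example.onion/a",  # duplicate, should be skipped
]
fresh = [u for u in urls if should_crawl(seen, u)]
```

Like a Bloom filter, a cuckoo filter can report false positives, so a crawler built this way may occasionally skip a page it has not visited; unlike a Bloom filter, it can also support deletion. In a Scrapy-based crawler, a check of this kind would typically sit wherever requests are deduplicated before scheduling, though how the thesis integrates it is not stated.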
Keywords/Search Tags: Dark Web, Crawler, Jargons, Text Classification