Research And Implementation Of Source Code Authorship Identification Technology Based On Deep Learning

Posted on: 2024-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: Q Ding
Full Text: PDF
GTID: 2568306941484034
Subject: Cyberspace security
Abstract/Summary:
Source code authorship identification generally refers to the process of identifying the author of an anonymous piece of source code from a set of candidate identities, usually based on unique characteristics that reflect the author’s programming style. Source code authorship identification techniques are widely applied in malware homology detection, software copyright dispute resolution, code plagiarism detection, and similar tasks. At present, most studies depend on lexical and grammatical features extracted from the token sequence of the source code or from its abstract syntax tree. However, these two types of features are easily affected by code formatting and code obfuscation tools. In contrast, structural features at the semantic level are more resistant to obfuscation attacks. To address this problem, this paper proposes a source code authorship identification method that learns semantic structural features from the abstract syntax tree of the source code, using a graph neural network to identify the authorship of anonymous source code. The main research results are as follows:

Firstly, a method of extracting semantic features based on an enhanced abstract syntax tree is proposed. This method compensates for the obfuscation of the source code token sequence. After the source code is parsed into the original abstract syntax tree, additional reinforcement edges carrying context and control flow information are added. A graph neural network is then used to learn semantic features from the enhanced abstract syntax tree.

Secondly, a hierarchical graph learning model for abstract syntax trees based on the GraphSAGE algorithm is proposed. After the source code is transformed into an enhanced abstract syntax tree, Word2Vec is used to pretrain the node representations, and GraphSAGE convolution is then used to aggregate information from neighboring nodes. The graph nodes are filtered through a TopKPooling layer to reduce the size of the graph. The model adopts a hierarchical architecture: after multiple rounds of convolution, pooling, and global readout, the final output representation aggregates the structural information of the graph at multiple levels.

Thirdly, a source code authorship identification system is designed and implemented based on the proposed method in order to verify its effectiveness. The system mainly comprises modules for source code preprocessing, abstract syntax tree extraction, syntax tree processing, graph neural network learning, prediction, and output. The overall process of the system and each module are described in detail.

Finally, experiments on the whole model and on each module, conducted on multiple open datasets, show that the proposed feature extraction method based on the enhanced abstract syntax tree can effectively extract discriminative structural features. The proposed graph neural network model achieves performance comparable to that of traditional deep learning models on multiple open datasets. Because it relies on semantic-level structural features of the source code, however, the proposed method is more robust than the other solutions.
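The idea of an enhanced abstract syntax tree can be illustrated with a minimal sketch. The abstract does not specify which reinforcement edges the thesis adds or which programming language it targets, so everything below is an assumption made for illustration: Python's standard `ast` module stands in for the parser, "next_sibling" edges stand in for context information, and "next_stmt" edges between consecutive statements of a block stand in for control flow information. The function name `build_enhanced_ast` is hypothetical.

```python
import ast


def build_enhanced_ast(source):
    """Return (labels, edges) for a piece of Python source.

    labels[i] is the AST node type of node i (preorder numbering).
    edges holds (src, dst, kind) triples: "child" edges are the original
    AST, while "next_sibling" and "next_stmt" are the ASSUMED extra edges.
    """
    labels, edges = [], []

    def visit(node):
        idx = len(labels)
        labels.append(type(node).__name__)
        child_idx = []   # indices of this node's children, in order
        local = {}       # id(child object) -> index, used for body lookup
        for child in ast.iter_child_nodes(node):
            c = visit(child)
            edges.append((idx, c, "child"))
            child_idx.append(c)
            local[id(child)] = c
        # Assumed "context" edges: link each child to its right neighbor.
        for left, right in zip(child_idx, child_idx[1:]):
            edges.append((left, right, "next_sibling"))
        # Assumed "control flow" edges: link consecutive statements in a
        # block (this ignores branches, loops back-edges, and jumps).
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for first, second in zip(body, body[1:]):
                edges.append((local[id(first)], local[id(second)], "next_stmt"))
        return idx

    visit(ast.parse(source))
    return labels, edges


if __name__ == "__main__":
    labels, edges = build_enhanced_ast("x = 1\nif x:\n    y = x + 1\n")
    for src, dst, kind in edges:
        if kind != "child":
            print(f"{labels[src]} -> {labels[dst]} ({kind})")
```

The resulting edge list is an ordinary typed graph, which is the form a graph neural network library would typically consume; identifier names never enter the node labels here, which is one way such a representation could stay stable when variables are renamed.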
Keywords/Search Tags:source code authorship identification, graph neural network, abstract syntax tree, authorship attribution