Research And Implementation Of Source Code Authorship Identification Technology Based On Deep Learning

Posted on:2024-08-04

Degree:Master

Type:Thesis

Country:China

Candidate:Q Ding

Full Text:PDF

GTID:2568306941484034

Subject:Cyberspace security

Abstract/Summary:

Source code authorship identification generally refers to the process of identifying the author of an anonymous piece of source code from a set of candidate identities. It usually relies on unique characteristics of the code that reflect the programming style the author leaves behind. Source code authorship identification techniques are widely applied in malware homology detection, software copyright dispute resolution, code plagiarism detection, and related tasks. At present, most research depends on lexical and syntactic features extracted from the token sequence of the source code or from its abstract syntax tree. However, both types of features are easily affected by code formatting tools and code obfuscation tools. In contrast, structural features at the semantic level are more resistant to obfuscation attacks. To address this problem, this thesis proposes a source code authorship identification method that learns semantic structural features from the abstract syntax tree of the source code, using a graph neural network to identify the author of anonymous source code. The main research results are as follows.

Firstly, a method of extracting semantic features based on an enhanced abstract syntax tree is proposed. This method compensates for the effect of obfuscation on the source code token sequence. After the source code is parsed into the original abstract syntax tree, additional reinforcement edges carrying context and control flow information are added. A graph neural network is then used to learn semantic features from the enhanced abstract syntax tree.

Secondly, a hierarchical graph learning model for abstract syntax trees based on the GraphSAGE algorithm is proposed. After the source code is transformed into an enhanced abstract syntax tree, Word2Vec is used to pretrain the node representations, and GraphSAGE convolution is then used to aggregate information from neighboring nodes. The graph nodes are filtered by a TopKPooling layer to reduce the size of the graph. The model adopts a hierarchical architecture: after multiple rounds of convolution, pooling, and global readout, the final output representation aggregates the structural information of the graph at multiple levels.

Thirdly, a source code authorship identification system is designed. To verify the effectiveness of the proposed method, a corresponding system is developed and implemented. The system mainly comprises source code preprocessing, abstract syntax tree extraction, syntax tree processing, graph neural network learning, prediction, and output modules. The overall workflow of the system and the details of each module are described.

Finally, experiments on the whole model and on each module, conducted on multiple open datasets, show that the proposed feature extraction method based on the enhanced abstract syntax tree can effectively extract discriminative structural features. The proposed graph neural network model achieves results comparable to those of other traditional deep learning models on multiple open datasets. In addition, because the method relies on semantic-level structural features of the source code, it is more robust than other solutions.
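
The enhanced abstract syntax tree described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the programming language of the datasets or the exact definition of the reinforcement edges. The sketch assumes Python's standard `ast` module as a stand-in parser, and the function name `build_enhanced_ast` and the edge kinds `next_sibling` (a simple form of context edge) and `next_stmt` (a simple form of control flow edge between consecutive statements) are hypothetical choices made for illustration.

```python
import ast


def build_enhanced_ast(source):
    """Return (labels, edges) for an AST augmented with extra edges.

    labels[i] is the node type name of node i. edges holds
    (src, dst, kind) triples, where kind is "child", "next_sibling",
    or "next_stmt". The last two kinds are illustrative assumptions.
    """
    labels, edges = [], []

    def visit(node):
        # Every visit creates a fresh index, so the "child" edges form a tree
        # even though CPython shares some node objects (for example ast.Load).
        idx = len(labels)
        labels.append(type(node).__name__)
        index_of = {}  # id(child) -> node index, used for statement edges
        prev = None
        for child in ast.iter_child_nodes(node):
            c = visit(child)
            index_of[id(child)] = c
            edges.append((idx, c, "child"))  # original AST edge
            if prev is not None:
                edges.append((prev, c, "next_sibling"))  # assumed context edge
            prev = c
        # Assumed control flow edges: link consecutive statements in a block.
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                stmts = [s for s in block if isinstance(s, ast.stmt)]
                for cur, nxt in zip(stmts, stmts[1:]):
                    edges.append((index_of[id(cur)], index_of[id(nxt)], "next_stmt"))
        return idx

    visit(ast.parse(source))
    return labels, edges


SAMPLE = "x = 1\nif x:\n    y = x + 1\n    z = y\nelse:\n    z = 0\n"
labels, edges = build_enhanced_ast(SAMPLE)
flow = [(labels[s], labels[d]) for s, d, kind in edges if kind == "next_stmt"]
print(len(labels), "nodes,", len(edges), "edges")
print("statement-order edges:", flow)
```

The resulting node labels and typed edge list are the kind of input a graph neural network library would consume after the labels are mapped to embeddings. Real control flow edges would also need to handle branches and loops, which this sketch deliberately leaves out.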

Keywords/Search Tags:

source code authorship identification, graph neural network, abstract syntax tree, authorship attribution

Related items

1	Authorship Attribution In Social Media Texts
2	Deep Learning Based Methods For Authorship Attribution
3	Research On Author Attribution Discrimination For Source Code
4	Research On Robustness Enhancement Of Code Authorship Attribution For Time Evolution
5	Cross-entropy approaches to software forensics: Source code authorship identification
6	Machine learning method for authorship attribution
7	Research On Text Authorship Identification Technology And Its Application In Network Tracking
8	Authorship attribution on the Enron Email Corpus
9	Research On Source Code Plagiarism Detection Based On Abstract Syntax Tree
10	Research And Implementation On Weibo Authorship Identification Based On Deep Learning