Research On Clone Detection Based On Intermediate Representation Of Source Code

Posted on: 2022-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: L S Hou
Full Text: PDF
GTID: 2518306572959819
Subject: Computer technology
Abstract/Summary:
A code clone is a repeated code fragment that is syntactically or semantically similar to another fragment. Code clone detection plays a vital role in tasks such as software maintenance, code refactoring, and vulnerability detection. To save substantial human and material resources, the automatic detection of code clones in code repositories is one of the most critical problems in software engineering. In recent years, the use of machine learning to analyze source code has attracted widespread attention, and many researchers have applied it to code clone detection. Early work relied mainly on information retrieval methods, which lose much important semantic information. Recent studies have shown that intermediate representations of source code, such as abstract syntax trees, improve a model's ability to understand source code semantics and therefore to complete downstream tasks such as code clone detection. This thesis therefore starts from different representations of source code and completes the following research work.

First, this thesis proposes a code clone detection method based on learning over ST-trees (sentence trees), which combines lexical and syntactic information to analyze and characterize source code semantics. Four evaluation metrics are used to measure the effectiveness of the method on the code clone detection task. The method performs well on all four, reaching 98.6% on the comprehensive metric, the F1 score, which exceeds the state-of-the-art model (FA-AST-GMN).

Second, this thesis proposes a graph-based code clone detection method. The method first explicitly adds the control and data dependency information of the source code to the program graph representation, and then designs a learnable embedding function that maps the program graph to an embedding vector representation. Finally, a new attention mechanism computes the importance of each node according to the amount of global information the node contains, and the graph-level embedding of the source code is obtained as the weighted average of the graph node embeddings. Applied to code clone detection on the public dataset GCJ, the method achieves an F1 score of 95.4%.

Third, because the graph-based method considers only the graph structure information of the source code, this thesis proposes a code clone detection method based on fused representation learning to further improve its performance. This method addresses the weaker results of the graph-based method on the evaluation metrics without increasing the time cost. It makes two main improvements. First, after obtaining the program graph representation, it uses the text information of the source code as a supplement to the graph structure information to enrich the semantic representation. Second, to overcome the limitation that message passing in a graph neural network is confined to the k-order neighborhood, it uses a sequence network to learn graph node embeddings and thereby capture global information about the graph. On the code clone detection task, the method performs well on the four evaluation metrics, reaching an F1 score of 99%, which surpasses the state-of-the-art model (FA-AST-GMN).
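
The attention-based pooling step of the graph-based method can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the global context is taken as the tanh of a linear map of the mean node embedding, each node is scored by its dot product with that context, and the scores are normalised with a softmax so that the result is a true weighted average. The names `graph_embedding`, `cosine_similarity`, and `weight` are hypothetical and do not come from the thesis.

```python
import math


def graph_embedding(node_embeddings, weight):
    """Pool node embeddings into one graph-level vector using attention.

    node_embeddings: list of n vectors, each of length d.
    weight: d x d matrix as a list of rows (learned in a real model,
            fixed here; this parameterisation is an assumption).
    Returns (pooled_vector, attention_weights).
    """
    n = len(node_embeddings)
    d = len(node_embeddings[0])

    # Global context: mean node embedding, linearly mapped, then tanh.
    mean = [sum(h[j] for h in node_embeddings) / n for j in range(d)]
    context = [
        math.tanh(sum(weight[i][j] * mean[j] for j in range(d)))
        for i in range(d)
    ]

    # Score each node by how strongly it agrees with the global context.
    scores = [sum(h[j] * context[j] for j in range(d)) for h in node_embeddings]

    # Softmax so the attention weights are positive and sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Graph-level embedding: attention-weighted average of node embeddings.
    pooled = [
        sum(a * h[j] for a, h in zip(attention, node_embeddings))
        for j in range(d)
    ]
    return pooled, attention


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy example: three nodes with 2-dimensional embeddings, identity weight.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled, attention = graph_embedding(nodes, identity)
```

In a clone detector of this kind, two code fragments would each be pooled into a graph-level vector and then compared, for example with `cosine_similarity`, against a threshold. That comparison step is likewise an assumption, since the abstract does not state how the thesis scores a pair.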
Keywords/Search Tags: Abstract Syntax Tree, Source Code Semantics, Code Clone Detection, Graph-based Neural Networks, Fused Representation Learning