TECCD: A Tree Embedding Approach For Code Clone Detection

Posted on: 2020-09-22

Degree: Master

Type: Thesis

Country: China

Candidate: Y Gao

GTID: 2518306518966919

Subject: Software engineering

Abstract/Summary:

Code clone detection has long been an active research topic in software engineering. Its purpose is to find the clones that exist in a software system, to make sound use of them by analyzing the impact of cloning on software quality, and to refactor or avoid harmful clones that threaten that quality, thereby improving the quality of the system, raising developer productivity, and reducing maintenance cost. Many methods and techniques have accumulated in this field. Mainstream detection techniques fall into two groups: detection based on syntactic structure and detection not based on syntactic structure. The latter is further divided into text-based, token-based, and metric-based detection. The advantage of these methods is fast detection speed. However, because they do not consider the similarity between code structures, their results on higher-level clones are generally unsatisfactory. Conversely, techniques based on syntactic structure detect higher-level clones more effectively, because they convert code into a parse tree or program dependency graph and use the structural information in the code for the final detection. Their biggest problem is inefficiency: converting code into a tree or graph structure, and running matching algorithms over these structures, is expensive. As a result, the application of these techniques is also greatly limited.

Recently, deep learning techniques have been adopted to improve code representation capability and have advanced the state of the art in code clone detection. These approaches usually require a transformation from the AST to a binary tree in order to incorporate syntactical information, which introduces overhead. Moreover, they rely on term embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique for clone detection. Our approach first performs tree embedding to obtain a node vector for each intermediate node in the AST, which captures the structural information of the AST. We then compose a tree vector from the node vectors it contains, using a lightweight method. Finally, Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and evaluate it on BigCloneBench (BCB) and 6 other large-scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches.
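
The pipeline outlined in the abstract (node vectors, a composed tree vector, then a Euclidean distance check) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TECCD's implementation: the node vectors are made-up values rather than trained embeddings, a plain average stands in for the unspecified "lightweight method" of composition, and the names `NODE_VECTORS`, `tree_vector`, `is_clone` and the 0.1 threshold are all hypothetical.

```python
import math

# Hypothetical node vectors keyed by AST node type. In the approach described
# above these would come from a trained tree embedding; the values here are
# invented purely for illustration.
NODE_VECTORS = {
    "MethodDeclaration": [0.9, 0.1, 0.0],
    "ForStatement": [0.2, 0.8, 0.1],
    "WhileStatement": [0.3, 0.7, 0.1],
    "IfStatement": [0.1, 0.2, 0.9],
    "ReturnStatement": [0.5, 0.5, 0.5],
}


def tree_vector(node_types):
    """Compose a tree vector as the mean of its node vectors.

    Averaging is an assumption standing in for the abstract's
    unspecified lightweight composition method.
    """
    vectors = [NODE_VECTORS[t] for t in node_types]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_clone(tree_a, tree_b, threshold=0.1):
    """Flag two trees as clones when their vectors are close.

    The threshold value is hypothetical and would need tuning on real data.
    """
    return euclidean(tree_vector(tree_a), tree_vector(tree_b)) <= threshold


# Each "tree" is reduced to the list of its intermediate node types.
for_method = ["MethodDeclaration", "ForStatement", "ReturnStatement"]
while_method = ["MethodDeclaration", "WhileStatement", "ReturnStatement"]
if_method = ["MethodDeclaration", "IfStatement", "IfStatement"]

print(is_clone(for_method, while_method))
print(is_clone(for_method, if_method))
```

With these toy vectors, the loop-based methods land close together while the branch-heavy method does not. Comparing fixed-length vectors avoids the pairwise tree or graph matching that the abstract identifies as the main cost of structure-based detection.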

Keywords/Search Tags:

Code Clone Detection, AST, Tree-embedding, Skip-gram, Sentence2vec

Related items

1	Research On Code Clone Detection And Clone Bug Finding
2	A Method For Automatic Construction Of Corpus And Code Clone Detection
3	Detection Of Function-based The Structural Clone And The Semantic Clone
4	Research On Code Clone Extension Analysis And Management Technology
5	Research On Code Clone Detection Based On Deep Learning
6	Research On Algorithm Optimization And Application Of Code Clone Detection Task
7	Research On Tree-based Clone Detection In Web Application
8	Research On Clone Detection Based On Intermediate Representation Of Source Code
9	Research On Code Clone Detection And Code Clone Vulnerability Analysis Technology For Large-scale Software
10	Research And Implementation Of Blockchain Smart Contract Code Clone Detection Based On Graph Embedding