| In software engineering, research on code clone detection has never stopped. The purpose of code clone detection is to find the clones that exist in a software system, to analyze the impact of cloning on software quality so that clones can be used appropriately, and to refactor or avoid harmful clones that threaten software quality, thereby improving the quality of the software system, raising developer productivity, and reducing maintenance cost. To date, a variety of clone detection methods and techniques have accumulated. Mainstream detection technologies include syntactic-structure-based detection and non-syntactic-structure detection. Non-syntactic-structure detection is divided into text-based, token-based, and code-metrics-based detection. The advantage of these methods is fast detection speed. However, because they do not consider the similarity between code structures, they generally perform poorly on high-level clones. Conversely, detection based on syntactic structure works better on high-level clones because it converts code into a parse tree or program dependency graph and uses the structural information in the code for the final detection. At the same time, the biggest problem with syntactic-structure-based detection is its inefficiency, which stems from the conversion of code into a program dependency graph. Tree and graph structures, as well as the matching algorithms used on them, are expensive. As a result, the applicability of this kind of technology is also greatly limited.
Recently, deep learning techniques have been adopted to improve code representation capability and advance the state of the art in code clone detection. These approaches usually require a transformation from the abstract syntax tree (AST) to a binary tree to incorporate syntactical information, which introduces overhead. Moreover, these approaches conduct term embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique for clone detection. Our approach first conducts tree embedding to obtain a node vector for each intermediate node in the AST, which captures the structural information of ASTs. We then compose a tree vector from its constituent node vectors using a lightweight method. Lastly, Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and conduct an evaluation using BigCloneBench (BCB) and 6 other large-scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches. |
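
The last two steps of the pipeline described above (composing a tree vector from node vectors, then comparing trees by Euclidean distance) can be illustrated with a minimal sketch. This is not TECCD's implementation. The `Node` class, the lookup of node vectors by node label, the use of element-wise summation as the "lightweight" composition, and the `threshold` parameter are all assumptions made for illustration, and the node vectors are taken as already learned by some embedding step.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a type label plus child nodes."""
    label: str
    children: list = field(default_factory=list)


def intermediate_nodes(root):
    """Yield every non-leaf node of the tree in pre-order."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            yield node
            stack.extend(reversed(node.children))


def tree_vector(root, node_vectors, dim):
    """Compose a tree vector from the vectors of its intermediate nodes.

    Assumption: composition is an element-wise sum, and node vectors are
    looked up by node label. The source only says "a lightweight method".
    """
    total = [0.0] * dim
    for node in intermediate_nodes(root):
        vec = node_vectors[node.label]
        for i in range(dim):
            total[i] += vec[i]
    return total


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)


def is_clone(tree_a, tree_b, node_vectors, dim, threshold):
    """Report a clone pair when the tree vectors are within `threshold`.

    `threshold` is a hypothetical tuning parameter, not a value from the source.
    """
    va = tree_vector(tree_a, node_vectors, dim)
    vb = tree_vector(tree_b, node_vectors, dim)
    return euclidean(va, vb) <= threshold
```

Because each tree is reduced to one fixed-length vector, comparing two code fragments costs a single distance computation rather than a tree or graph matching step, which is the efficiency argument the abstract makes against syntactic-structure-based detectors.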