A Method For Automatic Construction Of Corpus And Code Clone Detection

Posted on: 2020-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: W Sang
Full Text: PDF
GTID: 2518306518463494
Subject: Software engineering
Abstract/Summary:
Code clone detection is an important task in program quality analysis. Detecting cloned code helps improve the maintainability of software projects and reduce code redundancy. In recent years, researchers have proposed a variety of clone detection methods. With the development of machine learning, many learning-based detectors have emerged, especially ones that combine supervised learners with pre-processing steps such as word embedding, and these have proved effective for clone detection. However, supervised clone detection faces two problems. First, labels for the training corpus are difficult to obtain: constructing a clone corpus by manual labeling is time-consuming and subjective. Second, when a token-based intermediate representation is used, the learning ability of the learner is insufficient if the vocabulary composed of this finite set of tokens is used directly for pre-training. This thesis therefore proposes a clone detection method that addresses both problems. We propose a method for automatically constructing a large-scale pseudo-training corpus based on the definition of code clones, which reduces the cost of labeling and improves its accuracy. Moreover, we propose the TPE unit, an intermediate representation unit for clone detection that balances the expressive power of the pre-training vocabulary against the training cost. In addition, we build a standard BiLSTM model for clone detection and run experiments on the BigCloneBench dataset to validate our approach. The results show that, for the clone detection task, the training corpus generated by the automatic pseudo-clone corpus construction method performs better than the manually labeled clone corpus; that using the TPE unit as the intermediate representation of code during deep learning pre-training provides the neural model with richer code information and avoids the out-of-vocabulary (OOV) problem caused by using program statements as the intermediate representation; and that our clone detection model achieves better detection results than other advanced clone detection tools.
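The abstract does not say how the pseudo-training corpus is generated, so the following is only a minimal sketch of one way such a corpus could be built from a clone definition. It assumes the Type-2 definition (code that is identical up to identifier names) and produces a positive pair by consistently renaming identifiers. The function names, the regex tokenizer, the small keyword set, and the rule that treats a different snippet as a negative example are all hypothetical choices made for illustration; they are not taken from the thesis.

```python
import re

# Illustrative subset of Java keywords and primitive types (not exhaustive).
JAVA_KEYWORDS = {
    "int", "void", "return", "for", "if", "else",
    "while", "public", "static", "class", "new",
}

# Identifiers/keywords, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(code):
    """Split a code snippet into a flat list of lexical tokens."""
    return TOKEN_RE.findall(code)


def make_pseudo_clone(code):
    """Return a Type-2 style variant: every non-keyword identifier is
    renamed consistently (v0, v1, ...) in order of first appearance."""
    mapping = {}
    out = []
    for tok in tokenize(code):
        if IDENT_RE.fullmatch(tok) and tok not in JAVA_KEYWORDS:
            if tok not in mapping:
                mapping[tok] = f"v{len(mapping)}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return " ".join(out)


def build_pairs(snippets):
    """Build (left, right, label) triples without manual labeling.

    label 1: a snippet paired with its renamed variant (clone by construction).
    label 0: a snippet paired with the next, textually different snippet.
             This negative rule is a simplifying assumption; two different
             snippets could still be clones in a real code base.
    """
    pairs = []
    for i, snippet in enumerate(snippets):
        pairs.append((snippet, make_pseudo_clone(snippet), 1))
        other = snippets[(i + 1) % len(snippets)]
        if other != snippet:
            pairs.append((snippet, other, 0))
    return pairs
```

For example, `make_pseudo_clone("int sum(int a, int b) { return a + b; }")` yields the token string `int v0 ( int v1 , int v2 ) { return v1 + v2 ; }`, which is a clone of the input by construction and so needs no human label.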
Keywords/Search Tags: Code Clone Detection, Deep Learning, Word Embedding, Code Representation