Research On Detection Methods Of Reused Open-source Code

Posted on: 2022-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: D H Zhang
GTID: 2518306323462384
Subject: Computer software and theory

Abstract/Summary:
Open-source code reuse is a common practice in software development that reduces time and human effort. However, code reuse can also introduce problems such as code vulnerabilities and open-source license violations, so it is necessary to detect reused code in software. Existing similarity-based methods, including clone detection methods, often report a large amount of accidentally cloned code, which is not reused code. Few existing code reuse detection methods take this factor into account. To this end, we propose two code reuse detection methods, one focusing on lexical features of code and the other on program dependency graphs (PDGs). The main contents and contributions are as follows:

(1) Research on open-source code reuse detection based on lexical features

Existing lexical-similarity-based reuse detection methods cannot accurately distinguish reused code from accidentally cloned code. To address this problem, a method for open-source code reuse detection is proposed, based on the frequencies of two types of lexical features, namely tokenized code lines and identifiers. First, cloned function pairs are extracted between developed projects and an open-source code repository. Then, for each pair, cloned code lines and shared identifiers are extracted. Finally, a metric is calculated from the frequencies of the cloned code lines and shared identifiers in the repository to determine whether the two cloned functions are reused ones. Compared with inverse document frequency (IDF), the metric assigns a higher weight to low frequencies and a lower weight to high frequencies. Evaluation on a labeled dataset shows that the F1 value of the proposed method is 84.2%, which is higher than that of the IDF-based method. Evaluation on real-world software shows that, overall, the proposed method is more accurate than the IDF-based method. Categories and causes of similar but non-reused code fragments are analyzed, with examples, to illustrate that similar code is not necessarily reused code.

(2) Research on open-source code reuse detection based on program dependency graphs

An existing study relies on general methods for calculating subgraph frequencies, which have high complexity. To reduce this complexity, a method for open-source code reuse detection is proposed, based on the frequencies of subgraphs in program dependency graphs (PDGs). First, for each pair of cloned functions obtained between developed projects and an open-source code repository, the PDGs of the functions are extracted. Next, two types of subgraphs, namely I-shaped subgraphs and X-shaped subgraphs, are extracted from the PDGs; these do not necessarily require general graph matching algorithms. Then, the subgraphs are encoded into sequences. Finally, a metric based on the frequencies of matched encoded sequences in the repository is calculated to determine whether the two cloned functions are reused ones. Evaluation on a labeled dataset shows that the F1 value of the proposed method is 82.4%, and that the proposed method avoids the complexity of the existing method. Evaluation on real-world software shows that the proposed method is largely complementary to the method based on lexical features. Categories and causes of non-reused code structures in similar code are analyzed, with examples, to illustrate that similar structures do not necessarily indicate code reuse.
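The frequency-based idea behind the lexical-feature method in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the metric's formula, so plain IDF is used here as the default weight, and all names (`tokenize_line`, `reuse_score`, the tiny `KEYWORDS` set, the regex-based tokenizer) are hypothetical. A different weight function, such as one steeper than IDF, could be passed in through the `weight` parameter.

```python
import math
import re
from collections import Counter

# Tiny illustrative keyword subset (assumption); a real tokenizer would use a full lexer.
KEYWORDS = {"int", "return", "for", "if", "else", "while"}
IDENT = r"[A-Za-z_]\w*"


def tokenize_line(line):
    """Abstract identifiers and numbers so that renamed lines still match."""
    out = []
    for tok in re.findall(IDENT + r"|\d+|\S", line):
        if tok in KEYWORDS:
            out.append(tok)
        elif re.fullmatch(IDENT, tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)


def features(func_src):
    """Return (set of tokenized lines, set of identifiers) for one function."""
    lines = {tokenize_line(l) for l in func_src.splitlines() if l.strip()}
    ids = {t for t in re.findall(IDENT, func_src) if t not in KEYWORDS}
    return lines, ids


def document_frequency(repo_functions):
    """Count in how many repository functions each feature occurs."""
    line_df, id_df = Counter(), Counter()
    for src in repo_functions:
        lines, ids = features(src)
        line_df.update(lines)
        id_df.update(ids)
    return line_df, id_df


def idf(df, n):
    """Baseline weight: inverse document frequency."""
    return math.log(n / df)


def reuse_score(project_func, repo_func, repo_functions, weight=idf):
    """Sum the weights of cloned lines and shared identifiers of a function pair.

    Rare shared features contribute more, so a high score hints at reuse
    rather than an accidental clone. Any decision threshold is left to the caller.
    """
    n = len(repo_functions)
    line_df, id_df = document_frequency(repo_functions)
    lines_a, ids_a = features(project_func)
    lines_b, ids_b = features(repo_func)
    score = 0.0
    for line in lines_a & lines_b:
        score += weight(max(line_df[line], 1), n)
    for ident in ids_a & ids_b:
        score += weight(max(id_df[ident], 1), n)
    return score
```

In this sketch, two loops that differ only in variable names share tokenized lines that are common across the repository and therefore score low, while a pair sharing an unusual line and unusual identifiers scores high.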
Keywords/Search Tags: Code reuse, Code clone, Accidental clone, Open-source software, Lexical analysis, Program dependency graph