| Copying, pasting, and modifying code is common practice. It improves development efficiency and shortens development time, but it also produces a large amount of identical or similar code, known as code clones. The popularity of open source projects has further accelerated the spread of code clones. Code clones degrade software quality, make software harder to maintain, and propagate vulnerabilities. Code clone detection is considered an effective means of dealing with these problems. However, existing code clone detection approaches are still not efficient enough for large-scale code. In particular, detecting near-miss clones in large codebases remains a challenge: as the source code grows, more computing and memory resources are required, and existing methods struggle to overcome these resource constraints to detect near-miss clones quickly. In addition, vulnerabilities introduced by code clones seriously threaten software security. Researchers have proposed approaches for detecting code clone vulnerabilities, but existing detection tools still cannot handle large-scale code. Existing graph-based or tree-based methods compare subgraphs or subtrees, and their time complexity makes them impractical for large-scale code. When a large-scale code clone detection approach is applied directly to code clone vulnerability detection, code that merely resembles vulnerable code is easily mistaken for a vulnerability, resulting in a high false positive rate. To address the shortcomings of existing work, this paper studies two problems: large-scale code clone detection and large-scale code clone vulnerability detection. 1. At present, there are still obvious deficiencies in the efficiency of near-miss clone detection
approaches and tools in large-scale code scenarios. To address this, this paper proposes and implements FastDCF, a fast and scalable distributed clone detection algorithm. FastDCF builds on the scalable distributed parallelism of MapReduce and HDFS to overcome the CPU and memory limits of a single node, which effectively improves detection efficiency. Specifically, FastDCF uses a partial index to reduce the number of code block comparisons, which further improves detection efficiency, and a multi-threading strategy to make full use of the computing power of each node. FastDCF detects not only Type-1 and Type-2 clones but also the more complex Type-3 clones in large-scale code. In addition, because the requirements of large-scale application scenarios are diverse, FastDCF uses a powerful and flexible parser that supports detection in multiple languages and at multiple granularities. The experimental results show that FastDCF can detect clones in 250 million lines of code in 24 minutes, which is more efficient than existing clone detection techniques. FastDCF is evaluated with two widely used benchmarks, BigCloneBench and the Mutation Framework, and achieves high recall and precision. Compared with existing open source tools of the same kind, it has a clear advantage in detection performance while maintaining high recall and precision. 2. Analyzing code clone vulnerabilities with the help of code clone detection helps uncover problems caused by code clones in a software system, reduce its defects, and improve its quality. This paper proposes SVD, a method based on program slicing, which extracts precise vulnerability and patch information from the program dependence graph and then converts the vulnerability information into lexical sequences for comparison. This method effectively
converts the comparison of graphs or trees into a comparison of lexical sequences, which improves the detection efficiency of the algorithm. At the same time, compared with applying a code clone detection method directly, the precise extraction of vulnerability information effectively improves accuracy. In the comparison stage, partial indexing and distributed processing are adopted to speed up comparison, so that SVD can be applied to large-scale code scenarios. Because patched code is easily mistaken for vulnerable code during detection, patch information is also used, which effectively reduces such false reports. The experiment uses the Linux kernel and CVE vulnerability information as the test set and successfully finds some vulnerability instances in the Linux kernel. |
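
The abstract names a partial index as the way to reduce the number of code block comparisons but does not describe how it is built. As a rough illustration only, the sketch below assumes a prefix-filtering scheme over token sets, which is one common way to realize such an index; it is not necessarily how the tools described here implement it. The function name `find_clone_pairs`, the input format (a dict from block id to token list), and the 0.7 threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def find_clone_pairs(blocks, threshold=0.7):
    """Return pairs of block ids whose token sets overlap by at least `threshold`.

    blocks: dict mapping a block id to a list of tokens (hypothetical format).
    Two blocks count as clones when they share at least
    ceil(threshold * max(size_a, size_b)) distinct tokens.
    """
    token_sets = {bid: set(toks) for bid, toks in blocks.items()}

    # Global token frequency: rare tokens sort first, so prefixes are selective.
    freq = Counter(tok for toks in token_sets.values() for tok in toks)

    def prefix(toks):
        # Only the first (n - ceil(threshold * n) + 1) tokens need indexing:
        # two blocks that meet the threshold must share a token in this prefix.
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        required = math.ceil(threshold * len(ordered))
        return ordered[: len(ordered) - required + 1]

    # Partial index: map each prefix token to the blocks that contain it.
    index = defaultdict(set)
    for bid, toks in token_sets.items():
        for tok in prefix(toks):
            index[tok].add(bid)

    # Candidate pairs are blocks that meet under at least one indexed token.
    candidates = set()
    for bids in index.values():
        candidates.update(combinations(sorted(bids), 2))

    # Verify each candidate against the full token sets.
    clones = set()
    for a, b in candidates:
        shared = len(token_sets[a] & token_sets[b])
        needed = math.ceil(threshold * max(len(token_sets[a]), len(token_sets[b])))
        if shared >= needed:
            clones.add((a, b))
    return clones
```

Only pairs that share a prefix token are ever verified, so most dissimilar blocks are never compared. In a distributed setting, the index-building and verification loops are the parts one would expect to partition across nodes, although the abstract does not say how that partitioning is done.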