Research On Robustness Enhancement Of Code Authorship Attribution For Time Evolution

Posted on: 2024-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: S S Zhao
GTID: 2568307175968919
Subject: Cyberspace security
Abstract/Summary:
Source code authorship attribution is the process of identifying the author of a given piece of code. It is important in practical applications such as plagiarism detection, software forensics, and copyright attribution. With the rapid development of source code authorship attribution methods and deep learning techniques, a number of deep learning-based attribution methods have emerged. However, all of these methods suffer from quality degradation over time: their accuracy decreases as time evolves. Existing solutions to this problem are themselves limited by non-universality, high time complexity, and dependence on newly labeled samples. To address these problems, the main research work of this thesis is as follows:
1. To help deep learning-based source code authorship attribution methods learn the more distinctive programming style of each programmer, a negative sample construction method is proposed. The method generates negative samples by code transformation and style imitation, without changing the code's functionality or introducing syntax errors. Specifically, each negative sample combines local features from two different positive source categories, so that the negative samples cluster near the decision boundary between those two categories. This increases the inter-class distance and reduces the intra-class distance in the source domain.
2. To alleviate the quality degradation over time of deep learning-based source code authorship attribution, this thesis proposes Time Domain Adaptation (Time DA), a domain adaptation-based scheme for enhancing the robustness of code authorship attribution to time evolution. The scheme treats the time evolution problem as a domain adaptation problem, so that models trained in the source domain adapt better and faster to the target domain. First, a new feature extractor is added to the original network framework; then, the model is trained iteratively using positive and negative samples; finally, target domain adaptation is performed using a centroid-based pseudo-labeling strategy and a neighborhood clustering loss.
3. A new time-segmented dataset is extracted and reconstructed from the publicly available Google Code Jam dataset, and the feasibility and effectiveness of Time DA are verified through targeted experiments. The experimental results show that Time DA improves accuracy by 8.7% and 5.2% on the Java and C++ datasets, respectively, significantly enhancing the robustness of source code authorship attribution to time evolution. In addition, compared with the traditional unsupervised domain adaptation method, Time DA obtains slightly higher accuracy and reduces model training time by 87.3%, thanks to the centroid-based pseudo-labeling strategy it employs.
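The abstract does not spell out how the centroid-based pseudo-labeling step works, so the following is only a minimal sketch of one common way such a strategy can be realized, not the thesis's actual implementation. It assumes that each unlabeled target-domain sample has a feature vector and a vector of class probabilities from the source-trained model. The function names (`centroid_pseudo_labels`, `weighted_centroids`, and so on), the use of cosine similarity, and the single refinement round are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(a * b for a, b in zip(u, v))


def weighted_centroids(features, probs):
    """One centroid per class: probability-weighted sum of features, normalized."""
    num_classes, dim = len(probs[0]), len(features[0])
    centroids = []
    for k in range(num_classes):
        acc = [0.0] * dim
        for f, p in zip(features, probs):
            for d in range(dim):
                acc[d] += p[k] * f[d]
        centroids.append(l2_normalize(acc))
    return centroids


def hard_centroids(features, labels, num_classes):
    """One centroid per class from the current hard pseudo-labels."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for f, y in zip(features, labels):
        for d in range(dim):
            sums[y][d] += f[d]
    return [l2_normalize(s) for s in sums]


def assign(features, centroids):
    """Label each sample with the index of its most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda k: dot(f, centroids[k]))
        for f in features
    ]


def centroid_pseudo_labels(features, probs, rounds=1):
    """Assumed scheme: soft centroids first, then `rounds` hard refinements."""
    feats = [l2_normalize(f) for f in features]
    labels = assign(feats, weighted_centroids(feats, probs))
    for _ in range(rounds):
        labels = assign(feats, hard_centroids(feats, labels, len(probs[0])))
    return labels
```

Because labels come from the nearest class centroid rather than from the classifier's own arg-max, a target sample that the source model misclassifies can still receive the label of the cluster it actually sits in. Each round costs one pass over the samples per class, which is consistent with (though not proof of) the training-time savings the abstract attributes to this strategy.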
Keywords/Search Tags: Code authorship attribution, Deep learning, Time evolution, Domain adaptation