Research On C/C++ Source Code Vulnerability Mining Technology Based On Deep Learning

Posted on: 2023-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: T S Liu
Full Text: PDF
GTID: 2558306848955159
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid iteration of Internet technology, software has grown steadily more complex in order to meet user needs, and the number and severity of software vulnerabilities have grown with it. At the same time, attackers' methods keep escalating, which poses severe challenges to vulnerability mining. The harm caused by vulnerabilities can be reduced if the likelihood of introducing them is lowered at the initial stage of the software life cycle. As deep learning methods have performed well in natural language processing and other fields, security researchers have begun to use deep-learning-based detection methods to analyze source code, aiming to reduce the vulnerabilities introduced during the coding phase. Existing work indicates that, compared with traditional vulnerability mining methods, methods based on deep learning achieve higher efficiency and accuracy. Therefore, this thesis designs and implements a vulnerability mining system based on deep learning. To address the shortcomings of existing research, it considers the lack of vulnerability data sets, the classification of vulnerability types, the encoding characteristics of code languages, and the choice of deep learning methods. The work focuses on the following four aspects:

(1) To address the shortage of vulnerability data sets, a code slice generation algorithm is designed to build a large-scale experimental data set. Based on the characteristics of the code language, the algorithm extracts code slices that carry syntactic and semantic information, producing more than 150,000 code slices covering 9 vulnerability types. The experimental results show that the constructed data set effectively improves the model's ability to learn vulnerability characteristics.

(2) To enhance vulnerability classification, a term frequency-inverse document offset frequency (TF-IODF) algorithm is proposed to weight the word vector representations generated by the CBOW model. TF-IODF computes the inverse document offset frequency of a word from the deviation between the word's slice count and the average slice count, thus weakening the influence of words with extreme document frequencies. The experimental results show that TF-IODF effectively improves how well words represent the slices, thereby enhancing the classification ability of the model.

(3) To improve the accuracy of position encoding for code languages, the position encoding method of the self-attention (SA) layer is improved. The method analyzes the differences between code and ordinary natural-language text, and defines a relative distance matrix between words to supplement the position information of the self-attention layer, so as to enhance the model's ability to learn the language features of source code. The experimental results show that the distance-matrix position encoding expresses the encoding characteristics of the code language more accurately and benefits the extraction of vulnerability features.

(4) A CNN+SA-BLSTM hybrid model is designed and implemented. On top of a Bidirectional Long Short-Term Memory (BLSTM) model, it uses the SA mechanism to learn the global semantic features of the source code, and combines them with the local semantic features extracted by a convolutional neural network (CNN) to represent vulnerability features, improving the model's ability to extract source code vulnerability features. The experimental results show that the recall, F1 score, and specificity of this model reach 96.24%, 97.31%, and 94.93%, respectively, indicating stronger vulnerability mining and classification ability.
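For readers less familiar with the three metrics reported above, the following is a minimal sketch of their standard definitions, computed from the counts of a binary confusion matrix. The function name and the example counts are hypothetical, chosen only for illustration; they are not taken from the thesis and do not reproduce its results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: vulnerable samples flagged as vulnerable
    fp: clean samples flagged as vulnerable
    tn: clean samples flagged as clean
    fn: vulnerable samples flagged as clean
    """
    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # share of real vulnerabilities found
    precision = ratio(tp, tp + fp)     # share of alarms that are real
    specificity = ratio(tn, tn + fp)   # share of clean code left unflagged
    f1 = ratio(2 * precision * recall, precision + recall)

    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not from the thesis).
    example = classification_metrics(tp=90, fp=5, tn=95, fn=10)
    for name, value in example.items():
        print(f"{name}: {value:.4f}")
```

Recall and specificity are complementary here: recall measures how many vulnerable slices are caught, while specificity measures how many clean slices are correctly left alone, so reporting both guards against a model that simply flags everything.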
Keywords/Search Tags: C/C++ source code, Vulnerability mining, Deep learning, TF-IDF algorithm, Position encoding