| As computer science and software technology advance, users’ demand for software products grows, code bases become larger, and the probability of vulnerabilities rises accordingly. In addition, as attackers’ techniques continue to improve, software vulnerabilities are more likely to be maliciously exploited. Mining vulnerabilities at the source code level not only reduces the occurrence of vulnerabilities at their origin but also reduces the workload of subsequent vulnerability mining. Because traditional vulnerability mining methods are inefficient, deep learning can assist the work and improve detection efficiency and accuracy. This thesis designs a deep learning based vulnerability mining system that works at the source code level. It identifies four shortcomings in existing research, concerning the classification of vulnerability types, the handling of data set imbalance, the choice of deep learning model, and the characteristics of code writing. Taking these into account, it designs a Transformer-CNN (TF-CNN) vulnerability mining system based on a neighborhood-division weighted SMOTE algorithm, and improves the positional encoding in the Transformer model so that it better learns feature representations of source code. The main work is as follows. (1) This thesis constructs a large-scale source code data set with type annotations, divides it into 10 categories according to the causes of the vulnerabilities, and, to address the unbalanced distribution of the data set, proposes a weighted SMOTE algorithm based on neighborhood division (N-SMOTE). The N-SMOTE algorithm calculates a position weight or a density weight for each sample according to the distribution of the samples around it, and synthesizes different numbers of minority samples by setting different sampling weights. This increases sample diversity while avoiding the generation of noisy data, and reduces the impact of the unbalanced distribution between classes on the system. (2) A hybrid deep learning model, TF-CNN, is proposed, and the positional encoding in the Transformer encoder is improved using a logarithmic representation. The Transformer encoder and the CNN encoder learn the global and local features of source code through self-attention and convolution, respectively. Starting from the differences between how code is written and the grammar of ordinary text, a positional encoding method based on a logarithmic representation is designed; it better matches the characteristics of code writing and helps mine vulnerabilities effectively. (3) This thesis designs and implements a source code vulnerability mining system based on deep learning, and conducts comparative experiments from four aspects to evaluate the system comprehensively. The experimental results show that the system can effectively detect the vulnerability types present in source code, and that the N-SMOTE algorithm and the logarithm-based positional encoding help improve the system's vulnerability mining ability. The recall of the model reached 95.50%, the accuracy reached 98.18%, and the specificity reached 97.27%. |
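
For readers less familiar with the three reported metrics, the following minimal sketch shows their standard definitions in terms of confusion-matrix counts. The class name `ConfusionCounts`, the function names, and the counts themselves are hypothetical and chosen only for illustration; they are not the thesis's data, and the abstract does not state how the metrics are aggregated over the 10 vulnerability categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container for this sketch)."""
    tp: int  # vulnerable samples correctly flagged
    fp: int  # non-vulnerable samples incorrectly flagged
    tn: int  # non-vulnerable samples correctly passed
    fn: int  # vulnerable samples missed


def recall(c: ConfusionCounts) -> float:
    """Fraction of truly vulnerable samples that were detected."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of truly non-vulnerable samples that were passed."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all samples that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Illustrative counts only; these are NOT the thesis's experimental results.
counts = ConfusionCounts(tp=40, fp=5, tn=95, fn=10)

print(f"recall      = {recall(counts):.2%}")
print(f"specificity = {specificity(counts):.2%}")
print(f"accuracy    = {accuracy(counts):.2%}")
```

Recall and specificity are reported alongside accuracy because, on an unbalanced data set, accuracy alone can stay high even when most minority-class samples are missed.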