Research On Software Source Code Vulnerability Detection Based On Deep Representation Learning

Posted on:2024-06-22

Degree:Master

Type:Thesis

Country:China

Candidate:X Yuan

GTID:2568307121985919

Subject:Engineering

Abstract/Summary:

With the widespread use of computer software, the types and number of software vulnerabilities are increasing, posing a significant threat to software system security. Software vulnerability identification is crucial for protecting software systems from attacks. Traditional vulnerability identification methods rely on manual analysis, which is costly and time-consuming. Machine learning-based code vulnerability detection methods can only capture shallow features of the code, which makes them difficult to adapt to complex vulnerability detection tasks. Researchers have proposed many deep learning methods to help developers identify vulnerabilities before software delivery. However, because publicly available software vulnerabilities are difficult to collect and relatively few in number, deep learning methods face the challenge of insufficient data in the field of software vulnerability detection. To mitigate this problem and extract as much vulnerability information as possible from limited data, this study conducted research from two perspectives:

(1) This paper proposes a C program vulnerability detection scheme based on composite features extracted from source code. To address the scarcity of vulnerability data, the scheme extracts more information from limited data by using composite features of the code for vulnerability detection. It employs two neural networks to capture code information from different code representations. On one hand, the source code is treated as a text sequence similar to natural language, and Gated Recurrent Units (GRU) are used to extract sequential features. On the other hand, the code is transformed into an abstract syntax tree, and a Gated Graph Neural Network (GGNN) is used to capture structural information of the code (e.g., loops, conditional statements). The two types of features are combined, and a Random Forest classifier is used to distinguish between vulnerable and non-vulnerable code. To evaluate the proposed method, experiments were conducted on a dataset consisting of 12 real open-source software projects. The results indicate that using composite features of the code for vulnerability identification improves accuracy compared to using only sequential features or only structural features. When detecting vulnerabilities in Xen (an open-source hypervisor), the output ranks Xen's functions by their probability of being vulnerable, and using composite features yields a recall rate that is 0.8% higher than using only structural features and 1.1% higher than using only sequential features. A comparison with other methods shows that the proposed method can identify more vulnerabilities. Among the top 200 functions ranked by probability of being vulnerable, the accuracy can reach 51%, which is 18% higher than some existing methods, such as Cross-VD.

(2) This paper designs a C program vulnerability detection framework based on the lightweight CodeBERT. In this framework, CodeBERT is used as a feature extractor that transforms the source code into different vector representations depending on the contextual information. Random Forest is then employed to classify code as vulnerable or non-vulnerable. This framework effectively addresses the problem of polysemy in code, allowing the neural network to learn code features more clearly. The CodeBERT model inherits the architecture of the BERT model, whose attention mechanism is advantageous for capturing code patterns with long-range dependencies. Additionally, focusing on multiple key variables in the data flow helps analyze and trace potential code defects. To evaluate the effectiveness of this method, it is compared with three commonly used word embedding methods (Word2Vec, GloVe, FastText). The experimental results show that in a vulnerability detection scenario involving 12 real C programs, CodeBERT outperforms the other three methods as a word embedding model. In the output, all the code is ranked by the probability of being vulnerable, organized by function. When retrieving the top 1% of functions with the highest probability of being vulnerable, CodeBERT achieves 15%, 32%, and 20% higher accuracy than Word2Vec, GloVe, and FastText, respectively. To make CodeBERT more suitable for C program vulnerability...
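
Both perspectives report results over a ranked list of functions: every function receives a probability of being vulnerable, the list is sorted by that probability, and the top of the list (the top 200 functions, or the top 1%) is inspected. The following Python sketch illustrates one way such a ranking could be scored. It is not the thesis's evaluation code. It assumes that the "accuracy" within the top k corresponds to precision at k, which the abstract does not state explicitly, and the function names, scores, and labels are hypothetical toy values.

```python
from typing import List, Sequence, Set, Tuple

Ranked = List[Tuple[str, float]]


def rank_functions(names: Sequence[str], scores: Sequence[float]) -> Ranked:
    """Sort functions by predicted vulnerability probability, highest first."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


def precision_at_k(ranked: Ranked, vulnerable: Set[str], k: int) -> float:
    """Fraction of the top-k ranked functions that are truly vulnerable."""
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for name, _ in top if name in vulnerable)
    return hits / len(top)


def recall_at_k(ranked: Ranked, vulnerable: Set[str], k: int) -> float:
    """Fraction of all truly vulnerable functions found within the top k."""
    if not vulnerable:
        return 0.0
    hits = sum(1 for name, _ in ranked[:k] if name in vulnerable)
    return hits / len(vulnerable)


# Hypothetical toy data: five functions with classifier scores
# (for example, the vulnerable-class probability from a random forest).
names = ["f_a", "f_b", "f_c", "f_d", "f_e"]
scores = [0.90, 0.20, 0.75, 0.40, 0.60]
vulnerable = {"f_a", "f_b", "f_e"}  # hypothetical ground-truth labels

ranked = rank_functions(names, scores)
p_at_2 = precision_at_k(ranked, vulnerable, 2)
r_at_3 = recall_at_k(ranked, vulnerable, 3)
print([name for name, _ in ranked])
print(p_at_2, r_at_3)
```

In this toy ranking, one of the top two functions is vulnerable, and two of the three vulnerable functions appear in the top three. A cutoff such as the top 1% could be expressed under the same assumptions by setting k to a fraction of the number of ranked functions.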

Keywords/Search Tags:

deep learning, code features extraction, vulnerability detection

Related items

1	Research On Software Buffer Overflow Vulnerability Detection Method Based On Deep Learning
2	Research On Buffer Overflow Vulnerability Detection Method For Windows Platform
3	Program Vulnerability Detection Through Learning On Code Text And Control Structure
4	Research And Implementation Of Source Code Vulnerability Detection Method Based On Deep Learning
5	Design And Implementation Of Source Code Vulnerability Detection System Based On Dynamic LSTM
6	An Approach For Using Deep Learning To Detect Code Vulnerabilities
7	Research On Automatic Generation Of Vulnerability Features For Program Code
8	Research On Software Vulnerability Detection Method Based On Code Feature Learning
9	Research On Security Detection Of Open Source Software For Source Code
10	Research On Intelligent Vulnerability Detection Methods Based On Scalable Code Metrics