| With the rapid growth of software size and the increasing complexity of system architecture and functionality, identifying and locating security vulnerabilities quickly and accurately is critical to ensuring software security and quality. However, popular rule-based vulnerability identification tools often have high false positive or false negative rates and rely heavily on human-crafted rules that generalize poorly. In recent years, the vast amount of code and data available in the software development process has brought both opportunities and challenges to data-driven software vulnerability identification, making it one of the most popular research directions in software security. However, most existing data-driven methods do not fully exploit the semantic information of historical vulnerabilities and therefore exhibit degraded performance in real-world scenarios. In addition, current research focuses mostly on coarse-grained vulnerability identification, and fine-grained vulnerability identification remains under-studied. Therefore, exploring more advanced data-driven models is of great importance for advancing automatic vulnerability identification in real-world applications. This paper studies data-driven methods for detecting security bugs, centered on the source code and bug reports that are closely connected with software security. For source code vulnerability identification at the coarse-grained level, improving the understanding of deep semantic information remains the core of vulnerability identification and is essential for maintaining the detection performance of data-driven approaches. To achieve this, we propose a novel vulnerability identification framework based on staged code representation, which first leverages the statement encode network (SENet) and program encode network (PENet) to learn low-level (i.e., token-level) and high-level (i.e., statement-level) program semantics 
respectively, and then generates accurate vector representations of programs that serve as input to the coarse-grained vulnerability identification models. Furthermore, graph neural networks and sequential neural networks on their own often fail to extract both the structural and the sequential features of source code effectively and comprehensively. This paper therefore proposes a novel hierarchical semantic-aware code representation learning framework, which combines Tree-LSTM and Graph-LSTM networks to extract the syntactic, structural and sequential features of source code, thereby achieving a deeper understanding of code. For source code vulnerability identification at the fine-grained level, we model the task as a sequence generation problem and as a sequence decision problem, and propose Seq2Seq-based and reinforcement learning-based fine-grained vulnerability identification methods, respectively. Both methods make full use of the available fine-grained labels to train detection models effectively, and thus improve the accuracy of fine-grained vulnerability identification. In addition, the reinforcement learning-based method (1) uses the reward mechanism to take into account the correctness of the predicted global vulnerability structure rather than only that of individual statements, which gives better guidance to policy learning and yields more reasonable and interpretable detection results; and (2) leverages the exploration mechanism to make the model robust to sparse vulnerability information, exploring and quantifying how likely a subset of statements is to cause vulnerabilities through active trial-and-error learning, which improves the generalization ability of the model. For security bug report detection, we propose a novel content-based data filtering and representation framework, LTRWES, which first uses a ranking model to efficiently calculate the content 
similarity between bug reports, and then filters out the non-security bug reports (NSBRs) that have higher content similarity to security bug reports (SBRs). Through this filtering process, LTRWES reduces the number of noisy bug reports in the training dataset and alleviates the class imbalance problem. In addition, due to the complex characteristics of security-related bug reports, most previous machine learning-based methods struggle to capture deep semantic information from the textual fields of bug reports. Therefore, we develop another novel noise filtering framework (FSDON), which first leverages a generative model to identify semantic "regions" in the word embedding space that are frequently mentioned in the textual fields of SBRs, and then filters out the noisy NSBRs that have a high probability of arising from security-related semantic "regions". Building on this data filtering, we build predictive models for SBR detection with different deep learning networks (LSTM, GRU, TextCNN and Multi-scale DCNN). Experiments on real-world projects show that our proposed method outperforms current state-of-the-art methods. For bug localization, the diversity of textual descriptions, the incomplete vulnerability descriptions in bug reports, and the lexical mismatch between the natural language text of bug reports and the technical terms in source code make it difficult to automatically locate the relevant source files for a given bug report. To alleviate these problems, we propose a novel bug localization method named Dev Bug Locator, which combines the domain knowledge of developers with information retrieval techniques. Dev Bug Locator’s particular strength is that it simulates the bug fixing process: it first identifies the appropriate developers for newly reported bugs, and then increases the suspiciousness of source files based on the historical bug-fix information of the real assignees, thus improving the accuracy of 
IR-based localization models. This paper investigates a novel framework for software vulnerability identification and localization based on program analysis and artificial intelligence technologies from a data-driven perspective. The aim is to break the monopoly and technology blockade of foreign commercial vulnerability identification tools, promote the progress of automatic vulnerability identification methods, lay the groundwork for intelligent vulnerability identification technology from fundamental theoretical research to practical application, and promote the development of the software security industry. |
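
The two-step bug localization idea summarized above (predict likely assignees, then raise the suspiciousness of files those developers fixed in the past) can be illustrated with a minimal sketch. This is not the method's actual implementation: the function name, the linear mixing rule, the `weight` parameter, and the toy data are all assumptions made here for illustration, and the IR scores and developer probabilities are taken as given inputs.

```python
from collections import Counter


def rerank_with_developer_history(ir_scores, dev_probs, fix_history, weight=0.3):
    """Re-rank source files by mixing IR scores with a developer-history boost.

    ir_scores:   {file: IR similarity score, assumed to lie in [0, 1]}
    dev_probs:   {developer: estimated probability of being the assignee}
    fix_history: {developer: list of files touched in that developer's past fixes}
    weight:      hypothetical mixing weight for the history signal
    """
    boost = Counter()
    for dev, prob in dev_probs.items():
        touched = Counter(fix_history.get(dev, []))
        total = sum(touched.values())
        for path, count in touched.items():
            # Share of this developer's past fixes that touched the file,
            # scaled by how likely the developer is to be the assignee.
            boost[path] += prob * count / total

    combined = {
        path: (1 - weight) * score + weight * boost.get(path, 0.0)
        for path, score in ir_scores.items()
    }
    # Most suspicious file first.
    return sorted(combined, key=combined.get, reverse=True)


# Toy inputs (invented for this example only).
ir_scores = {"a.py": 0.50, "b.py": 0.45, "c.py": 0.10}
dev_probs = {"alice": 0.8, "bob": 0.2}
fix_history = {"alice": ["b.py", "b.py", "c.py"], "bob": ["a.py"]}

ranking = rerank_with_developer_history(ir_scores, dev_probs, fix_history)
print(ranking)
```

In this toy case the IR score alone would rank `a.py` first, but the likely assignee's fix history moves `b.py` to the top. A linear mix is only one possible way to combine the two signals.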