| With the development of the social economy and the progress of information technology, software has become ubiquitous in people’s lives, bringing great convenience to daily life and improving efficiency in many domains. However, along with this convenience and improvement, software vulnerabilities have posed great threats to social order and public security. Binary code similarity analysis, also known as binary diffing, aims to determine whether two given binary code snippets are similar, which can be used to judge whether they share the same vulnerability. Research on binary code similarity analysis helps to refine software analysis theory, improve vulnerability detection methods, and protect the information security of individuals, organizations, and nations, and therefore has great theoretical value and practical significance. After years of development, researchers have achieved fruitful results in binary code similarity analysis. However, existing methods still have problems in the acquisition and representation of assembly code, and in the accuracy, efficiency, and flexibility of code comparison. This paper studies several key issues of binary code similarity analysis. The main contributions and innovations are as follows: 1. To solve the problem of base address detection in code acquisition, a base address detection method based on absolute address statistics and string reference matching is proposed. First, based on the characteristics of instruction formats and absolute address loading, absolute address searching algorithms for different architectures are proposed, which search for and record the absolute addresses in a program. The range of candidate base addresses is then determined from the distribution of the recorded addresses. Finally, based on the characteristics of string referencing, a string reference matching algorithm is proposed to calculate the matching rate under each
candidate base address, and the candidate with the highest matching rate is taken as the correct base address. 2. To solve the problem of basic-block-level binary code similarity comparison, a basic block embedding method based on BiLSTM is proposed. Because natural language processing and binary code analysis share common ground in semantic extraction and content summarization, methods from natural language processing are adapted to basic block processing. First, each assembly instruction is regarded as a word, and a Word2Vec model is used to encode instructions into embeddings that carry semantic features. Then, each basic block is regarded as a sentence and is represented as a sequence of instruction embeddings. The BiLSTM model takes a basic block as input, accumulates the semantic features of the instruction embeddings sequentially, and generates a basic block embedding that represents the semantics of the block. The similarity between basic blocks can be measured efficiently by the distance between their embeddings. 3. To solve the problem of function-level binary code similarity comparison, a binary function similarity analysis method based on graph embedding is proposed. First, each binary function is represented as an attributed control flow graph (ACFG) whose vertex attributes are the embeddings of the corresponding basic blocks. Then, a Structure2vec network is used to map each ACFG to a high-dimensional function embedding that carries both the information of the basic blocks and the control flow information between them. By calculating the distance between the embeddings of binary functions, function similarity comparison can be performed efficiently. 4. To solve the problem of recognizing similar binary code snippets, a binary code snippet recognition method based on vertex influence is proposed. First, the code snippet recognition problem is formalized as a subgraph matching problem, and binary code is represented as its
ACFG. After that, based on the characteristics of binary code analysis, a composite vertex influence metric combining functional and structural influence is proposed. Using this metric, the influence of each vertex in the query graph is calculated, and the vertex with the highest influence is selected as the central node. Then, three filtering rules are proposed to search for the nodes matching the central node, and sub-areas of the target graph are extended according to the minimum spanning tree of the query graph. Finally, similar subgraphs are verified in each sub-area and a similarity value is calculated. 5. The binary code similarity analysis methods proposed in this paper are applied to vulnerability detection. The process of building a CVE vulnerability database and using it for vulnerability detection is introduced, illustrating how the proposed methods can be combined to solve key issues in the different phases of vulnerability detection. In addition, the details of applying the proposed methods to detect vulnerabilities are demonstrated through case studies. |
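
The string reference matching idea in contribution 1 can be illustrated with a minimal Python sketch. It is not the implementation described in this paper: it assumes a raw little-endian 32-bit image, treats every aligned 4-byte word as a possible absolute address instead of decoding architecture-specific load instructions, and takes the candidate base range as an input rather than deriving it from the address distribution. The function names `find_strings`, `find_absolute_addresses`, and `detect_base` are hypothetical.

```python
import struct


def find_strings(image: bytes, min_len: int = 4) -> set:
    """Return file offsets at which a NUL-terminated printable ASCII string starts."""
    offsets = set()
    start = None
    for i, b in enumerate(image):
        if 0x20 <= b < 0x7F:          # printable ASCII byte
            if start is None:
                start = i
        else:
            # A run counts as a string only if it is long enough and NUL-terminated.
            if start is not None and b == 0 and i - start >= min_len:
                offsets.add(start)
            start = None
    return offsets


def find_absolute_addresses(image: bytes) -> set:
    """Simplifying assumption: every aligned little-endian 32-bit word is a candidate address."""
    return {
        struct.unpack_from("<I", image, off)[0]
        for off in range(0, len(image) - 3, 4)
    }


def detect_base(image: bytes, candidates):
    """Return (base, rate) for the candidate base with the highest string matching rate.

    The matching rate is the fraction of string offsets s for which
    base + s appears among the recorded absolute addresses.
    """
    strings = find_strings(image)
    addrs = find_absolute_addresses(image)
    if not strings:
        return None, 0.0
    best_base, best_rate = None, -1.0
    for base in candidates:
        hits = sum(1 for s in strings if base + s in addrs)
        rate = hits / len(strings)
        if rate > best_rate:          # ties keep the first (lowest) candidate
            best_base, best_rate = base, rate
    return best_base, best_rate
```

A caller would pass the raw image bytes and an iterable of candidate bases, for example `detect_base(image, range(0x08000000, 0x08010000, 0x1000))`. The 0x1000 step is an illustrative alignment assumption, not a value taken from the paper.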