| With the rapid growth in the number of Internet of Things (IoT) devices, the security of IoT device firmware cannot be ignored. As software requirements iterate continuously, developers who need to finish development tasks quickly often find code with similar functionality or purpose in other projects and port it into their own. However, this reused code and these components may contain latent defects or even vulnerabilities. It is therefore especially important to identify defects and vulnerabilities in IoT firmware code. Unfortunately, because source code is often unavailable and repairs can be costly, identifying these problems is difficult. A mainstream approach is to treat code known to contain defects or vulnerabilities as a query, match it by code similarity, and scan the programs under analysis to locate the defects or vulnerabilities. Because IoT devices use a variety of instruction sets (e.g., ARM, x86), analyzing the firmware of each device is challenging, since the binaries may differ in encoding and optimization options. In practice, binary code similarity analysis based on program character features and program logic features is well developed, but neither kind of feature adequately supports cross-architecture binary code similarity detection. This paper adopts a code-representation-learning approach: it measures the similarity of programs' semantic features through architecture-independent instruction embedding, and thereby addresses the challenge of cross-architecture detection at its root. The main work of this paper covers the following three aspects: (1) A cross-architecture method for extracting key information from binary code is proposed. First, the instruction addresses of the program are obtained through the Python interface provided by IDA. Based on the
instruction addresses and the jump relationships between basic blocks, the corresponding assembly instructions, function names, function control flow graphs, function call relationships, and other key information are extracted and organized in suitable data structures. This method provides multi-dimensional information for the later research. In addition, this paper packages the method as an automatic information extraction tool for cross-architecture binary code, which has been applied within the team. (2) IXFSim, a binary code similarity detection method based on intermediate code representation, is proposed. First, the code information extraction tool is used to extract each function's start address, end address, and name, and the RetDec tool is used to convert the binary code into LLVM IR, which is then normalized. As an intermediate representation independent of the instruction set architecture, LLVM IR hides the differences introduced by the architecture while retaining the semantics of the code. The normalized intermediate representation is then embedded, cross-architecture sample pairs are constructed manually, and the pairs are fed into a deep metric learning model to measure the similarity between cross-architecture code. To verify the effectiveness of the method, this paper constructs a dataset of 65,968 sets of program information based on the open cross-architecture binary dataset Trex, each set corresponding to one function-level piece of binary code. The experimental results show that IXFSim achieves 88% accuracy in cross-architecture code similarity detection. (3) CXFSim, a binary code similarity detection method based on contrastive self-supervised learning, is proposed. First, the assembly instructions of each function are extracted and normalized using the code information extraction tool. Compilation with different optimization options is then treated as a form of data augmentation, and self-supervised learning is
carried out on the augmented samples to produce an encoder for each specific instruction set architecture that is robust to the interference of optimization options. Finally, cross-architecture sample pairs are constructed manually and fed into a deep metric learning model built on the architecture-specific encoders, so as to fine-tune the encoders' parameters. The feature vectors generated by the architecture-specific encoders are mapped into the same embedding space, where their similarity is measured. The experimental results show that CXFSim achieves 90% accuracy in cross-architecture code similarity detection. |
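
The self-supervised step described for CXFSim can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive objective over cosine similarity, which the abstract does not specify; the function names `cosine` and `info_nce`, the temperature value, and the two-dimensional toy embeddings are all hypothetical. The idea shown is only that embeddings of the same function compiled with different optimization options should score higher with each other than with embeddings of other functions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not confirmed by the paper).

    anchors[i] and positives[i] are two views of the same function,
    e.g. embeddings of its -O0 and -O2 builds; positives[j] for j != i
    serve as negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


# Hypothetical toy embeddings for two functions at two optimization levels.
emb_o0 = [[1.0, 0.0], [0.0, 1.0]]
emb_o2_aligned = [[0.9, 0.1], [0.1, 0.9]]  # same functions, slightly perturbed
emb_o2_swapped = [[0.1, 0.9], [0.9, 0.1]]  # views paired with the wrong function

loss_aligned = info_nce(emb_o0, emb_o2_aligned)
loss_swapped = info_nce(emb_o0, emb_o2_swapped)
print(loss_aligned < loss_swapped)
```

In this toy setting the loss is small when the two optimization levels of each function lie close together and large when they are mismatched, which is the training signal such an encoder would be optimized against. The later fine-tuning stage on cross-architecture pairs could reuse the same similarity measure in the shared embedding space.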