Research On Multi-platform Binary Codes Classification And Similarity Detection Techniques

Posted on: 2022-09-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B G Yuan
GTID: 1488306734971809
Subject: Computer Science and Technology
Abstract/Summary:
Binary code classification and similarity detection underpin malware family analysis and software code traceability, both of which play important roles in cyber security. Binary code comes in the PE, ELF and DEX formats and targets the x86, ARM, MIPS and PPC instruction architectures on various platforms. Even on a single platform, the same software code can produce different binary versions through evolution modes such as obfuscation and packing. These three attributes (multiple morphologies, multiple architectures and multiple evolution modes) cause problems of format incompatibility, limited applicability across architectures, and difficulty in analyzing evolved versions. How to perform feature representation, classification and similarity detection for multi-platform binary code has therefore become a technical difficulty in urgent need of a solution.

To address these problems, this dissertation first proposes a Markov image representation based on binary bytecode, which requires neither reverse analysis nor dynamic analysis and is highly compatible with multiple platforms. Second, to support various instruction architectures, it proposes a binary code classification method based on a lightweight convolutional neural network whose classification model is only about 1 MB in size. Third, to analyze the similarity of different evolved versions of a binary, it proposes a similarity detection method based on the weight-sequence birthmark of the dynamic control flow graph, obtained with dynamic instruction-level instrumentation. Finally, it builds a prototype system for multi-platform binary code classification and similarity detection based on this research. The main contributions are as follows:

(1) A binary code classification method based on Markov image representation and deep learning. Binary code lacks a simple and efficient feature representation, and advanced feature analysis methods are incompatible with different platforms. To solve this, the Markov hypothesis is introduced into bytecode sequence analysis: the Markov image of a binary is built from its byte transition probabilities, and a classification model for Markov images is constructed with a deep convolutional neural network. The method is compatible with multiple platforms because it is a static, low-level feature analysis based on bytecode. Moreover, the deep learning algorithm extracts features automatically, which reduces the dependence of feature engineering on expert experience. Experimental results show that the method can represent and classify binary code in the PE and DEX formats with accuracies of 99.264% and 97.364%, respectively, and that it outperforms the gray-image-based method and similar algorithms.

(2) A binary code classification method based on a lightweight convolutional neural network. Because IoT devices involve many instruction architectures and have weak computing power, complex classification algorithms are difficult to apply across multiple instruction architectures. To solve this, the dissertation proposes a multi-dimensional Markov image generation algorithm for binary code and constructs a lightweight classification model for multi-dimensional Markov images. The multi-dimensional Markov image captures richer byte distribution features than the Markov image and effectively improves classification performance. Furthermore, the lightweight model, based on depthwise separable convolution and channel shuffle, greatly reduces the number of model parameters while maintaining high accuracy. Experimental results show that the method classifies malware for ARM, MIPS, PPC and Android with accuracies above 95% with a classification model of only 1 MB. On the Microsoft malware classification benchmark dataset, its accuracy is 99.356%. Compared with similar methods, it achieves higher accuracy with lower overhead.

(3) A binary code similarity detection method based on the weight-sequence birthmark of the dynamic control flow graph. The multiple evolution modes of binary code make it difficult to analyze the similarity of different binary versions. To solve this, the dissertation proposes methods for constructing the dynamic control flow graph and extracting the weight-sequence birthmark, and it determines whether binary versions are similar according to the similarity of their birthmarks. Because it relies on dynamic instruction-level instrumentation, the method resists obfuscation better. Furthermore, it only needs to record jump instructions, so it can analyze more complicated evolved binary versions at a lower computational cost. Experimental results show that the method is highly reliable and strongly resistant to different compilation conditions, obfuscation tools and packing techniques, with an F-measure of 96.814%. Compared with similar methods, it remains effective even for encrypted software code.
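
As an illustration of the Markov image idea in contribution (1), the following is a minimal sketch and not the dissertation's implementation. It assumes a first-order model in which entry (i, j) of a 256x256 matrix is the estimated probability that byte j immediately follows byte i, obtained by row-normalizing transition counts. The function name `markov_image` is hypothetical, and the abstract does not specify how the probabilities are scaled to pixel values, so that step is omitted.

```python
from typing import List


def markov_image(data: bytes) -> List[List[float]]:
    """Return a 256x256 matrix of byte-transition probabilities.

    Assumption: entry [i][j] = count(i -> j) / count(i -> any byte).
    Rows for bytes that never occur as a predecessor stay all zero.
    """
    counts = [[0] * 256 for _ in range(256)]
    # Count each adjacent (current byte, next byte) pair in the file.
    for cur, nxt in zip(data, data[1:]):
        counts[cur][nxt] += 1

    image = []
    for row in counts:
        total = sum(row)
        # Row-normalize so each non-empty row sums to 1.
        image.append([c / total if total else 0.0 for c in row])
    return image


# Tiny example: transitions are 0->1, 1->0, 0->1, 1->0, 0->2.
sample = bytes([0, 1, 0, 1, 0, 2])
img = markov_image(sample)
```

Because the matrix size is fixed at 256x256 regardless of file length or file format, the same representation can be fed to one image classifier for PE, ELF or DEX inputs, which is consistent with the platform compatibility described above.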
Keywords/Search Tags:Cyber Security, Binary Analysis, Malware Classification, Similarity Detection, Dynamic Software Birthmark