| With the rapid development of information technology, artificial intelligence and modernization built on computer networks have come to play an important role in daily life. However, cyberspace faces many kinds of attacks. The most common is the malicious code attack, which has become one of the most important threats to cyberspace security. Among current malicious code attacks, advanced persistent threat (APT) attacks are the most influential and destructive, and they act as a cancer on the global cyberspace security situation. At present, the detection of APT attacks depends mainly on manual analysis by human experts, so analysis efficiency is low. Traditional detection methods based on file signatures no longer meet requirements, and an efficient automatic analysis method is urgently needed. In recent years, with the development of natural language processing, representation learning models have gradually been applied to malicious code homology analysis. In this paper, we use representation learning to perform malicious code homology analysis. Current representation learning for malicious code ignores instruction-level context information and does not analyze the features of assembly instructions in depth; as a result, it misses the rich semantics within instructions and cannot withstand code obfuscation introduced by different optimization levels and compilers. To address this problem, we propose a representation learning method for the assembly language of malicious code. The method consists of semantic representation learning of assembly instructions based on FunInstr2Vec and a function block vectorization method that combines the self-attention mechanism with a recurrent neural network. It improves Doc2Vec to learn explicitly at the granularity of assembly instructions and at the smallest atomic granularity within operands, and it learns the overall semantic representation of functions through self-attention and a recurrent neural network, which map the rich vector semantics inside the instructions. In comparative experiments, the proposed method reaches an F1-value of 94% in resisting code obfuscation and a precision of 93.8% in homology analysis, which is higher than other representation learning methods. Malicious code analysis methods from the natural language processing field also cannot extract features effectively, which leads to low accuracy and analysis errors. To address this problem, we propose a malicious code homology analysis method based on family gene similarity. The unsupervised clustering algorithm DEC clusters the function blocks produced by the assembly language representation learning method, which effectively extracts the gene clusters of each malicious code family. LightGBM then performs the homology analysis. In experiments, the accuracy of homology analysis on the Energetic Bear APT reaches up to 93.33%, exceeding other homology analysis methods, which indicates that the approach can deal effectively with unknown threats. |
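
To make the two granularities mentioned in the abstract concrete (whole assembly instructions and the smallest atoms within operands), the following is a minimal sketch of how one instruction might be split into tokens before any embedding is learned. It is not the FunInstr2Vec implementation, whose details the abstract does not give. The function names `split_instruction`, `operand_atoms`, and `instruction_tokens`, the regular expression, and the x86-style Intel syntax are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical atom pattern (an assumption, not taken from the paper):
# hex constants, identifiers such as registers, decimal numbers,
# and the punctuation used in memory operands.
ATOM_PATTERN = re.compile(r"0x[0-9a-fA-F]+|[A-Za-z_][A-Za-z0-9_]*|\d+|[\[\]+\-*:]")


def split_instruction(instruction: str) -> Tuple[str, List[str]]:
    """Return (opcode, operands) for one instruction string."""
    text = instruction.strip()
    if not text:
        raise ValueError("empty instruction")
    # The first whitespace-delimited word is the opcode; the rest are operands.
    parts = text.split(None, 1)
    opcode = parts[0].lower()
    operands: List[str] = []
    if len(parts) == 2:
        operands = [op.strip() for op in parts[1].split(",") if op.strip()]
    return opcode, operands


def operand_atoms(operand: str) -> List[str]:
    """Break an operand such as '[ebp+0x8]' into its smallest atoms."""
    return ATOM_PATTERN.findall(operand)


def instruction_tokens(instruction: str) -> List[str]:
    """Flatten an instruction into its opcode followed by operand atoms."""
    opcode, operands = split_instruction(instruction)
    tokens = [opcode]
    for operand in operands:
        tokens.extend(operand_atoms(operand))
    return tokens


if __name__ == "__main__":
    print(instruction_tokens("mov eax, [ebp+0x8]"))
```

Under these assumptions, each function body would become a sequence of such token lists, which a Doc2Vec-style model could then consume at either the instruction level or the atom level.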