| Machine learning has been widely used for malware detection and has significantly improved detection rates. However, the discovery of malware adversarial examples exposes the vulnerability of machine learning-based detection models, especially neural network models. A malware adversarial example is a sample generated by marginally adjusting malware without affecting its function, so that it escapes identification by the malware detection model. Studying adversarial malware examples helps to optimize machine learning-based malware detection strategies. Existing generation methods usually produce adversarial malware examples by modifying features and training with gradient information. Windows application programming interface (API) calls are popular features because they describe the behavior and activities of malware at runtime. However, such methods usually select specific high-frequency API calls and encode them as a 0/1 vector, ignoring the contextual semantics of the API call sequence that identify the maliciousness of malware behavior. In addition, such generation methods require information such as the weight parameters of the model, which violates the black-box constraints of real-world scenarios. To address these problems, this paper makes two contributions. (1) A malware adversarial example generation method is proposed that uses a Generative Adversarial Network (GAN) to perceive the semantics of API call sequences. The scheme integrates a Long Short-Term Memory (LSTM) network into an Encoder-Decoder structure that serves as the generator of the GAN and produces the API call sequences of malware adversarial examples. Multiple malware classification models are trained as black boxes, simulating detectors based on traditional machine learning and deep learning in real scenarios. A substitute detector is trained to approximate the black box and passes gradient information to the generator. In an adversarial attack experiment, the scheme reduces the True Positive Rate (TPR) on malware to between 0.08% and 0.83% for isomorphic neural network models and to between 0.61% and 0.91% for heterogeneous models. An evaluation of the transferability of the adversarial malware examples shows that the examples generated by this scheme are highly effective against multiple neural network models. (2) An optimization strategy for malware detection that combines the attribute features and semantic features of API calls is proposed. In this scheme, feature hashing is used to extract attribute features for each API call, including its category, return value, and the different types of API call parameters. Based on the correlation within API call sequences, the timing characteristics of API calls are integrated into the attribute features. The Multi-Head Attention mechanism of Bidirectional Encoder Representations from Transformers (BERT) is used to strengthen the semantic dependencies between API calls, and the semantic features of the API call sequence are then extracted. The attribute features are combined with the semantic features to fine-tune the pre-trained BERT model for Windows malware detection. Experimental results show that the proposed detection strategy improves the detection accuracy of the classifier and significantly outperforms other models on a large dataset. |
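
The feature hashing step in contribution (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the record fields (`category`, `return`, `args`), the vector size of 16, and the use of MD5 with a signed update are all assumptions chosen only to show how the category, return value, and parameters of one API call can be mapped into a fixed-length vector.

```python
import hashlib


def hash_features(tokens, dim=16):
    """Map string tokens to a fixed-length vector using the signed hashing trick."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # First four bytes choose the bucket, one further bit chooses the sign.
        index = int.from_bytes(digest[:4], "little") % dim
        sign = 1.0 if digest[4] & 1 == 0 else -1.0
        vec[index] += sign
    return vec


def encode_api_call(call, dim=16):
    """Turn one API call record (hypothetical schema) into attribute features."""
    tokens = [
        "category=" + str(call["category"]),
        "return=" + str(call["return"]),
    ]
    # Sort the arguments so the token order does not depend on dict ordering.
    for name, value in sorted(call["args"].items()):
        tokens.append("arg:" + name + "=" + str(value))
    return hash_features(tokens, dim)


# Hypothetical record for a single call; the field names are illustrative only.
example_call = {
    "name": "NtCreateFile",
    "category": "file",
    "return": 0,
    "args": {"desired_access": "0x80100080", "share_access": 1},
}

features = encode_api_call(example_call)
```

Hashing keeps the vector length fixed regardless of how many distinct parameter values appear in the dataset, at the cost of occasional collisions between tokens; the signed update is one common way to reduce the bias those collisions introduce.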