| Malware has become a global problem as computers continue to evolve and spread. Malware attackers cause huge financial losses to computer users, businesses and governments through frequent intrusions into computer systems, so identifying malware efficiently is an urgent issue. However, malicious code is complex, and new malware variants often use various methods to hide their malicious behaviour, which reduces a model's ability to extract software behavioural features. It is also difficult to manually review and filter software with malicious behaviour for data annotation, so the model cannot be updated to accommodate new, unknown malware variants. This thesis therefore addresses these problems in malware detection and conducts an in-depth study of malware identification based on sequences of software API calls. The details are as follows: (1) The first aim is to solve problems in existing malware classification schemes based on graph or sequence structures, such as insufficiently accurate extraction of software behaviour features and redundant API sequences that degrade detection performance. Firstly, a new call graph construction method splits the API sequence with a sliding window and transforms each window in turn, yielding multiple call graph snapshots that help the model understand software behaviour at a finer granularity. Secondly, a GAT model captures the structural information of the call graph at each moment to obtain local behavioural information. In addition, an attention mechanism is introduced into node feature aggregation; by computing an attention factor, it strengthens malicious API features and weakens obfuscating, redundant API features. The last module uses a GRU network, which is adept at capturing sequential time dependencies, to model how the call graph topology evolves and thus to understand software behaviour globally. This addresses the difficulty of capturing sequential behavioural features. Extensive experiments on the Alibaba Cloud dataset show that the proposed model, DyGAT, extracts behavioural features better than mainstream malware detection models, achieving better results on five classification metrics: accuracy, precision, recall, F1 score and AUC. (2) The second aim is to address the difficulty of annotating malware and the performance degradation caused by the large difference between the distribution of samples in the training set and that of real-world malware. This thesis proposes FAUDA, an unsupervised domain adaptation malware detection model based on a conditional generative adversarial network. FAUDA combines the source domain samples and augments the source domain data in feature space, which increases the diversity of source domain sample features and, to a certain extent, alleviates the shortage of training samples caused by difficult annotation work. In addition, an adversarial domain adaptation module continuously reduces the difference between the sample distributions of the training environment and the real environment, allowing the feature extractor to capture features that are common across domains and thus eliminating the need to annotate new malware samples. The friendllcc dataset is used as the source domain, representing data from the experimental environment, while the Alibaba Cloud dataset is used as the target domain, representing real-world software samples. Extensive experiments on both datasets show that FAUDA achieves the best performance on the target domain among the common models compared. |
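
The sliding-window construction of call graph snapshots described in (1) can be illustrated with a minimal sketch. The abstract does not specify how edges are defined or which window size is used, so the choices below are assumptions made only for illustration: an edge links each API call to the call that follows it inside the window, edge weights count repetitions, and the function name `call_graph_snapshots` and the parameters `window` and `stride` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def call_graph_snapshots(
    api_calls: List[str], window: int, stride: int
) -> List[Dict[Edge, int]]:
    """Split an API call sequence into sliding windows and turn each
    window into one call graph snapshot (a weighted edge map).

    Assumption: an edge (a, b) means call b directly follows call a
    within the window; the thesis may define edges differently.
    """
    if window < 2 or stride < 1:
        raise ValueError("window must be >= 2 and stride must be >= 1")

    snapshots: List[Dict[Edge, int]] = []
    # A sequence shorter than the window still yields one snapshot.
    last_start = max(len(api_calls) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = api_calls[start:start + window]
        # Count each consecutive (caller, next call) pair in the window.
        edges = Counter(zip(chunk, chunk[1:]))
        snapshots.append(dict(edges))
    return snapshots


if __name__ == "__main__":
    # Hypothetical API trace, used only to exercise the function.
    trace = ["LdrLoadDll", "NtCreateFile", "NtWriteFile",
             "NtClose", "NtCreateFile", "NtWriteFile"]
    for i, snap in enumerate(call_graph_snapshots(trace, window=4, stride=2)):
        print(i, snap)
```

In a pipeline like the one described, each snapshot would be encoded by the graph attention model and the resulting per-snapshot embeddings passed in order to the GRU; those stages are omitted here because the abstract does not give their configuration.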