| With the rapid development of artificial intelligence and natural language processing technology, smart speakers are becoming increasingly popular and permeate many aspects of human life. While smart speakers bring convenience, they also introduce security risks. To improve their security, voice commands transmitted over the network are encrypted; however, user privacy issues related to smart speakers continue to emerge. In fact, attackers can still use machine learning to infer the content of a user’s specific voice commands from encrypted traffic, and then use that private information for advertising or to carry out malicious attacks. This kind of traffic analysis is referred to as voice command fingerprinting. In recent years, improving the accuracy of voice command fingerprinting has become a hot research topic, and it remains a challenging task. Research on encrypted traffic recognition for smart speaker systems is therefore of great significance. Based on this analysis, this paper designs and implements a new method to improve the accuracy of voice command fingerprinting. The main work of this paper is as follows. First, this paper uses an adaptive and dilated residual network to process spatial features. In encrypted traffic recognition, spatial features of the encrypted data, such as packet size and direction, are usually the more significant ones, and this paper extracts and recognizes them with a new model. Second, this paper finds that temporal features help improve the accuracy of the fingerprinting attack, and therefore designs an attention-based bidirectional gated recurrent unit. This paper is the first study of encrypted traffic in smart speaker systems to investigate the effect of temporal features on encrypted traffic recognition. Third, this paper effectively combines the two models. The combined model extracts the spatial and temporal features of the encrypted data, transforms the data into the format required by the neural network input, and recognizes spatial and temporal features with high accuracy. Our method achieves an accuracy greater than 93.36% in a closed-world scenario, which exceeds that of other state-of-the-art methods (Wang et al., WiSec 2020). Considering that differences in human voices affect outgoing traffic, this paper also demonstrates that, in real-world scenarios, voice commands can still be recognized accurately using only incoming traffic. In a more realistic open-world setting, our model remains effective, obtaining a true-positive rate of 99.50% and a false-positive rate of 0.1%, compared with Wang et al.’s rates of 97.79% and 0.1%, respectively. Finally, this paper demonstrates that our model generalizes well: it can also be applied to website fingerprinting, where it outperforms Sirinam et al. (CCS 2018). |
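
The abstract mentions packet size and direction as spatial features, a transformation into the input format a neural network expects, and recognition from incoming traffic only. The following is a minimal sketch of one way such a transformation might look, not the paper's actual preprocessing. The function name `encode_trace`, the sign convention (+1 for outgoing, -1 for incoming), the zero padding, and the fixed length are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical packet record: (size_in_bytes, direction),
# where direction is +1 for outgoing and -1 for incoming (assumed convention).
Packet = Tuple[int, int]


def encode_trace(packets: List[Packet], length: int = 8,
                 incoming_only: bool = False) -> List[int]:
    """Encode a traffic trace as a fixed-length sequence of signed packet sizes.

    Each packet becomes size * direction, so the magnitude carries the size
    and the sign carries the direction. The sequence is truncated or
    zero-padded to `length` so every trace has the same shape.
    """
    sequence = []
    for size, direction in packets:
        if direction not in (1, -1):
            raise ValueError("direction must be +1 (outgoing) or -1 (incoming)")
        if incoming_only and direction == 1:
            continue  # drop outgoing packets when only incoming traffic is used
        sequence.append(size * direction)

    sequence = sequence[:length]                    # truncate long traces
    sequence += [0] * (length - len(sequence))      # zero-pad short traces
    return sequence


# Illustrative trace with made-up values.
trace = [(120, 1), (1500, -1), (60, 1), (900, -1)]
full_view = encode_trace(trace, length=6)
incoming_view = encode_trace(trace, length=3, incoming_only=True)
```

Under these assumptions, `full_view` keeps both directions, while `incoming_view` retains only the packets sent from the server to the speaker, mirroring the incoming-only setting described in the abstract.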