| With the rise of artificial intelligence, deepfakes have entered public awareness because of the high fidelity of the faces they generate, and the steady stream of new deepfake software poses severe challenges to digital media security. Research on deepfake face video detection is therefore particularly important. Although the field is developing rapidly, two difficulties remain: poor generalization to emerging deepfake methods and low detection accuracy on low-quality videos. Building on deep learning theory, this thesis designs two schemes:

(1) A deepfake face video detection method based on the Vision Transformer. A CNN computes correlations only among adjacent pixels and therefore cannot fully exploit the spatial information of the whole image, although it has a useful inductive bias for image data. A Transformer, by contrast, can learn meaningful associations across long input sequences and has strong global modeling ability. A hybrid CNN and Vision Transformer model is therefore proposed to combine the advantages of both and improve the final detection performance. The preprocessed video frames are first fed into a feature extraction module, where two kinds of CNN extract and fuse features; the fused features are then passed to the Vision Transformer module, which performs the final binary classification of videos as real or fake. In place of the absolute positional encoding used in the standard Transformer architecture, a relative positional encoding is adopted. A MAD (Multi-Attention Dropping) unit is added to the Transformer encoder to further strengthen the model's attention to the key regions of real and fake faces, which effectively improves accuracy. The thesis also stresses the influence of data preprocessing on the experimental results and selects the face_recognition library, which offers higher face recognition accuracy, helping to ensure the reliability of the final test results. The proposed method achieves an average detection accuracy above 96%, higher than several representative existing deepfake face detection methods, and reaches an AUC of 97.29% on the DFDC dataset, which indicates, to a certain extent, that the model has stronger generalization ability.

(2) A deepfake face video detection method based on space-frequency information fusion. Current feature extraction networks are generally deep, and such networks have an inherent tendency to give high-frequency components low learning priority, so high-frequency information is underused in the final classification decision. Because of the way deepfakes are synthesized, discontinuities arise between the synthesized face and the surrounding pixels, and these edge discontinuities tend to appear in the high-frequency components of the frequency domain. Moreover, deepfake detection does not fundamentally depend on the semantic content of the image, so using frequency-domain information for detection and classification is effective and feasible. On this basis, a dual-branch structure is proposed. One branch applies the discrete cosine transform (DCT) to convert the video frames extracted from the deepfake dataset into the frequency domain and extracts features through a feature extraction module; the other branch feeds RGB images into a feature extraction module. The DCT branch and the RGB branch are then combined through "cross-stitch" units, so that the final classification makes full use of both information spaces. Both types of features help distinguish real video frames from deepfake frames, and fusing them deeply yields more discriminative features, which ultimately improves the model's accuracy. Experimental results show that the proposed method improves accuracy and AUC on high-quality videos and reaches an accuracy of 89.63% and an AUC of 90.16% on low-quality videos, outperforming many current state-of-the-art methods and showing a degree of robustness. |
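Scheme (1) passes each frame through two CNNs, fuses their features, and hands the result to a Vision Transformer. The abstract does not state how the fusion is done or how the fused map becomes a token sequence, so the sketch below assumes channel concatenation followed by treating each spatial position as one token, a common way to join a CNN backbone to a Transformer. The helpers `fuse_by_concat` and `to_tokens` are hypothetical, and the CNNs themselves are omitted.

```python
from typing import List

# A feature map is stored as [channel][row][col] (assumed layout).
FeatureMap = List[List[List[float]]]


def fuse_by_concat(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Fuse two CNN outputs by stacking their channels (assumed fusion rule)."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("feature maps must share the same spatial size")
    return a + b  # list concatenation here means channel concatenation


def to_tokens(fmap: FeatureMap) -> List[List[float]]:
    """Flatten a C x H x W map into H*W tokens (row-major), each C-dimensional."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][r][col] for c in range(channels)]
        for r in range(height)
        for col in range(width)
    ]


# Hypothetical outputs of two CNNs on the same 2 x 2 grid.
cnn_a = [[[1.0, 2.0], [3.0, 4.0]]]                                # 1 channel
cnn_b = [[[5.0, 6.0], [7.0, 8.0]], [[9.0, 10.0], [11.0, 12.0]]]  # 2 channels
tokens = to_tokens(fuse_by_concat(cnn_a, cnn_b))
```

Here `tokens` holds four 3-dimensional vectors, one per grid cell. In a full model each token would typically be linearly projected to the Transformer width before entering the encoder; that projection is left out of this sketch.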
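Scheme (1) replaces absolute positional encoding with a relative one, but the abstract does not give the exact formulation. One widely used form adds a bias to each attention score that depends only on the offset between query and key positions. The sketch below shows that form for a single head over a 1-D token sequence; `attention_with_relative_bias` is a hypothetical name, and in a trained model the `rel_bias` table would be learned rather than hand-set.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_relative_bias(
    q: Matrix, k: Matrix, v: Matrix, rel_bias: List[float]
) -> Matrix:
    """Single-head scaled dot-product attention over n tokens.

    rel_bias has length 2n - 1; rel_bias[(j - i) + n - 1] is added to the
    score between query i and key j, so it depends only on the offset j - i,
    not on the absolute positions i and j.
    """
    n, d = len(q), len(q[0])
    if len(rel_bias) != 2 * n - 1:
        raise ValueError("rel_bias must have 2n - 1 entries")
    out = []
    for i in range(n):
        scores = [
            sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
            + rel_bias[(j - i) + n - 1]
            for j in range(n)
        ]
        weights = softmax(scores)
        out.append(
            [sum(weights[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return out
```

With all-zero queries and keys, a uniform bias makes every token attend equally (each output row is the mean of `v`), while a bias that strongly favors offset 0 makes each token attend almost only to itself.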
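The frequency branch of scheme (2) relies on the DCT exposing blending discontinuities as high-frequency energy. The sketch below computes an orthonormal 2-D DCT-II of a small grayscale block and measures the share of energy in coefficients with u + v >= cutoff. The cutoff rule and the helper names are illustrative assumptions, not the thesis's configuration; in practice a library routine such as `scipy.fft.dctn(block, norm="ortho")` would replace the naive loops.

```python
import math
from typing import List

Block = List[List[float]]


def dct2(block: Block) -> Block:
    """Orthonormal 2-D DCT-II of an N x M block (naive loops, for clarity)."""
    n, m = len(block), len(block[0])

    def alpha(k: int, size: int) -> float:
        # Scaling factors that make the transform orthonormal.
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    coeffs = []
    for u in range(n):
        row = []
        for v in range(m):
            total = 0.0
            for x in range(n):
                cx = math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                for y in range(m):
                    cy = math.cos(math.pi * (2 * y + 1) * v / (2 * m))
                    total += block[x][y] * cx * cy
            row.append(alpha(u, n) * alpha(v, m) * total)
        coeffs.append(row)
    return coeffs


def high_frequency_ratio(coeffs: Block, cutoff: int) -> float:
    """Share of spectral energy in coefficients with u + v >= cutoff."""
    total = sum(c * c for row in coeffs for c in row)
    if total == 0.0:
        return 0.0
    high = sum(
        coeffs[u][v] ** 2
        for u in range(len(coeffs))
        for v in range(len(coeffs[0]))
        if u + v >= cutoff
    )
    return high / total


flat = [[1.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (x + y) for y in range(4)] for x in range(4)]
```

A flat block puts all of its energy in the DC coefficient, while an abrupt alternating pattern, a crude stand-in for the edge discontinuities described above, pushes most of it into the high-frequency corner of the spectrum.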
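The "cross-stitch" unit comes from Misra et al. (2016), "Cross-stitch Networks for Multi-task Learning": at a chosen layer, each branch's output becomes a learned linear combination of both branches' activations. The sketch below applies one 2 x 2 mixing matrix per channel to an RGB-branch and a DCT-branch feature map of the same shape. The function name, the flattened data layout and the example weights are assumptions; the thesis may place or parameterize its units differently.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # [channel][flattened spatial position] (assumed)
Mix = Tuple[Tuple[float, float], Tuple[float, float]]


def cross_stitch(
    rgb: FeatureMap, dct: FeatureMap, alphas: List[Mix]
) -> Tuple[FeatureMap, FeatureMap]:
    """Per-channel cross-stitch: [rgb'; dct'] = A_c @ [rgb; dct] for channel c.

    alphas[c] = ((a_rr, a_rd), (a_dr, a_dd)) holds the (normally learned)
    weights for channel c.
    """
    if len(rgb) != len(dct) or len(alphas) != len(rgb):
        raise ValueError("branches and alphas must have the same channel count")
    out_rgb, out_dct = [], []
    for c, ((a_rr, a_rd), (a_dr, a_dd)) in enumerate(alphas):
        if len(rgb[c]) != len(dct[c]):
            raise ValueError("branches must have the same spatial size")
        out_rgb.append([a_rr * r + a_rd * d for r, d in zip(rgb[c], dct[c])])
        out_dct.append([a_dr * r + a_dd * d for r, d in zip(rgb[c], dct[c])])
    return out_rgb, out_dct


rgb_feat = [[1.0, 2.0]]
dct_feat = [[3.0, 4.0]]
mixed_rgb, mixed_dct = cross_stitch(rgb_feat, dct_feat, [((0.9, 0.1), (0.1, 0.9))])
```

Initializing each mixing matrix close to the identity, as in the example weights, lets the two branches start nearly independent and learn during training how much information to share.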