Voice-lip movement consistency analysis judges whether the audio and video of a speaker were recorded at the same time and from the same person by measuring the correlation between lip movement and voice changes during pronunciation. Existing lip consistency analysis mainly focuses on lip data captured from the front. In practical applications, however, the changing viewpoint of real scenes is an unavoidable environmental factor, and the impact of multi-view lip data on lip consistency analysis still lacks dedicated study. At the same time, previous methods based on multivariate statistics often assume that the audio and video data are linearly correlated; since the two modalities are not simply linearly correlated, some nonlinearly correlated data features are neglected in the consistency analysis, which makes it difficult to improve its performance. In view of these problems, this paper studies voice-lip consistency analysis under the premise of multi-view data.

1. To address the multi-view problem, this paper proposes an improved lip image reconstruction algorithm based on Generative Adversarial Networks. The algorithm adds a self-mapping loss to the generator network, which checks the generator's input against its output and preserves the identity characteristics of same-domain lip images during reconstruction. The generator uses a U-Net structure, the discriminator is a Markovian (PatchGAN) discriminator, and the whole network uses down- and up-sampling to speed up model convergence. Experimental results show that the reconstructed lip shapes maintain a high correlation with the real lip shapes across different dimensions. Averaged over viewpoints, the Peak Signal-to-Noise Ratio (PSNR) between the reconstructed and real lip shapes is about 3.5% higher than that of the view2view model [50], and the Structural Similarity (SSIM) is more than 7.1% higher on average, which effectively achieves frontal reconstruction of multi-view lip shapes.

2. After obtaining the reconstructed frontal lip data, and to address the problems of multivariate statistical consistency analysis, this paper proposes a lip consistency analysis method based on 3D Coupled Convolutional Neural Networks (3DCCNN), combining the advantages of 3D convolutional networks in extracting nonlinearly correlated and spatiotemporal features from audio and video data. First, Mel-Frequency Cepstral Coefficients (with the DFT step discarded) are used to represent the speech modality, and grayscale continuous lip frames are used to represent the video modality. Then, the two modalities are mapped into the same representation space through separate networks and coupled there, with the coupling process optimized by a contrastive loss; at the same time, the network automatically selects suitable data pairs for training. Finally, the multi-modal features are used to evaluate the consistency of the audio and video data. Experimental results show that, compared with the multivariate statistical method, the equal error rate (EER) is reduced by about 5% across viewpoints, and by about 10% for near-frontal views, demonstrating the better performance of the proposed method.
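The coupling and evaluation steps in contribution 2 can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the standard margin-based contrastive loss and a threshold-sweep EER are shown under the assumption that consistent audio-video pairs are labeled 1 and higher scores mean more consistent; the function names and the margin value are illustrative.

```python
import numpy as np

def contrastive_loss(audio_emb, video_emb, labels, margin=1.0):
    """Margin-based contrastive loss coupling audio and video embeddings.

    audio_emb, video_emb: (N, D) arrays in the shared representation space.
    labels: (N,) array, 1 = consistent pair, 0 = inconsistent pair.
    Consistent pairs are pulled together (squared distance), inconsistent
    pairs are pushed apart up to the margin.
    """
    d = np.linalg.norm(audio_emb - video_emb, axis=1)
    pos = labels * d ** 2
    neg = (1 - labels) * np.maximum(margin - d, 0.0) ** 2
    return np.mean(pos + neg)

def equal_error_rate(scores, labels):
    """EER: operating point where false-accept and false-reject rates meet.

    scores: consistency scores, higher = more likely consistent.
    labels: 1 = genuinely consistent, 0 = inconsistent.
    Sweeps every candidate threshold and returns the rate where
    |FAR - FRR| is smallest.
    """
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # inconsistent accepted
        frr = np.mean(scores[labels == 1] < t)   # consistent rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

In this setup the negative distance between the two coupled embeddings can serve directly as the consistency score fed to the EER computation.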