Lipreading lies at the intersection of computer vision and natural language processing; it has important theoretical significance and a wide range of applications, such as lipreading-based human-computer interaction, public safety, and speech recognition. In recent years, lipreading algorithms based on deep learning have been able to make full use of lip movement information by building complex models, but challenges remain in visual feature extraction, temporal feature extraction, and visual confusion. The focus of this thesis is the construction of accurate and robust lipreading algorithms. Its main innovations are as follows.

(1) To cope with the influence of factors such as face angle, background, and lighting in lipreading images, a 3D-SENet module combining the ResNet-V2 structure with a channel attention mechanism is proposed for visual feature extraction. To extract the sequential characteristics of the video context, the encoder is built from a combination of Bi-LSTM and position-sensitive attention, and a speech synthesis model named L2W (Lip to Wav) is designed and implemented. Experimental results on the public English dataset GRID show that the accuracy of the L2W model reaches 78.85%, which is 4.55% higher than that of the related GAN-based algorithm, with further improvements of 1.82% and 1.21% on the STOI and MCD indicators. Module-level ablation experiments show that equipping the L2W model with 3D-SENet improves the MCD indicator by 1.21%, to 17.89%.

(2) Different Chinese words that share the same pronunciation are visually confusable in lipreading. Taking visual speech synthesis as an intermediate bridge, this thesis proposes the LWT model (Lip to Wav to Text). To enhance the feature extraction ability of lipreading, the 3D-SENet module is used in the short-term sequence feature extraction stage, and Bi-LSTM is combined with attention to strengthen the model's extraction of context features. Finally, visual speech serves as the intermediate bridge through which the text loss is traced back. Experimental results on the CMLR Chinese lipreading dataset show that the CAR and BLEU indicators reach 74.04% and 73.10%, respectively, 9.88% and 11.2% higher than those of the related algorithm CHSLR-VP, demonstrating that visual speech synthesis, as an intermediate bridge, helps to resolve visual confusion in Chinese lipreading.

Furthermore, applications of lipreading-based identity authentication and Chinese subtitle generation were studied on top of the L2W and LWT models. The L2W model captures long-time-series character pronunciation characteristics, enabling an identity authentication application based on lip features; experiments prove that the authentication scheme based on the L2W model is accurate and effective. The LWT model recognizes lipreading content over short time series, enabling a Chinese subtitle generation application that matches lip movements to subtitles; experiments prove that the subtitle matching scheme based on the LWT model is effective and feasible.
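As context for point (1), the following is a minimal PyTorch sketch of what a 3D-SENet block could look like: a pre-activation (ResNet-V2 style) 3D residual block whose output channels are reweighted by squeeze-and-excitation channel attention. The class names, layer sizes, and reduction ratio are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation channel attention over 3D (T, H, W) features."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)  # global T*H*W average per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.squeeze(x).view(b, c)          # (B, C) channel descriptors
        w = self.excite(w).view(b, c, 1, 1, 1)  # (B, C, 1, 1, 1) gates
        return x * w                            # reweight channels

class SENet3DBlock(nn.Module):
    """Pre-activation (ResNet-V2 style) 3D residual block with SE attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            SEBlock3D(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # identity shortcut, as in ResNet-V2
```

The appeal of this design is that the SE gate recalibrates channels with only two small linear layers, so attention over face angle, background, and lighting variation comes at negligible extra cost on top of the 3D convolutions.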
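The Bi-LSTM-plus-attention encoder of point (1) can likewise be sketched as below. Since the abstract does not detail the position-sensitive attention, a plain additive (Bahdanau-style) scorer stands in for it here, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttentionEncoder(nn.Module):
    """Bi-LSTM over per-frame lip features, followed by additive attention."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256):  # sizes are illustrative
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.query_proj = nn.Linear(hidden, 2 * hidden, bias=False)
        self.score = nn.Linear(2 * hidden, 1, bias=False)

    def forward(self, frames: torch.Tensor, query: torch.Tensor):
        # frames: (B, T, feat_dim) visual features, e.g. from 3D-SENet
        # query:  (B, hidden) decoder state at the current step
        keys, _ = self.bilstm(frames)  # (B, T, 2*hidden) context features
        e = self.score(torch.tanh(keys + self.query_proj(query).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=-1)                   # (B, T) weights
        context = torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)   # (B, 2*hidden)
        return context, alpha
```

The bidirectional LSTM gives each frame access to both past and future lip movement, while the attention weights let the decoder focus on the frames most relevant to the current output step.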
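Finally, the "intermediate bridge" idea behind LWT in point (2) amounts to the wiring below: the wav stage sits between the visual front end and the text decoder, so the text loss propagates back through the synthesized visual speech. All module names here are hypothetical placeholders, not components named in the thesis.

```python
import torch.nn as nn

class LWTPipeline(nn.Module):
    """Lip -> Wav -> Text wiring sketch with hypothetical submodules."""
    def __init__(self, visual_front: nn.Module,
                 wav_decoder: nn.Module, text_decoder: nn.Module):
        super().__init__()
        self.visual_front = visual_front  # e.g. 3D-SENet + Bi-LSTM encoder
        self.wav_decoder = wav_decoder    # lip features -> speech features
        self.text_decoder = text_decoder  # speech features -> characters

    def forward(self, video):
        feats = self.visual_front(video)   # (B, T, D) lip features
        speech = self.wav_decoder(feats)   # intermediate visual speech
        logits = self.text_decoder(speech) # (B, L, vocab) character scores
        return speech, logits              # text loss traced back through speech
```

Because homophonous Chinese words produce near-identical lip movements, routing recognition through a speech representation gives the text decoder an acoustic-style intermediate in which such words can still be separated by context.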