| Lip reading has received increasing attention in recent years. It infers the content of speech from the movement of the speaker’s lips, and it has broad application prospects in human-computer interaction, surveillance and security, audio-visual speech recognition, and other fields. Recently, lip reading algorithms based on deep learning have made progress: by building deep, complex models, they exploit lip-movement changes more effectively to capture spatio-temporal features. However, challenges remain in visual feature extraction, temporal feature extraction, and model compactness. How to construct a lip reading model that is accurate, robust, and lightweight is the focus of this paper. Specifically, lip reading must process continuous video frames and take into account the correlations between both adjacent and distant frames. In addition, lip reading focuses on subtle changes in the lips and their surrounding region, so it must extract fine-grained features from small images. As a result, lip reading performance is generally low and research progress has been slow. To improve performance, a lip reading method based on 3D convolution and visual transformer (3DCvT) is proposed. The method combines a visual transformer with 3D convolution to extract spatio-temporal features from continuous frames, using the complementary properties of convolution and transformer to capture both local and global features. The extracted features are then fed to a BiGRU for sequence modeling. Accuracy and robustness are further improved with data augmentation, label smoothing, and word boundary information. Finally, the proposed model is validated on the large-scale public lip reading datasets LRW and LRW-1000. The experimental results show that the proposed method achieves recognition accuracies of 88.5% and 57.5% on the two datasets, respectively, effectively improving recognition accuracy. To make the model lightweight, this paper further proposes Mini-3DCvT, a lightweight lip reading model based on 3DCvT that performs well in model acceleration and compression. The model applies two techniques to the convolution and transformer structures: weight transformation and weight distillation. Specifically, weights are shared across layers and transformed to increase diversity, while weight distillation transfers learned features from the complex model to the simple one. The computation and parameter count are reduced while accuracy is maintained, effectively improving the overall performance of the lip reading model. |
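
Of the training techniques named above, label smoothing is the easiest to illustrate in isolation. The abstract does not give the smoothing factor or the exact formulation, so the following is only a minimal sketch under assumptions: the function names, the value `eps=0.1`, and the choice to spread `eps` uniformly over all classes (including the true one) are illustrative and not taken from the paper.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Build a smoothed target distribution.

    Assumed convention: a mass of eps is spread uniformly over all classes,
    and the remaining 1 - eps is added to the true class.
    """
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothing_loss(logits, target, eps=0.1):
    """Cross-entropy between the smoothed targets and the predicted distribution."""
    q = smooth_labels(target, len(logits), eps)
    log_p = log_softmax(logits)
    return -sum(qk * lpk for qk, lpk in zip(q, log_p))
```

With `eps=0` the loss reduces to ordinary cross-entropy. A larger `eps` penalizes over-confident predictions, which is the usual motivation for using it in word-level classification with many classes.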