| Lip recognition is a technology that analyzes a speaker's lip movements in order to predict the content of their speech. As a human-computer interaction technology within artificial intelligence, it has practical application value and has attracted wide attention in information security, assisted driving, and assisted medical treatment. In recent years, deep-learning-based research on English lip recognition has made great progress, but the accuracy of Chinese lip recognition remains low, and the problems of inadequate feature extraction and visual ambiguity are still to be solved. Taking full account of the diversity and complexity of lip recognition for Chinese sentences, this paper proposes a Transformer-based deep learning method for Chinese lip recognition and designs the 3DT-CHLipNet model, with the aim of addressing inadequate feature extraction and visual ambiguity and improving the accuracy of sentence-level Chinese lip recognition. The main research contents and innovations are as follows. First, time masking and variable-length augmentation strategies are introduced. Applied in the preprocessing stage of the input video data, these strategies not only reduce redundant image information and facilitate subsequent feature extraction, but also enhance the robustness of the model. Second, the temporal feature extraction and sequence modeling network is improved. To address insufficient feature extraction in lip recognition, a fusion model combining an attention mechanism with a temporal convolutional network (TCN) is used for temporal feature extraction for the first time. Compared with the RNNs commonly used for sequence problems, a TCN has a larger receptive field, can capture long-term dependencies, and can be computed in parallel, which shortens model training time. At the same time, the attention mechanism assigns more weight to key frames, learns more visual features, and strengthens feature extraction from the motion sequence of consecutive lip images. Third, the contextual semantic feature extraction network is improved. To address the "homogeneity" problem of visual ambiguity in Chinese lip recognition, the Transformer model is transferred to Chinese lip recognition and an additional language model is added to assist prediction. The Transformer architecture learns representations of lip movement information in different subspaces, improves the extraction of global context features, and further improves recognition accuracy for Chinese sentences. The additional language model helps the Transformer predict sentence sequences that better conform to Chinese semantic norms. Experimental results on the public CMLR dataset show that, compared with LipCH-Net and CSSMCM, two representative current models for sentence-level Chinese lip recognition, the proposed 3DT-CHLipNet algorithm effectively improves recognition accuracy and model robustness. |
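
The time masking strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how masked frames are filled or how the span length is chosen, so the details here are assumptions made only for illustration: the function name `time_mask`, the parameter `max_mask_len`, and the choice to replace one random contiguous span of frames with the clip's mean frame. Frames are represented as flat lists of pixel values to keep the example dependency-free.

```python
import random


def time_mask(frames, max_mask_len, rng):
    """Return a copy of `frames` with one random contiguous span
    replaced by the mean frame of the clip.

    Hypothetical helper: the fill value (mean frame) and the single-span
    policy are illustrative assumptions, not the paper's stated method.
    """
    num_frames = len(frames)
    if num_frames == 0 or max_mask_len <= 0:
        return [list(f) for f in frames]

    # Mean frame: per-pixel average over the time axis.
    frame_size = len(frames[0])
    mean_frame = [
        sum(f[i] for f in frames) / num_frames for i in range(frame_size)
    ]

    # Span length in [0, max_mask_len], capped at the clip length.
    mask_len = rng.randint(0, min(max_mask_len, num_frames))
    # Start index chosen so the span stays inside the clip.
    start = rng.randint(0, num_frames - mask_len)

    masked = [list(f) for f in frames]
    for t in range(start, start + mask_len):
        masked[t] = list(mean_frame)
    return masked


# Usage: a toy clip of 10 frames with 4 "pixels" each.
clip = [[float(t)] * 4 for t in range(10)]
augmented = time_mask(clip, max_mask_len=3, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing training runs with and without masking.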