Lip reading is the task of recognizing the speech content of a speaker based only on the dynamic information extracted from the speaker's lip movements. It plays an important role in applications at the intersection of computer vision and natural language processing. For example, in noisy environments or long-distance communication, lip reading can use visual information to predict what the speaker is trying to express. Lip reading can also be combined with audio recognition to improve recognition accuracy, or be used for audio-video alignment, matching visual and auditory feature sequences to synchronize the two streams. In addition, lip reading can serve as a liveness detector resistant to replay attacks, providing an effective complement to other biometric recognition methods.

The key difficulty of lip reading lies in exploiting the dynamic information of lip motion. Most traditional lip reading methods directly apply sequence models designed for natural language processing tasks (such as LSTM or Transformer) or loss functions designed for audio recognition tasks (such as the connectionist temporal classification loss) to map visual feature sequences to predicted character sequences. These methods do not take full advantage of the dynamic characteristics of lip motion, and owing to the complexity and redundancy of the models, they usually require so much training time that they cannot meet the needs of practical applications.

The main contribution of this paper is to use temporal convolution as the basic component of a sequence-to-sequence mapping model, combining different convolution kernel sizes to construct a robust sequence mapping model. In addition, this paper proposes a spatio-temporal information fusion module that reduces the feature dimension while making full use of spatial information. To address the excessive training time of existing models, this paper further proposes a local self-attention mechanism that speeds up training by masking overly long temporal dependencies.

To verify the effectiveness and efficiency of the model, this paper conducts a series of module comparison experiments on large international datasets, including GRID, LRW, LRS2-BBC, and LRS3-TED, and compares the results with a number of current state-of-the-art methods. The experimental results show that on the word-level datasets GRID and LRW the proposed method outperforms previous state-of-the-art methods, achieving word accuracies of 98.3% and 83.7%, respectively. On the sentence-level datasets LRS2-BBC and LRS3-TED, the proposed method achieves comparable results to the state-of-the-art method while using only about half of the training data. In addition, under the same hardware conditions, the proposed method requires far less training time than other methods.
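The core idea of the local self-attention mechanism described above, restricting each time step to attend only to positions within a fixed window so that overly long temporal dependencies are masked out, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes; the function names, window size, and single-head formulation are illustrative and not the paper's actual implementation:

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    # True where position j lies within `window` steps of position i;
    # everything outside the band is masked (long-range dependency removed).
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_self_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                         window: int) -> np.ndarray:
    # Single-head scaled dot-product attention restricted to a local window.
    # q, k, v: arrays of shape (seq_len, dim).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # (seq_len, seq_len)
    mask = local_attention_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)          # mask distant positions
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because each row of the score matrix only has `2 * window + 1` unmasked entries, gradients never propagate through distant time steps, which is the intuition behind the reported training speed-up.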