| In recent years, advances in deep learning and improved algorithm stability have driven the rapid growth and wide adoption of artificial intelligence, especially in natural language processing and image recognition. Lip recognition, an application of image processing, analyzes the movement of the lips during speech in order to recognize what is being said. It has strong prospects both in technical fields such as visual control and human-computer interaction and in medical treatment and health care. Traditional lip recognition research comprises three stages: detecting the mouth region and localizing the lip region; extracting key features from the localized lip region; and finally performing the recognition itself. However, traditional lip recognition requires complex image pre-processing, its classifiers are difficult to train, and manual feature design demands experience and time, so progress has been slow. In contrast, deep learning can learn important and abstract features layer by layer directly from the raw data, enabling truly end-to-end recognition. Conventional convolutional neural networks, however, are weak at capturing subtle lip motion changes and at learning features from convolutional kernels, so this paper employs an attention mechanism to enhance the extraction of lip motion features. Furthermore, few open-source Mandarin Chinese lip datasets are currently available, and most methods that achieve good lip recognition results are based on English datasets. Therefore, in this paper, we build a Chinese dataset 
and propose a combined two-dimensional and three-dimensional network to recognize lip video sequences of articulators, using a modified C3D network and a Bi-GRU network that incorporates a Multi-Head Self-Attention mechanism. We also construct a lip-synthesis recognition system. The overall study can be divided into the following parts. Dataset creation and pre-processing: we use a mainstream algorithm with advantages in stability and speed to extract frames from the video. The face keypoints are then segmented using the face keypoint detection and localization algorithm in the Dlib library to obtain the lip region, which is the subject of this paper. Feature extraction from the lip sequence: whereas each convolutional layer of an ordinary 2D convolutional neural network loses the temporal information of its input features, the C3D network preserves it. To reduce overfitting, we improve the C3D network by adding a Dropout module. Since the video frames are continuous, they also carry temporal characteristics between frames. Therefore, on top of the C3D network, this paper adds a recurrent neural network structure to capture the temporal features between sequences. To overcome vanishing and exploding gradients on long sequences and the traditional RNN's failure to capture important information, we improve the Bi-GRU by fusing a multi-head attention mechanism into it and use it to obtain the temporal features between lip motion sequences. In this paper, a CNN and an RNN are used to obtain lip motion data at both the spatial and temporal levels, and we evaluate and validate the speed and effectiveness of our model on a self-made Chinese lip-synthesis dataset. Experiments show that the proposed model is stable and effective when applied to lip recognition systems. |
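
The pre-processing step described above, obtaining the lip region from detected face keypoints, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the 68-point landmark layout commonly used with Dlib, in which indices 48 to 67 outline the mouth; the abstract does not say which landmark model is used. The function names `lip_bounding_box` and `crop`, the `margin` parameter, and the representation of a frame as a nested list of pixel values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]           # (x, y) pixel coordinates of one landmark
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive

# Assumption: 68-point landmark layout, where indices 48..67 outline the mouth.
# The source does not specify the landmark model.
MOUTH_START, MOUTH_END = 48, 68


def lip_bounding_box(landmarks: Sequence[Point],
                     frame_size: Tuple[int, int],
                     margin: float = 0.2) -> Box:
    """Return a padded box around the mouth landmarks, clamped to the frame.

    frame_size is (width, height); margin is the padding added on each side
    as a fraction of the mouth width/height (a hypothetical default).
    """
    if len(landmarks) < MOUTH_END:
        raise ValueError("expected at least 68 landmarks")
    mouth = landmarks[MOUTH_START:MOUTH_END]
    xs = [x for x, _ in mouth]
    ys = [y for _, y in mouth]
    # Exclusive right/bottom bounds so the outermost landmark pixel is kept.
    left, right = min(xs), max(xs) + 1
    top, bottom = min(ys), max(ys) + 1
    pad_x = int((right - left) * margin)
    pad_y = int((bottom - top) * margin)
    width, height = frame_size
    return (max(0, left - pad_x), max(0, top - pad_y),
            min(width, right + pad_x), min(height, bottom + pad_y))


def crop(frame: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the box out of a frame stored as rows of pixel values."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]
```

In practice the landmark list would come from a face landmark predictor such as the one in Dlib, applied to each extracted video frame, and the cropped lip regions of consecutive frames would be stacked into the sequence fed to the network.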