| With the rise of deep learning, multimodal approaches have become a major trend, and many multimodal audio-visual recognition algorithms have emerged. These algorithms combine lip reading on the video stream with speech recognition on the audio stream. Lip reading is the task of identifying the speech content of a video from visual information alone, inferring what the speaker says from his or her mouth movements. At present, lip reading has great application value in information security, health care, public safety, and other fields. However, because lip movements are diverse and differ between speakers, and because of the large number of homophones and easily confused phonemes, lip reading remains a great challenge. Moreover, lip reading research in China started relatively late, and work on the country's minority languages is very rare. The main research contents of this thesis are as follows. (1) This thesis presents TLRW100, the first Tibetan lip reading dataset that includes recordings from natural environments. Videos were selected from Kangba TV, the third largest Tibetan-language TV channel in China, and processed with face detection, landmark localization, and lip-region cropping to form the required continuous lip image sequences. The image samples were then augmented and combined with TLRW50 to form the TLRW100 Tibetan lip reading dataset, and the data was evaluated. The dataset contains 100 common Tibetan words, totaling more than 60,000 video samples. TLRW100 provides a solid foundation for Tibetan lip reading in natural environments. (2) This thesis adopts a Tibetan lip reading method based on the temporal convolutional network (TCN). ResNet18 combined with a 3D CNN serves as the front end of the model for feature extraction, and MS-TCN serves as the back end for classification. The front end is then replaced with the lightweight ShuffleNet V2 network and the back end with DS-TCN. The overall model framework is unchanged, which preserves recognition performance while speeding up computation and reducing computational cost. Tibetan lip reading is realized in both experimental and natural environments. The model achieves accuracies of 44.7% and 43.6% on TLRW100 and TLRW50, respectively. (3) This thesis studies multimodal Tibetan speech recognition with lip reading in noisy environments: the speech extracted from the video dataset is denoised and then used in the multimodal recognition task. The experimental results show that the accuracy of lip-reading-based multimodal Tibetan recognition can be effectively improved in high-noise environments, which has important research significance. |
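
The cost reduction claimed for the lightweight back end can be illustrated with a small weight count. The sketch below assumes that "DS" in DS-TCN stands for depthwise-separable convolution, which the abstract does not spell out. It compares the number of weights in a standard 1D convolution with a depthwise-separable one. The layer sizes (`C_IN`, `C_OUT`, `K`) are hypothetical and are not taken from the thesis.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def ds_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise-separable 1D convolution (bias ignored)."""
    depthwise = c_in * kernel_size  # one k-tap filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
C_IN, C_OUT, K = 512, 512, 3

standard = conv1d_params(C_IN, C_OUT, K)
separable = ds_conv1d_params(C_IN, C_OUT, K)
ratio = separable / standard

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.3f}")
```

With these assumed sizes the separable layer needs roughly a third of the weights of the standard layer. The saving grows with kernel size, because the kernel is applied per channel instead of per channel pair.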