Research On Lip Reading Recognition Method Based On 3D Convolution And Visual Transformer

Posted on:2024-03-08

Degree:Master

Type:Thesis

Country:China

Candidate:G Q Pu

Full Text:PDF

GTID:2568307085452414

Subject:Computer technology

Abstract/Summary:

Lip reading, which infers the content of speech from the movement of the speaker's lips, has received increasing attention in recent years. It has broad application prospects in human-computer interaction, surveillance and security, audio-visual speech recognition, and other fields. Lip reading algorithms based on deep learning have recently made progress: by building deep models, changes in lip movement can be used more effectively to capture spatio-temporal feature information. However, challenges remain in visual feature extraction, temporal feature extraction, and model lightweighting. How to construct a lip reading model that is accurate, robust, and lightweight is the focus of this thesis.

Specifically, lip reading must process the information in continuous video frames and consider the relationships between both adjacent and distant frames. In addition, lip reading depends mainly on subtle changes of the lips and their surrounding region, so fine-grained features must be extracted from small images. As a result, the performance of lip reading is generally low and research progress has been slow. To improve performance, a lip reading method based on 3D convolution and vision transformer (3DCvT) is proposed. The method combines the vision transformer with 3D convolution to extract spatio-temporal features from continuous frames, exploiting the complementary properties of convolution and the transformer to capture both local and global features. The extracted features are then fed into a BiGRU for sequence modeling. Accuracy and robustness are further improved through data augmentation, label smoothing, and word boundary information. Finally, the proposed model is validated on the large-scale public lip reading datasets LRW and LRW-1000. The experimental results show that the method achieves recognition accuracies of 88.5% and 57.5% on LRW and LRW-1000, respectively, an effective improvement in recognition accuracy.

In addition, to make the model lightweight, this thesis proposes Mini-3DCvT, a lightweight lip reading model based on 3DCvT that performs well in both model acceleration and model compression. The model applies two techniques to the convolution and transformer structures: weight transformation and weight distillation. Specifically, weights are shared across layers and then transformed to increase diversity, while weight distillation transfers the features learned by the complex model to the simple model. The amount of computation and the number of parameters are reduced while accuracy is maintained, which effectively improves the overall performance of the lip reading model.
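Two of the training techniques named above, label smoothing and distillation from a complex model to a simple one, can be illustrated with a minimal sketch. The abstract gives no loss formulas or hyperparameters, so everything here is an assumption: the smoothing factor `eps`, the `temperature`, and all function names are hypothetical. The distillation term is the standard logit-based form (KL divergence between temperature-softened outputs), used as a stand-in; the thesis's weight distillation may operate on other quantities.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(label, num_classes, eps=0.1):
    """Label smoothing: spread eps uniformly over all classes.

    eps=0.1 is an assumed value, not taken from the thesis.
    """
    targets = [eps / num_classes] * num_classes
    targets[label] += 1.0 - eps
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic logit distillation: KL(teacher || student) on softened outputs.

    This is an assumed stand-in, not the thesis's weight distillation.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    logits = [0.2, -1.0, 3.0, 0.5]      # hypothetical scores for 4 word classes
    soft = smoothed_targets(2, 4)       # true class index 2
    print(soft)
    print(cross_entropy(logits, soft))
    print(distillation_loss(logits, [0.0, -0.5, 4.0, 0.1]))
```

With smoothed targets, a very confident prediction still incurs some loss from the small probability mass assigned to the other classes, which discourages over-confidence. The distillation term is zero when student and teacher produce identical logits and grows as their softened distributions diverge.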

Keywords/Search Tags:

Lip reading, 3D convolution, vision transformer, spatio-temporal feature, model lightweight

Related items

1	Convolutional Sequence-to-sequence Based Neural Networks For Lip Reading
2	Video Action Recognition Based On 2D Convolution Network Under Spatio-Temporal Feature Enhancement Mechanism
3	Research On Object Detection Method Based On Key Points And Graph Spatio-temporal Attention Mechanism
4	Human Action Recognition Based On Spatio-temporal Feature
5	Research On Surveillance Video Synopsis Based On Spatio-Temporal Slice
6	The Research And Application Of Spatio-Temporal Database In Road Equipment Management
7	Research On Spatio-Temporal Indexing Mechanism And Querying Strategy
8	Research On Deep Spatio-temporal Model Structure
9	Exploiting Spatio-Temporal Fusion And Perception For Video Object Segmentation
10	The Research And Implementation Of Spatio-Temporal Data Operations And Query Optimization In Spatio-Temporal Database