
Video-based Sign Language Recognition And Translation With Information Augmentation Learning

Posted on: 2023-06-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhou
Full Text: PDF
GTID: 1528306905481394
Subject: Information and Communication Engineering

Abstract/Summary:
As a visual language, sign language (SL) conveys meaning through hand-shape variation, hand movement, facial expression, and body posture, and has its own lexical system and grammatical rules. The goal of video-based SL recognition and translation is to build an automatic translation system that converts the continuous signs in a video into spoken language understandable to the hearing, thereby facilitating communication between the deaf and the hearing. Previous research mostly focuses on SL recognition, which aims to recognize sign glosses from successive sign gestures. However, the grammatical rules of SL differ considerably from those of spoken language; SL translation goes further and generates natural spoken-language expressions, which better suits practical needs.

Supervised learning usually assumes complete annotations, discriminative features, and sufficient training samples. Video-based SL recognition and translation struggle to meet these assumptions and face several practical problems. First, the granularity of SL annotation is insufficient: no frame-level fine annotations are provided, which hinders model optimization. Second, SL expression involves the collaboration of multiple body parts, so a single visual cue is not enough for feature description. Third, the difficulty of acquiring bilingual "sign-text" corpora limits the performance of SL translation models. To address these problems, this thesis designs corresponding information-augmentation methods and learning strategies for annotation, visual, and corpus information, improving the performance of video-based SL recognition and translation. The details are as follows:

(1) This thesis proposes a fine-annotation augmentation learning method based on dynamic pseudo-label decoding. In continuous SL recognition, only the ordered sign-gloss sequence is annotated, without detailed action boundaries. If the network uses only sequence-level labels for end-to-end
optimization, its performance is limited. This thesis therefore alternates fine pseudo-label estimation with clip-level fine-tuning in an iterative optimization scheme, which effectively improves performance. In this process, the reliability of the fine pseudo labels is crucial to the improvement after each iteration. In particular, this thesis designs a dynamic pseudo-label decoding method that introduces the idea of dynamic programming to efficiently estimate reliable frame-level or clip-level pseudo-label sequences. It effectively filters out pseudo labels with wrong semantics and ensures that the decoded pseudo-label sequence is consistent with the word order of the sign video. Experiments show that this method effectively improves the convergence efficiency and recognition performance of the network under the iterative optimization strategy.

(2) This thesis proposes a visual-cue augmentation learning method based on a spatial-temporal multi-cue network. SL expression relies on the cooperation of body parts such as the hands, face, and body, accompanied by the variation and switching of different visual cues during transitions. From the perspective of multi-cue augmentation, this thesis designs a spatial multi-cue (SMC) module, a temporal multi-cue (TMC) module, a segmented attention (SA) mechanism, and a multi-cue training strategy. The SMC module introduces a pose-estimation branch to explicitly decompose the features of the various visual cues. The TMC module designs intra-cue and inter-cue paths to explore the uniqueness and complementarity of multi-cue information. The SA mechanism enables the network to dynamically assign attention weights to the cues during the translation stage. Experiments show that the method can effectively exploit multi-cue information and achieves significant performance superiority in both SL recognition and translation.

(3) This thesis proposes a bilingual-data augmentation method using
large-scale monolingual data. High-quality "sign video-Chinese text" bilingual data requires manual recording and annotation, so its scale is limited. In contrast, a large amount of Chinese monolingual data can be collected from the Internet. For this situation, this thesis proposes an augmentation method that generates SL bilingual data from monolingual data. To realize cross-modal conversion from text to SL, it designs a two-stage SL back-translation method composed of "text-to-gloss" and "gloss-to-sign feature" stages. Chinese text is first converted into its gloss sequence by a "text-to-gloss" translator; the sign-feature sequence is then obtained by concatenation from a pre-extracted "gloss-to-sign" feature bank. Finally, the synthetic data and the real data are jointly used to train SL translation models. To verify its effectiveness, this thesis produces an available Chinese SL video dataset with gloss-sequence and spoken-language annotations, and collects a corresponding monolingual corpus. Experiments show that large-scale monolingual data can effectively improve the translation quality of SL translation models.
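To illustrate the dynamic pseudo-label decoding idea of method (1), the following is a minimal sketch, not the thesis implementation: a Viterbi-style dynamic program that assigns each frame to one gloss of the annotated sequence, in order, maximizing total log-probability. Function and variable names are illustrative assumptions.

```python
import numpy as np

def align_pseudo_labels(log_probs, gloss_ids):
    """Forced alignment by dynamic programming (illustrative sketch).

    log_probs: (T, V) frame-level gloss log-posteriors from the network.
    gloss_ids: ordered sentence-level gloss annotation, length N <= T.
    Returns a length-T list of frame-level pseudo labels whose order is
    guaranteed to match the annotated gloss sequence.
    """
    T, _ = log_probs.shape
    N = len(gloss_ids)
    NEG = -np.inf
    # dp[t, n]: best score with frame t assigned to gloss position n.
    dp = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=int)  # 0 = stay on gloss, 1 = advance
    dp[0, 0] = log_probs[0, gloss_ids[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):  # monotonic: gloss n needs >= n+1 frames
            stay = dp[t - 1, n]
            move = dp[t - 1, n - 1] if n > 0 else NEG
            if move > stay:
                dp[t, n], back[t, n] = move + log_probs[t, gloss_ids[n]], 1
            else:
                dp[t, n], back[t, n] = stay + log_probs[t, gloss_ids[n]], 0
    # Backtrace; the last frame must sit on the last gloss.
    labels = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = gloss_ids[n]
        n -= back[t, n]
    return labels
```

Because each frame's label is chosen along a monotonic path through the annotated gloss order, the decoded pseudo-label sequence can never contradict the sentence-level annotation, which is the consistency property the thesis relies on.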
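The segmented attention mechanism of method (2) can be pictured as a softmax-weighted fusion of per-cue features. The sketch below uses hypothetical shapes and a plain dot-product scoring rule for illustration; the actual SMC/TMC modules and SA mechanism in the thesis are learned networks.

```python
import numpy as np

def fuse_cues(cue_features, query):
    """Toy attention-weighted fusion over visual cues (illustrative).

    cue_features: dict mapping cue name ("hand", "face", ...) to a (D,)
                  feature vector, e.g. from per-cue encoder branches.
    query: (D,) decoder state that scores how relevant each cue is now.
    Returns the fused (D,) vector and the per-cue attention weights.
    """
    names = list(cue_features)
    feats = np.stack([cue_features[n] for n in names])   # (C, D)
    scores = feats @ query / np.sqrt(len(query))         # scaled dot product
    scores -= scores.max()                               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over cues
    fused = weights @ feats                              # (D,) weighted sum
    return fused, dict(zip(names, weights))
```

Dynamically recomputing the weights at each decoding step lets the model lean on the hands during manual signs and on the face during non-manual markers, which is the intuition behind assigning attention per cue.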
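The two-stage back-translation pipeline of method (3) can be sketched end to end as follows. This is a minimal illustration with hypothetical names: `text_to_gloss` stands in for the learned "text-to-gloss" translator (here a naive word lookup), and the feature bank maps each gloss to a pre-extracted clip-feature array.

```python
import numpy as np

def text_to_gloss(text, lexicon):
    # Placeholder for the learned "text-to-gloss" translator:
    # a naive word-by-word lookup for illustration only.
    return [lexicon[w] for w in text.split() if w in lexicon]

def gloss_to_features(glosses, feature_bank):
    # "Gloss-to-sign feature": concatenate per-gloss clip features
    # (each of shape (T_g, D)) in gloss order along the time axis.
    return np.concatenate([feature_bank[g] for g in glosses], axis=0)

def synthesize_pair(text, lexicon, feature_bank):
    # One synthetic "sign feature - text" training pair from
    # monolingual text, to be mixed with real bilingual data.
    glosses = text_to_gloss(text, lexicon)
    return gloss_to_features(glosses, feature_bank), text
```

Each synthetic pair couples a pseudo sign-feature sequence with its source sentence as the translation target, so large monolingual corpora can supplement the scarce manually recorded bilingual data during training.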
Keywords/Search Tags: Sign Language Recognition, Sign Language Translation, Information Augmentation Learning, Pseudo Label, Multi-Cue Information, Monolingual Data