As a kind of visual language, sign language serves as the primary communication tool of the deaf community. It conveys meaning mainly through manual features, including hand shapes, orientations and movements, assisted by non-manual features such as facial expressions and lip motions. To build barrier-free communication between hearing and deaf people, intelligent sign language understanding (SLU) has emerged. It is a multidisciplinary research topic that aims to convert sign language videos into the corresponding text, or to perform the reverse conversion, thus creating an interactive closed loop. The former corresponds to sign language recognition (SLR), while the latter is sign language production (SLP). Despite the promising progress of deep-learning-based sign language understanding, significant challenges remain. Since sign data annotation is time-consuming and requires expert involvement, the scale of currently annotated sign data is limited. Under the data-driven learning paradigm, existing methods therefore often suffer from overfitting and limited interpretability. To address these issues, this thesis incorporates the hand prior and explores the combined potential of data-driven and model-driven methods, in pursuit of stronger visual representations and better performance on downstream tasks. Specifically, this thesis investigates the following four aspects.

Firstly, this thesis proposes a hand-model-aware framework for isolated SLR. It consists of three components: a visual encoder, a hand-model-aware decoder and an inference module. It leverages the hand model for better optimization and uses the modeled hand as an intermediate representation, thereby guiding the framework to learn discriminative features. During training, additional loss functions are imposed on the intermediate representation to constrain its spatial and temporal consistency. Extensive experiments demonstrate that the proposed framework achieved state-of-the-art performance at the time of publication.

Secondly, this thesis leverages both the hand prior and unlabeled sign language data to build the first hand-prior-aware self-supervised pre-training framework for sign language, which can be applied to a wider range of SLR subtasks. Pre-training is conducted via masking and reconstruction: centered on hand pose, this work designs several masked modeling strategies and introduces the hand prior as a regularizer in the decoding stage. These techniques help the framework learn hierarchical context in the sign language domain. For downstream tasks, task-specific prediction heads are designed and fine-tuned together with the pre-trained encoder. Extensive experiments show that the proposed framework not only broadens task applicability but also achieves state-of-the-art performance on three SLR tasks with notable gains.
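The abstract describes the masking-reconstruction objective only at a high level. As a rough illustration of the general idea (not the thesis' actual masking strategies or hand-prior regularizers), one can mask a subset of hand joints in a frame and train a transformer encoder to reconstruct them; all class and parameter names below are hypothetical.

```python
# Minimal sketch of masked hand-pose modeling (illustrative only).
import torch
import torch.nn as nn


class MaskedPoseReconstructor(nn.Module):
    """Mask a fraction of hand joints per frame and reconstruct their coordinates."""

    def __init__(self, num_joints=21, dim=2, hidden=256, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(dim, hidden)                    # per-joint coordinate embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.joint_pos = nn.Parameter(torch.zeros(1, num_joints, hidden))
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(hidden, dim)                     # reconstruct joint coordinates

    def forward(self, joints, mask_ratio=0.4):
        # joints: (batch, num_joints, dim) 2D keypoints of one frame
        b, j, _ = joints.shape
        tokens = self.embed(joints) + self.joint_pos
        # randomly replace a subset of joint tokens with the learnable mask token
        mask = torch.rand(b, j, device=joints.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, j, -1), tokens)
        recon = self.head(self.encoder(tokens))
        # reconstruction loss is computed only on the masked joints
        loss = ((recon - joints) ** 2).mean(-1)[mask].mean()
        return loss, recon
```

In the actual framework, such an encoder would be pre-trained on unlabeled sign data and then fine-tuned with task-specific heads; the sketch omits temporal context and the hand-prior regularization for simplicity.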
Thirdly, this thesis proposes a gesture-to-gesture translation framework that incorporates hand topology. As a key technology in SLP, gesture-to-gesture translation requires fine-grained understanding of hand structure. In response to the insufficient representation capability of the sparse 2D keypoints used in existing works, this work proposes a hand-topology-aware framework for this task. The framework uses the model-aware hand mesh as the gesture-state representation and leverages the topology inherent in the hand model to strengthen its understanding of hand structure. Specifically, the framework unfolds the surface of the hand model into a topology space, assigns fine-grained position embeddings aligned with the target image plane in this space, and thereby builds a topology map. It then employs a spatially-adaptive scheme and an attention mechanism to exploit the information in the topology map for better generation. Experiments validate the effectiveness of the proposed method, which achieves the best performance on this task.

Fourthly, this thesis proposes a hand-prior-based framework for interacting-hand generation. Hand interactions occur frequently in sign language, and compared with the single-hand case, generating images of interacting hands is considerably more complex; the difficulty mainly arises from occlusions caused by complex spatial configurations. For this new task, this work builds baselines from the related single-hand gesture-to-gesture translation task and establishes an evaluation protocol covering multiple perspectives, including image quality and hand-structure preservation. To tackle the challenges posed by this task, this work proposes a model-based, occlusion-aware framework. By incorporating the hand prior, the proposed framework can effectively handle complex occlusions between hands. Extensive experiments demonstrate that the proposed framework outperforms the baseline methods.
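The abstract does not detail how occlusion awareness is realized. As a generic, hypothetical sketch of one model-based way to expose occlusion structure to a generator, assuming per-hand depth maps rendered from the hand model, one can derive a per-pixel ordering label and an overlap mask; the function and argument names below are illustrative only.

```python
# Illustrative sketch: per-pixel hand ordering and occlusion mask from depth maps.
import torch


def occlusion_layout(depth_left, depth_right, background=float("inf")):
    """depth_*: (H, W) rendered depth of each hand, `background` where the hand is absent.

    Returns a label map (0 = background, 1 = left hand on top, 2 = right hand on top)
    and a mask of pixels where the two hands overlap.
    """
    left_visible = depth_left < background
    right_visible = depth_right < background
    overlap = left_visible & right_visible                 # both hands project here
    left_on_top = depth_left <= depth_right                # nearer surface wins
    label = torch.zeros_like(depth_left, dtype=torch.long)
    label[left_visible & (~right_visible | left_on_top)] = 1
    label[right_visible & (~left_visible | ~left_on_top)] = 2
    return label, overlap
```

Such a layout could serve as an extra conditioning input so that the generator knows which hand is visible at each pixel; the thesis' actual mechanism may differ.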