| As a typical visual language, sign language is usually expressed through the cooperation of two parts, the hands and the face, and it places particular emphasis on the continuous change of gestures. Continuous sign language recognition transcribes an input sign language video into an ordered sequence of sign language vocabulary annotations. The extraction of visual features, which represent the sign language elements and the process of gesture change, is the core of current mainstream sign language recognition schemes. Therefore, how to model visual features effectively when sign language videos are only weakly annotated is a problem worth studying. Based on graph structures, this paper analyzes and models visual features from both the spatial and the temporal perspective. The specific contributions are as follows: 1. In continuous sign language recognition, it is difficult to model facial features and hand features, and it is difficult to fuse the two adaptively. This paper proposes a method, Continuous Sign Language Recognition Based on Spatial Relationship Graph Structure and Graph Attention Network. Specifically, a spatial relationship graph is constructed over the signer’s face, left hand, and right hand, and a graph attention mechanism dynamically assigns weights to the features of these elements and fuses them, yielding more discriminative spatial visual features. Experiments demonstrate the effectiveness of the proposed method: in comparison with the baseline method, the word error rate (WER) on the German PHOENIX2014 and PHOENIX2014 T datasets is 25.1% and 22.0%, respectively. 2. To address the difficulty of modeling the temporal features that correspond to gesture changes in continuous sign language recognition, this paper proposes a method, Continuous Sign Language Recognition Based on Multi-Level Temporal Relationship Graph Structure and Graph Convolution. Specifically, video frames are sampled at short time intervals to construct
a low-level temporal relationship graph, and video frames are sampled at long intervals to construct a high-level temporal relationship graph. Graph convolution is then used to fuse the relationship graphs of different levels and to update the node features of the graphs, which contain temporal difference information. The model thereby also gains a better ability to fuse short-term and long-term features. Experiments demonstrate the effectiveness of the proposed method: in comparison with the baseline method, the word error rate (WER) on the German PHOENIX2014 and PHOENIX2014 T datasets is 25.6% and 22.1%, respectively. |
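
The spatial fusion described in the first contribution can be illustrated with a small single-head graph attention layer over three nodes (face, left hand, right hand). This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the graph connectivity, feature dimensions, or readout, so the fully connected graph with self-loops, the random untrained weights `W` and `a`, the toy dimensions, and the mean-pooling step are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention_layer(features, neighbours, W, a):
    """Single-head graph attention.

    features:   {node: input feature vector}
    neighbours: {node: list of nodes it attends to}
    Returns (updated node features, attention weights per node).
    """
    projected = {n: matvec(W, h) for n, h in features.items()}
    updated, attention = {}, {}
    for i in features:
        # Score each neighbour j from the concatenation [W h_i || W h_j].
        scores = [
            leaky_relu(sum(w * v for w, v in zip(a, projected[i] + projected[j])))
            for j in neighbours[i]
        ]
        alphas = softmax(scores)
        attention[i] = dict(zip(neighbours[i], alphas))
        # New feature of i: attention-weighted sum of projected neighbours.
        updated[i] = [
            sum(al * projected[j][d] for al, j in zip(alphas, neighbours[i]))
            for d in range(len(projected[i]))
        ]
    return updated, attention


# Hypothetical toy setup: random features stand in for CNN region features.
random.seed(0)
IN_DIM, OUT_DIM = 4, 3
regions = ["face", "left_hand", "right_hand"]
features = {r: [random.uniform(-1, 1) for _ in range(IN_DIM)] for r in regions}
neighbours = {r: list(regions) for r in regions}  # assumed: fully connected
W = [[random.uniform(-0.5, 0.5) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
a = [random.uniform(-0.5, 0.5) for _ in range(2 * OUT_DIM)]

updated, attention = graph_attention_layer(features, neighbours, W, a)
# Assumed readout: average the three updated node features into one vector.
fused = [sum(updated[r][d] for r in regions) / len(regions) for d in range(OUT_DIM)]
print(attention["face"])
print(fused)
```

Each row of `attention` sums to one, so the layer redistributes a fixed budget of weight among the face and the two hands for every node. In a trained model `W` and `a` would be learned, which is what would make the weighting adaptive.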
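
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the predicted and the reference annotation sequence, divided by the reference length. The sketch below assumes that standard definition, since the abstract does not describe the evaluation script actually used. The function name `word_error_rate` and the sample sequences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; it starts as the distance from an empty
    # reference, i.e. j insertions.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical annotation sequences, for illustration only.
reference = ["MORGEN", "REGEN", "NORD", "WIND"]
hypothesis = ["MORGEN", "SONNE", "NORD"]
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")  # one substitution + one deletion over 4 words: 50.0%
```

Because insertions are counted as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Lower values are better.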