| Sign language is a bridge between deaf and hearing people. To address the communication barrier between them, Sign Language Translation (SLT) has been studied by many scholars. Automatic SLT is an interdisciplinary subject involving computer vision, pattern recognition, natural language processing, and other fields. In recent years, with the development of deep learning, more and more scholars have applied deep learning to sign language translation. To maintain high translation accuracy, recent research combines visual information with motion information captured by sensors. However, excessive reliance on sensors hinders the further expansion of datasets. In this paper, the two most important tasks in sign language translation, isolated word translation and continuous sentence translation, are studied using RGB sign language videos and deep learning methods. Isolated word translation is the basis of sentence translation: only by ensuring high accuracy on isolated words can the word error rate of sentence translation be effectively reduced. In isolated word translation, the model cannot fully focus on the key part of the video sequence because the video of each isolated word is too long, so irrelevant hand gestures degrade translation accuracy. In continuous sentence translation, the model must not only extract more fine-grained visual features but also process longer video sequences than those of isolated words. To solve these problems, this paper presents a deep learning approach that combines attention mechanisms with joint learning. To address the many irrelevant frames in the long video sequences of isolated word translation, this paper proposes a Convolutional Recurrent Neural Network with Global Attention (Global-Attn-CRNN). In this model, the global attention mechanism is embedded in the Long Short-Term Memory (LSTM) network, and the alignment vector is obtained by calculating the similarity between the current hidden state and the source hidden states. The model then learns from the alignment weights to attend to the key frames in a long sign language video sequence, which improves translation accuracy. Experiments on the DEVISIGN dataset show that the accuracy of this model is higher than that of other mainstream models. On the 100-category datasets of short and long sign language words, accuracy improves by 0.87% and 1.60%, respectively, over the same model without the attention mechanism, showing that the attention mechanism effectively improves translation accuracy. Translating continuous sign language sentences requires fine-grained visual features, and the input and output sequences are difficult to align. For this task, this paper proposes a joint CTC and local attention Seq2Seq model (CTC-Attn-Seq2Seq). On the encoder side, a Bi-directional Long Short-Term Memory (Bi-LSTM) network is shared by the Connectionist Temporal Classification (CTC) branch and the local attention branch to achieve joint learning. The feature extraction part uses a 3D Residual Network (3D ResNet) architecture to extract spatio-temporal features, with a Convolutional Block Attention Module (CBAM) added to focus on key regions during feature extraction. On the decoder side, sentence translation involves a large amount of unaligned information because the sequences are very long, and a traditional attention module with few constraints is difficult to train in the early stages. Therefore, a CTC decoder based on the attention mechanism is designed: CTC helps the attention module learn by constraining it toward monotonic alignment. The model reduces the word error rate to 9.7% on the Continuous SLR dataset, which is lower than that of other models. In addition, the dataset is divided into two parts according to whether the sentences in the test set appear in the training set, and the effectiveness of joint learning is verified by experiments on both parts. |
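
The global attention step summarized above (scoring every source hidden state against the current hidden state, then normalizing the scores into alignment weights) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract only says "similarity", so the dot-product score used here is an assumption, and the function name `global_attention` and the toy vectors are hypothetical.

```python
import math


def global_attention(current_state, source_states):
    """Illustrative global attention over all source hidden states.

    Assumption: similarity is a plain dot product; the abstract does not
    specify which score function the model uses.
    """
    # Score each source hidden state against the current hidden state.
    scores = [
        sum(c * s for c, s in zip(current_state, state))
        for state in source_states
    ]
    # Softmax (shifted by the max score for numerical stability)
    # turns the scores into alignment weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: alignment-weighted sum of the source hidden states.
    dim = len(source_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, source_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three frame-level hidden states (hypothetical values).
# The second frame is most similar to the current state.
frames = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
weights, context = global_attention(query, frames)
```

In this toy run the second frame receives the largest alignment weight, which is the behavior the abstract relies on: frames that match the current state dominate the context vector, while irrelevant frames contribute little.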