Sign language is the most important means by which hearing-impaired people communicate with the outside world, conveying both information and emotion within the deaf community. Deep learning is now widely applied across many fields, including sign language: sign language recognition and facial expression recognition, for example, are both well developed. Until now, however, research on sign language recognition and on expression recognition has largely been carried out separately, and data for studying the two jointly is scarce. When sign language users communicate with hearing people in daily life, and in anthropomorphic human-computer interaction, semantic misunderstandings arise easily because emotional expression is missing from the recognized signs.

To address the shortage of data combining sign language with facial expression, this thesis collects a large number of images of sign language paired with expressions and annotates them as the RL-SS and RL-SS2 datasets. To address the obstacles sign language users face in having their meaning recognized correctly, the thesis then proposes two sign language recognition network models that incorporate expression analysis.

First, a YOLOX-BS network model is constructed by improving YOLOv5. Three improvements are made: (1) the C2f module, which carries richer gradient-flow information, replaces the original C3 module in the backbone; (2) a CPM module is inserted before the Detect module to fully aggregate contextual features between classes; and (3) a semantic aggregation layer based on dilated convolution aggregates feature information across channels and spatial positions.

Second, an improved YOLOX-JC network model, also based on YOLOv5, is constructed to detect faces and gestures. Because downsampling with max-pooling layers discards information, hurting both recognition accuracy and computation speed, the detection model adopts a feature-reconstruction method built on the idea of trading spatial resolution for channel depth (space-to-depth).

Finally, a lightweight MobileViT-SB network model based on MobileViT is constructed to classify the detection results, achieving sign language recognition combined with expression analysis. This network is improved in two respects: first, early fusion by element-wise addition (ADD) combines multiple image features so that their strengths complement one another; second, because downsampling loses fine detail and the effective receptive field is insufficient, pyramid convolution is introduced by replacing the Conv2d in the MV2 Block with PyConv2d.

Experimental results show that the YOLOX-BS network model reaches an mAP of 99.40% and the YOLOX-JC + MobileViT-SB pipeline reaches an mAP of 99.22%; both recognition methods show good robustness and recognition accuracy.
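The abstract does not specify how the semantic aggregation layer based on dilated convolution is built. Below is a minimal PyTorch sketch of one common construction: parallel dilated 3x3 branches fused across channels by a 1x1 convolution, in the spirit of ASPP-style context modules. The class name, dilation rates, and fusion scheme are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class DilatedAggregation(nn.Module):
    """Aggregate spatial context with parallel dilated 3x3 convolutions
    at several rates, then fuse the branches across channels with a
    1x1 convolution. Padding equals the dilation rate, so the spatial
    size is preserved."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        ])
        # 1x1 conv mixes the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(len(rates) * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Usage: shape is unchanged, but each output position has seen a
# larger, multi-scale neighborhood of the input.
feat = torch.randn(1, 128, 40, 40)
print(DilatedAggregation(128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```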
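The feature-reconstruction idea of replacing spatial dimensions with depth can be sketched as follows, assuming an SPD-style rearrangement in which every 2x2 spatial block becomes four channels, followed by a stride-1 convolution. Module and parameter names here are hypothetical; the point is that, unlike max pooling, no activations are discarded during downsampling.

```python
import torch
import torch.nn as nn

class SpaceToDepthDown(nn.Module):
    """Downsample by rearranging 2x2 spatial blocks into channels, then
    mix channels with a stride-1 convolution. All input values survive
    the rearrangement, so fine detail is preserved."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # After space-to-depth, the channel count grows 4x (2*2 blocks).
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gather even/odd rows and columns and stack them on channels:
        # (B, C, H, W) -> (B, 4C, H/2, W/2)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))

# Usage: a drop-in replacement for a MaxPool2d(2)-based downsampling step.
feat = torch.randn(1, 64, 80, 80)
print(SpaceToDepthDown(64, 128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```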
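Early fusion by element-wise addition (ADD), as used in the MobileViT-SB improvement, amounts to summing feature maps before further processing; the snippet below is a toy illustration, with the 1x1 projection and the "two views" purely assumed for the example.

```python
import torch
import torch.nn as nn

# ADD fusion keeps the channel count fixed (concatenation would grow it),
# but it requires the two feature maps to have identical shapes; a 1x1
# convolution can first project a mismatched branch, as sketched here.
proj = nn.Conv2d(16, 32, kernel_size=1)    # align channel counts
feat_a = torch.randn(1, 32, 56, 56)        # features from one image view
feat_b = proj(torch.randn(1, 16, 56, 56))  # features from a second view
fused = feat_a + feat_b                    # element-wise ADD: (1, 32, 56, 56)
```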
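PyConv2d refers to pyramidal convolution (Duta et al., 2020), which splits the output channels across several kernel sizes and assigns larger group counts to the larger kernels so the parameter cost stays close to a plain 3x3 convolution. A minimal sketch of the standard formulation follows; the exact kernel and group settings used inside the thesis's MV2 Block may differ.

```python
import torch
import torch.nn as nn

class PyConv2d(nn.Module):
    """Pyramidal convolution: split the output channels across several
    kernel sizes, with larger groups for larger kernels. Each level sees
    the full input but with a different receptive field."""
    def __init__(self, in_ch: int, out_ch: int,
                 kernels=(3, 5, 7), groups=(1, 4, 8)):
        super().__init__()
        assert out_ch % len(kernels) == 0, "out_ch must split evenly"
        split = out_ch // len(kernels)
        self.levels = nn.ModuleList([
            nn.Conv2d(in_ch, split, kernel_size=k, padding=k // 2,
                      groups=g, bias=False)
            for k, g in zip(kernels, groups)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the per-kernel outputs back into one feature map.
        return torch.cat([level(x) for level in self.levels], dim=1)

# Usage: spatial size is preserved; multi-scale context replaces the
# single receptive field of a plain Conv2d.
feat = torch.randn(1, 64, 32, 32)
print(PyConv2d(64, 96)(feat).shape)  # torch.Size([1, 96, 32, 32])
```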