| With the continuous development of the social economy, the pace of daily life keeps accelerating, and almost everyone faces stress from many sides. Sustained mental stress can harm an individual's physical and mental health. Accurate recognition of the stress state, followed by appropriate relief and intervention, can prevent stress from developing into chronic disease. Improving the accuracy of stress recognition is therefore of great practical significance. Under stress, the body produces a variety of physiological reactions, such as accelerated breathing, increased blood pressure, and sweating. These reactions can be observed through different physiological signals, such as Electrodermal Activity (EDA) and Blood Volume Pulse (BVP). Because they can be measured non-invasively, conveniently, and continuously, EDA and BVP have attracted wide attention. It is difficult to identify the stress state accurately from a single physiological signal. Fusing multimodal physiological signals is a promising way to improve the accuracy of stress recognition. However, the various physiological responses do not occur simultaneously, so the physiological patterns observed through different signals are asynchronous. This asynchrony should be taken into account during multimodal fusion; otherwise, the quality of the fusion suffers. In addition, physiological signals differ between individuals, which limits the generalization of the model. To address these problems, this paper explores the temporal alignment between EDA and BVP signals and combines it with multinomial random sampling for multimodal fusion to improve the generalization performance of the model. The main research work and achievements of this paper are as follows: (1) A fusion model based on the dynamic alignment of multimodal physiological signals is proposed. First, different feature extractors are
constructed according to the characteristics of the BVP and EDA signals to achieve automatic feature extraction in the time domain. Second, cross-modal attention is used to find the alignment between the two modalities and fully integrate cross-modal information, while self-modal attention is used to attenuate noise and redundant information and to highlight important information, yielding salient stress representations. Finally, a predictor processes the stress representation of each modality separately to obtain two predicted labels, and a mean squared error loss is used to narrow the gap between the two modalities. (2) A multimodal stress recognition model based on multinomial random sampling is proposed. Considering that different sensor signals may contribute differently to the fused features, this paper combines the BVP and EDA signals through a multinomial random sampling process. After feature extraction, the cross-modal and self-modal attention modules are used to find the temporal alignment between the two modalities and to highlight the important information within each modality. First, the outputs of the two attention modules are converted by the fully connected layer into two vectors that can be combined, so that the feature vectors of the two modalities have the same size. Second, the two modalities are randomly sampled according to the sampling probability to obtain probability vectors that follow a multinomial distribution. Finally, each feature vector is multiplied by its corresponding probability vector to complete feature selection, and the resulting feature vectors of the two modalities are added to complete the multimodal fusion. (3) Comparative experiments are conducted on two multimodal emotion datasets. Compared with representative neural network models and attention-based multimodal fusion methods, the proposed fusion model based on the dynamic alignment of multimodal
physiological signals achieves better classification performance on the stress recognition task, with a classification accuracy of up to 81.8%. This demonstrates the validity of the temporal alignment investigated in this study. In addition, we visualize the cross-modal attention weights and the latent stress representations, further demonstrating the effectiveness of the cross-modal attention mechanism in aligning multimodal physiological signals and the role of the self-modal attention mechanism in highlighting important information. After multinomial random sampling is introduced for feature selection, the mean classification accuracy improves and its standard deviation decreases, which indicates that multimodal fusion based on multinomial random sampling helps improve the generalization performance of the model. By adjusting the sampling probability, this paper further explores the contribution of different sensor signals to multimodal fusion. We also visualize the latent fusion representations generated by the model; the results show that the proposed model learns more discriminative stress representations and has good interpretability. In conclusion, this paper explores in depth the effectiveness of temporally aligning multimodal physiological signals. By combining the cross-modal and self-modal attention mechanisms, the BVP and EDA signals are temporally aligned, and more discriminative and salient stress representations are obtained, improving the accuracy of stress recognition. On this basis, multinomial random sampling is used for feature selection and fusion to further improve the generalization performance of the classification model, which provides a new approach to stress recognition modeling based on multimodal physiological signals. |
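
The multinomial random sampling fusion summarized in contribution (2) can be illustrated with a minimal NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes that, for each feature dimension, a single multinomial draw (one trial) selects which modality supplies that dimension, so the two "probability vectors" are complementary 0/1 masks. The function name `multinomial_fusion`, the parameter `p_bvp`, and the feature size are hypothetical, and the feature vectors stand in for the same-size outputs of the fully connected layer.

```python
import numpy as np


def multinomial_fusion(feat_bvp, feat_eda, p_bvp=0.5, rng=None):
    """Fuse two same-size feature vectors by per-dimension multinomial sampling.

    Assumption (not confirmed by the abstract): each feature dimension is
    assigned to exactly one modality, BVP with probability p_bvp and EDA
    with probability 1 - p_bvp.
    """
    rng = np.random.default_rng() if rng is None else rng
    feat_bvp = np.asarray(feat_bvp, dtype=float)
    feat_eda = np.asarray(feat_eda, dtype=float)
    if feat_bvp.shape != feat_eda.shape:
        raise ValueError("both modalities must have the same feature size")

    dim = feat_bvp.shape[-1]
    # One single-trial multinomial draw per feature dimension.
    # picks has shape (dim, 2) and every row is one-hot.
    picks = rng.multinomial(1, [p_bvp, 1.0 - p_bvp], size=dim)
    mask_bvp = picks[:, 0]
    mask_eda = picks[:, 1]

    # Feature selection (element-wise product) followed by additive fusion.
    fused = feat_bvp * mask_bvp + feat_eda * mask_eda
    return fused, mask_bvp, mask_eda


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bvp = rng.normal(size=16)  # placeholder BVP features
    eda = rng.normal(size=16)  # placeholder EDA features
    fused, mask_bvp, mask_eda = multinomial_fusion(bvp, eda, p_bvp=0.6, rng=rng)
    print(fused.shape, int(mask_bvp.sum()), int(mask_eda.sum()))
```

Under this reading, raising `p_bvp` lets the BVP features occupy more dimensions of the fused vector, which is one way the sampling probability could be adjusted to probe each sensor's contribution.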