Affective computing aims to create computing systems that sense, recognize, and understand human affect and respond to it intelligently, sensitively, and naturally. Affective computing is a fundamental technology and an important prerequisite for naturalistic, anthropomorphic, and personalized human-computer interaction, and it provides an optimized path for artificial-intelligence decision-making, which is of great value in opening up the era of intelligence and digitalization. In affective computing, each modality conveys a different amount of information about human affect, with different magnitudes and dimensionalities. Because any single affect channel is incomplete, affective computing should draw on as many modalities as possible to compensate for the imperfections of individual channels and determine affective tendency by fitting the combined results.

This paper examines two aspects of multimodal affective computing, multimodal sentiment analysis and multimodal emotion recognition, using tensor networks, graph convolutional networks, and multi-task learning to analyze the data and make predictions. All of our data cover three modalities (text, audio, and video), and we tune the network settings to improve prediction accuracy. The main aim of this work is to investigate the effect of multi-task learning on the overall learning outcomes of the network. Our strategy is to divide the network into two parts, multimodal fusion and multi-task learning, and to validate the method on separate but generic datasets. The fusion stage builds on the strengths of the underlying networks and avoids their weaknesses: it aims to reduce computational effort while breaking the curse of dimensionality and making it easy to extend to additional modalities. The multi-task learning stage uses pre-fusion data rather than additional data, with the aim of providing prior knowledge for multimodal learning. For all hyperparameters established in this study, we selected the optimal values by checking the mean error on the validation set during the network search. The main research results are as follows:

(1) We propose a low-rank tensor self-supervised network (LTSN). In this work, a low-rank tensor network is used to fuse the different modalities, and three unimodal labels are generated automatically by a self-supervised method to assist the tensor network in sentiment prediction. Specifically, the model can be divided into two modules. The first module decomposes the processed features into low-rank factors using a modified CP tensor decomposition so that they are fully fused, with the aim of fully mining intra-modal information. The second module feeds the same features separately into a feed-forward neural network and generates labels through a self-supervised module designed to assist the tensor fusion in exploring inter-modal information. The LTSN retains the low computational complexity of the traditional low-rank tensor network while effectively breaking the curse of dimensionality. In addition, the unimodal labels are used to construct an improved tensor network that captures both the commonalities and the distinctive characteristics of the various modalities. Our method achieves competitive results on the MOSI, MOSEI, and SIMS datasets, and ablation studies confirm the effectiveness of the unimodal assistance.
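As an illustration only, the following minimal PyTorch sketch shows how CP-style low-rank tensor fusion can be combined with auxiliary unimodal heads; the dimensions, rank, class names, and the way the generated unimodal labels enter the loss are assumptions made for this example and are not taken from the thesis.

    import torch
    import torch.nn as nn

    class LowRankFusion(nn.Module):
        # CP-style low-rank factorization of the multimodal fusion tensor.
        def __init__(self, dims, out_dim, rank):
            super().__init__()
            self.factors = nn.ParameterList(
                [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
            )
            self.fusion_weights = nn.Parameter(torch.randn(1, rank) * 0.1)
            self.fusion_bias = nn.Parameter(torch.zeros(1, out_dim))

        def forward(self, feats):  # feats: list of (batch, d_m) tensors
            fused = None
            for h, w in zip(feats, self.factors):
                ones = torch.ones(h.size(0), 1, device=h.device)
                z = torch.einsum('bd,rdo->bro', torch.cat([h, ones], dim=-1), w)
                fused = z if fused is None else fused * z  # element-wise product across modalities
            weights = self.fusion_weights.expand(fused.size(0), -1)
            return torch.einsum('br,bro->bo', weights, fused) + self.fusion_bias

    class LTSNSketch(nn.Module):
        # Fused prediction plus per-modality heads trained on generated unimodal labels.
        def __init__(self, dims, hidden=32, rank=4):
            super().__init__()
            self.fusion = LowRankFusion(dims, hidden, rank)
            self.fusion_head = nn.Linear(hidden, 1)
            self.uni_heads = nn.ModuleList(
                [nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1)) for d in dims]
            )

        def forward(self, feats):
            y_fused = self.fusion_head(self.fusion(feats))
            y_uni = [head(h) for head, h in zip(self.uni_heads, feats)]
            return y_fused, y_uni

    def joint_loss(y_fused, y_uni, y_true, uni_labels, alpha=0.1):
        # Multimodal regression loss plus weighted auxiliary losses on the unimodal labels.
        mse = nn.functional.mse_loss
        return mse(y_fused, y_true) + alpha * sum(mse(p, t) for p, t in zip(y_uni, uni_labels))

Under these assumptions, the joint loss adds the auxiliary unimodal errors to the multimodal prediction error, which is the role the self-supervised unimodal labels play in assisting the tensor fusion.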
(2) We propose a multimodal emotion recognition model, Multimodal Emotion Recognition via Multi-task Deep Graph Convolution Network (MDGCN). The model is divided into three modules. The first module is a modality encoder, which projects the input features of each modality into the same dimension. The second module links the encoded features within and across modalities, with the aim of capturing long-distance conversational context and fully mining the modal information. The third module connects only nodes of the same modality in order to provide prior knowledge for fusion. The graph convolution scales linearly in the number of edges and learns hidden-layer representations that encode local graph structure and node features, capturing contextual information across long-distance conversation turns. To further exploit the multimodal information, we use multi-task learning to provide prior knowledge for fusion, which improves the robustness of the model. Our model achieves competitive results on the IEMOCAP and MELD datasets, and its performance is significantly better than that of other state-of-the-art methods.
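For illustration, a minimal sketch of graph-convolutional fusion over a conversation graph follows; how the intra- and inter-modal adjacency matrices are built, the layer sizes, and the class names are assumptions for this example rather than the exact design of MDGCN.

    import torch
    import torch.nn as nn

    def normalize_adj(adj):
        # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in the standard GCN.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        return d_inv_sqrt @ a @ d_inv_sqrt

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj_norm):
            # Neighborhood aggregation followed by a linear transform (dense here for
            # brevity; a sparse adjacency makes this linear in the number of edges).
            return torch.relu(self.lin(adj_norm @ x))

    class MDGCNSketch(nn.Module):
        # Module 1: per-modality encoders to a shared dimension.
        # Module 2: GCN over intra- and inter-modal edges (long-distance context).
        # Module 3: GCN over same-modality edges only, used as the auxiliary task.
        def __init__(self, in_dims, hidden, n_classes):
            super().__init__()
            self.encoders = nn.ModuleList([nn.Linear(d, hidden) for d in in_dims])
            self.cross_gcn = GCNLayer(hidden, hidden)
            self.intra_gcn = GCNLayer(hidden, hidden)
            self.classifier = nn.Linear(hidden, n_classes)
            self.aux_classifier = nn.Linear(hidden, n_classes)

        def forward(self, feats, adj_cross, adj_intra):
            # feats: list of (num_utterances, d_m) tensors, one per modality;
            # nodes are stacked so each utterance appears once per modality.
            nodes = torch.cat([enc(h) for enc, h in zip(self.encoders, feats)], dim=0)
            h_cross = self.cross_gcn(nodes, normalize_adj(adj_cross))
            h_intra = self.intra_gcn(nodes, normalize_adj(adj_intra))
            return self.classifier(h_cross), self.aux_classifier(h_intra)

Under these assumptions, a typical training objective would add the auxiliary same-modality classification loss to the main emotion loss with a small weight, mirroring the multi-task role of the third module.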