With the advent of various social media platforms, people have diversified the ways they express opinions online. As a result, sentiment analysis has become a thriving subfield of natural language processing that aims to predict a speaker's emotional state from limited modal data. As artificial intelligence develops, researchers are exploring ways to equip computers with the ability to analyze human emotions, making this a hot topic in fields such as human-computer interaction and recommender systems. Given the multiple modalities present in social media, researchers face challenges in extracting the implicit information carried by each modality. Exploring sentiment features across modalities and improving the fusion of multimodal features have therefore become significant areas of interest in multimodal sentiment analysis. This thesis investigates feature extraction and fusion for multimodal sentiment analysis tasks, including the following:

(1) This thesis proposes a novel method to address the challenges posed by the differences and complementarity of data from different modalities in multimodal sentiment analysis. The method combines nonverbal representation optimization networks with contrastive learning, using neural networks adapted to each type of modal data. In addition, the post-fusion stage incorporates inter-modal interaction contrastive learning, which helps the model learn the complementary and discriminative information among modalities. To address the poor quality of nonverbal sequences, this thesis designs two representation learning networks based on a self-attention mechanism for the nonverbal modal features. These networks provide better representations for fusion and allow the method to extract data from each modality in a targeted manner. The proposed method effectively learns inter-modality complementarity and similarity information for multimodal sentiment analysis. Extensive experiments on two publicly available multimodal datasets show significant advantages over strong previous baseline models.

(2) To address the problems of excessive redundant information and loss of key information in multimodal data fusion, this thesis proposes a gated recurrent hierarchical fusion network for multimodal sentiment analysis, which dynamically exchanges information among three representation combinations: text-acoustic, text-visual, and acoustic-visual. This enables full inter-group interaction learning between the representation combinations, effectively eliminating redundant information between modality combinations while retaining as much of the representation useful for prediction as possible. At the same time, inspired by distribution matching, this thesis considers the mutual influence between different modalities: in the representation acquisition stage, the nonverbal sequences and the text sequence pass through a cross-modal attention channel, which elicits the latent representation information contained in the peer modality and brings each modality representation closer to the real emotional expression. Extensive experiments on two popular multimodal datasets show that this approach is highly competitive with complex previous baseline models.
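To make the pairwise fusion idea in (2) concrete, the following is a minimal sketch of a gated pairwise fusion module in PyTorch. It is an illustration only, not the thesis's actual implementation: the class name PairwiseGatedFusion, the 128-dimensional utterance-level features, and the simple sigmoid gates are assumptions made for this example, and the recurrent hierarchy and cross-modal attention channel described above are deliberately omitted.

import torch
import torch.nn as nn


class PairwiseGatedFusion(nn.Module):
    # Illustrative pairwise gated fusion: each modality pair (text-acoustic,
    # text-visual, acoustic-visual) is projected to a joint representation,
    # an element-wise gate decides how much of it to keep, and the three
    # gated pair representations are concatenated for sentiment prediction.
    def __init__(self, dim: int = 128):
        super().__init__()
        pairs = ("ta", "tv", "av")
        self.gates = nn.ModuleDict({
            p: nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid()) for p in pairs
        })
        self.projs = nn.ModuleDict({p: nn.Linear(2 * dim, dim) for p in pairs})
        self.head = nn.Linear(3 * dim, 1)  # regression head for a sentiment score

    def fuse_pair(self, pair, x, y):
        joint = torch.cat([x, y], dim=-1)
        gate = self.gates[pair](joint)         # values in (0, 1), one per feature
        return gate * self.projs[pair](joint)  # suppress redundant joint features

    def forward(self, text, audio, vision):
        h_ta = self.fuse_pair("ta", text, audio)
        h_tv = self.fuse_pair("tv", text, vision)
        h_av = self.fuse_pair("av", audio, vision)
        return self.head(torch.cat([h_ta, h_tv, h_av], dim=-1))


# Toy usage with random utterance-level features (batch of 4, 128 dimensions).
model = PairwiseGatedFusion(dim=128)
t, a, v = (torch.randn(4, 128) for _ in range(3))
score = model(t, a, v)  # shape: (4, 1)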
In summary, this thesis investigates multimodal feature extraction and multimodal data fusion for multimodal sentiment analysis, designing a nonverbal representation optimization network to extract high-quality representations and proposing a gated recurrent hierarchical fusion network to improve the fusion method. This study aims to deepen our understanding of the future development of multimodal sentiment analysis and to provide new research methods for feature extraction and data fusion in the multimodal domain, which has important scientific significance and social value.