
Key Technologies and Applications of Multimodal Signal Analysis and Understanding: A Research Study

Posted on: 2024-01-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Lei
GTID: 1528306932962769
Subject: Electronics and information
Abstract/Summary:
With the increasing popularity of mobile devices and the mobile internet, many aspects of people’s work and life have entered an era of online service interaction. Online services cover daily communication, online shopping, video entertainment, professional consulting, and other scenarios. In these rapidly growing services, users and service providers interact online in various ways. Unlike traditional offline services and early internet services, the information carrier of these online interactions is multimedia data, which comprises complex signals of different modalities, including text, visual, speech, and user signals. Semantic understanding and parsing of these complex multimodal signals has therefore become a cornerstone for online services to function well. Only with good semantic abstraction of these signals, including fusion, transformation, alignment, transfer, and representation, can the material be truly organized, so that users obtain information and services more efficiently, service providers understand user needs more clearly, and platforms store data in structured form and improve their products more effectively.

Against this background, the primary focus of this dissertation is the in-depth analysis and modeling of multimodal signals. Based on massive multimodal data from the mobile internet, this dissertation explores and applies advanced techniques and research paradigms, such as self-supervised learning and pretraining-finetuning, to develop modeling approaches for the multimodal pre-training of large-scale models. This enables high-level semantic understanding and relational representation of multimodal signals, providing a technological basis for key applications in multimodal signal analysis and understanding. Furthermore, to better evaluate and validate the semantic understanding and representation capabilities of the learned multimodal pre-trained models, this dissertation selects question-answering systems and personalized recommendation systems based on multimodal signal understanding as representative and critical applications for practical implementation. These two applications require the model to have a strong ability for multimodal signal comprehension and reasoning, and to handle complex signals involving user structural information and structured inputs. They therefore serve as effective means for applying and evaluating multimodal signal analysis and understanding models. This research is supported by the National Key R&D Program of China, "Online Consultation and Services based on Question-Answering System" (Project No. 2020YFC0832505). The modeling of multimodal signals based on pre-training techniques, as well as the question-answering and personalized recommendation systems based on multimodal signal understanding, are key technologies and applications within this project.

The main research work, innovative achievements, and core contributions of this dissertation are as follows:

(1) This dissertation proposes a Chinese multimodal signal analysis and understanding algorithm based on large-scale pre-training. It constructs a massive Chinese multimodal video dataset from real-world internet data, containing over 10 million complete videos with manual text descriptions, which greatly enriches the Chinese multimodal data corpus. Based on this high-quality dataset, it proposes a novel video-language learning framework that follows the pre-training paradigm. It introduces a variety of innovative proxy tasks and learning mechanisms, which make the multimodal pre-training model more robust and capture complex multimodal semantic signals and structural relationships from different perspectives. It also explores targeted model compression algorithms that reduce the size of multimodal pre-training models, making them suitable for online deployment. The proposed pre-trained model is applied to a series of downstream tasks and demonstrates its superiority over state-of-the-art pre-training methods. The evaluation covers multiple datasets from general and professional fields.

(2) This dissertation proposes a multimodal question-answering algorithm based on multi-question joint learning. Because online question-answering services are multi-round and interactive, multiple questions may arise simultaneously during the question-answering process. These questions often exhibit strong semantic relevance, which can be used to better infer the questioner’s intention and help the question-answering system provide better replies. This dissertation therefore proposes a novel and practical multimodal, multi-question training framework based on an attention mechanism, which effectively improves the accuracy of the multimodal question-answering system. Experimental results on multiple public datasets demonstrate that the proposed algorithm provides more accurate answers to the questioner in the multimodal question-answering system.

(3) This dissertation proposes a personalized video recommendation algorithm based on multimodal signal transfer learning. First, a novel pre-training technique based on the contrastive learning paradigm maps users’ behavioral interests in different scenarios into the same interest parameter space by leveraging the semantic generalization ability of multimodal signals, which makes transfer learning and joint learning of users’ interests across scenarios possible. Then, a multi-scenario transfer learning algorithm built on top of this representation better captures users’ interests and intentions in different scenarios, providing a better user experience and personalized recommendation service. This dissertation conducts experiments on a real internet user behavior dataset and compares the proposed algorithm with several representative algorithms from recent years, verifying its effectiveness.

(4) The algorithms and systems proposed in this dissertation have undergone targeted optimization and have been validated on multiple tasks in the judicial domain. On the one hand, this further validates the generalizability of the methods presented in this dissertation and their practical significance in important scenarios. On the other hand, it verifies the theoretical, technical, and practical support that this dissertation provides to the project that funds it. Specifically, the tasks include illegal content recognition and judicial dialogue generation, covering both discriminative and generative tasks in the judicial field. The experimental results and analysis confirm the support provided by this research to the relevant technologies and applications in the judicial domain. In addition, the proposed systems have undergone deployment verification in other real-world application scenarios, and system construction and implementation in practical applications have been studied and analyzed.

In summary, the research achievements of this dissertation contribute to the practice and innovation of multimodal signal semantic understanding, downstream applications, and multimedia-related topics. The deployment and practice in real online systems demonstrate that this research has significant application value and practical significance, and that it can improve the user experience and service level of critical downstream applications. Finally, the research results also help address the technical challenges and promote the application of multimodal signal understanding, question-answering technology, and personalized business requirements in the National Key R&D Program of China, "Online Consultation and Services based on Question-Answering System" (Project No. 2020YFC0832505).
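As an illustration of the contrastive learning paradigm mentioned in contribution (3), the following is a minimal sketch of a generic InfoNCE-style loss. The abstract does not specify the dissertation's actual objective, so this is not the author's method. The function names, the `temperature` value, and the reading of the two inputs as paired embeddings of the same users in two scenarios are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss over a batch of paired embeddings.

    anchors[i] and positives[i] are assumed to describe the same user in two
    scenarios (hypothetical setup); every other row in the batch acts as a
    negative. Lower loss means matched pairs are closer than mismatched ones.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a[i], p[j])) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matched pair (index i) as the target.
        total += log_denom - logits[i]
    return total / n


# Toy check: correctly paired embeddings should score a lower loss than
# deliberately mismatched ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing such a loss pulls the two views of the same user together in a shared embedding space, which is the property the abstract relies on for cross-scenario transfer.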
Keywords/Search Tags: Multimodal Signal Analysis, Multimodal Datasets, Pre-Training Techniques, Visual Question Answering, Personalized Recommendation Systems, Contrastive Learning, Representation Learning