
Unified Vision-Language Representation Learning For Multimodal AI

Posted on: 2024-04-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y P Huang    Full Text: PDF
GTID: 1528307349985529    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of smart devices, social media, and e-commerce, the explosive growth of image and text data has created new demands for multimodal AI technology. At the intersection of computer vision and natural language processing, multimodal AI focuses on jointly understanding, creating, and applying visual and linguistic information. The technology improves production efficiency, inspires creativity, and offers broad social value. General-purpose techniques for multi-source data and tasks make it possible to explore the deep connections between modalities, thereby improving the performance and generality of models across a variety of tasks.

In multimodal pre-training, existing models either separate image representation extraction from cross-modal representation learning and lack effective masked vision modeling, or employ masked modeling schemes for images and text that differ substantially. These issues limit the learning of visual representations and make cross-modal alignment harder to learn, which in turn hurts performance on downstream tasks. Moreover, bidirectional image-text generation models in multimodal applications are often built on task-specific frameworks and trained independently, while multimodal dialogue models usually cannot support multiple image-text interactions, which restricts application scenarios and reduces the generality of multimodal AI technology.

To address these challenges, this thesis focuses on unified vision-language representation learning. The work unfolds in the following four aspects, aiming to narrow the gap between modalities and to improve the performance and generality of multimodal AI models:

1. To address the separation of image representation extraction from vision-language representation learning in pre-trained models, and the lack of effective masked vision modeling, this thesis proposes an end-to-end pre-training method based on image tokenization. The method jointly extracts grid-based convolutional visual features and optimizes vision-language representations. By converting images into image tokens for masked vision modeling, it learns joint vision-language representations directly from natural images and text, effectively capturing their complex interrelations and thereby improving model performance.

2. To address the large gap between images and text in pre-trained models caused by their different representations and optimization objectives, this thesis proposes a unified vision-and-language masked modeling method. Based on discrete masked tokens, the method unifies the vision and language pre-training objectives and uses a self-attention mechanism for representation learning. This reduces the discrepancy in cross-modal representation learning and improves model performance, adapting to a variety of document tasks in which either text or images are the primary focus. The method has also been effectively extended to Chinese datasets for Chinese tasks.
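As a concrete illustration of the unified masked-token idea shared by contributions 1 and 2 above, the following is a minimal sketch in which both text and image content are represented as discrete tokens in a single joint vocabulary, a fraction of the tokens is masked, and one Transformer encoder predicts the masked ids. The vocabulary sizes, masking ratio, and model dimensions are illustrative assumptions, not values taken from the thesis.

```python
# Minimal sketch (not the thesis's actual model): text tokens and discrete
# image tokens share one vocabulary and one masked-prediction objective.
import torch
import torch.nn as nn

TEXT_VOCAB = 30522                      # assumed text vocabulary size
IMAGE_VOCAB = 8192                      # assumed image-token codebook size
MASK_ID = TEXT_VOCAB + IMAGE_VOCAB      # shared [MASK] id in the joint vocabulary
JOINT_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + 1

class UnifiedMaskedModel(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(JOINT_VOCAB, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, JOINT_VOCAB)

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        return self.head(self.encoder(x))   # logits over the joint vocabulary

def mask_tokens(tokens, ratio=0.4):
    """Randomly replace a fraction of tokens with the shared mask id."""
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    return corrupted, mask

# Toy batch: text token ids followed by image token ids (offset into the joint vocab).
text = torch.randint(0, TEXT_VOCAB, (2, 32))
image = torch.randint(0, IMAGE_VOCAB, (2, 64)) + TEXT_VOCAB
tokens = torch.cat([text, image], dim=1)

model = UnifiedMaskedModel()
corrupted, mask = mask_tokens(tokens)
logits = model(corrupted)
# Cross-entropy only on masked positions, applied identically to text and image tokens.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```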
3. To address the non-unified structure of bidirectional image-text generation models, this thesis proposes a unified bidirectional image-text generation model. Built on the Transformer architecture, the model represents both images and text as sequences of tokens, achieving cross-modal generation in both directions. In addition, it introduces a two-level granularity representation and sequence-level training methods to improve the performance of the bidirectional general-purpose model. This approach simplifies the design of task-specific models, optimizes storage utilization, and improves model versatility. The unified bidirectional image-text generation model has also been extended into a framework that generates diversified image descriptions and enriches image content, enhancing generation diversity.

4. To address the limited image-text interaction modes and dataset forms in multimodal dialogue, this thesis proposes methods for expanding image-text interaction modes and constructing new types of datasets, with the goal of training dialogue models that support fine-grained image-text interactions and improving the flexibility and versatility of models in processing image-text inputs. A dialogue dataset combining fine-grained image and text interactions is constructed, and a multimodal open-ended dialogue instruction-following model is trained on it. In addition, a GPT-4-assisted benchmark test set is constructed to quantitatively assess the model's ability to handle multi-turn image-text dialogue. These methods adapt to complex image-text interaction scenarios and offer high flexibility and scalability.

The four parts of this thesis are interconnected, jointly aiming to unify vision and language representation learning and to advance the development and innovation of multimodal AI. The work not only improves the performance of multimodal models on tasks involving natural images and document images, but also extends model generality to image-text generation and multimodal dialogue applications. The resources and code of this thesis have been open-sourced, providing references and support for subsequent research and practical applications.
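As a concrete illustration of the fine-grained interleaved image-text dialogue data described in contribution 4, a single record in such a dataset might be structured roughly as follows. The field names and the <image_i> placeholder convention are hypothetical and chosen only for illustration; they are not the thesis's actual schema.

```python
# Hypothetical example of one interleaved multi-turn image-text dialogue record.
dialogue_record = {
    "images": ["kitchen.jpg", "recipe_page.jpg"],   # referenced by index below
    "turns": [
        {"role": "user",
         "content": "Here is my kitchen <image_0>. What could I cook with "
                    "the ingredients on the counter?"},
        {"role": "assistant",
         "content": "I can see tomatoes, basil, and pasta, so a simple "
                    "tomato-basil pasta would work well."},
        {"role": "user",
         "content": "I also found this recipe <image_1>. Does it match "
                    "what you suggested?"},
        {"role": "assistant",
         "content": "Yes, the pictured recipe is a close match; it only adds "
                    "garlic and olive oil to the ingredients you already have."},
    ],
}
```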
Keywords/Search Tags: Vision-Language Representation Learning, Multimodal AI, Pre-trained Models, Masked Vision Modeling, Image-Text Generation