
Research And Application Of Multimodal Learning On Clothing Image Understanding And Generation

Posted on: 2022-11-23    Degree: Master    Type: Thesis
Country: China    Candidate: X R Li    Full Text: PDF
GTID: 2481306779463104    Subject: Computer Software and Application of Computer
Abstract/Summary:
Since the rise of deep learning in 2010, it has revolutionized speech recognition, computer vision, and natural language processing. In classical deep learning tasks, each task focuses on a single modality of inputs and outputs. With the development of artificial intelligence, however, more and more work considers multimodal intelligence, which relies not only on progress in the individual modalities but also on handling the fusion between them.

Among deep learning applications, the fashion industry is showing its importance. The industry is entering a phase of rapid growth as e-commerce platforms for its products and services mature. As a result, thousands of clothing styles are designed every year, which in turn creates new problems: sellers must not only spot the popular items each season, but also register every commodity manually and recommend, from a multitude of clothing, items that satisfy each customer. To address these needs, our work takes vision-language multimodal learning as its starting point and designs the following solutions.

First, we designed an end-to-end clothing image captioning network that includes clothing attribute detection and visual attention modules. We transfer a convolutional neural network pre-trained on the ImageNet dataset to the clothing attribute detection task, so that the model acquires visual awareness of clothing attributes. This attribute detection network is then added to a vision-language encoder-decoder model, where it serves as the clothing image encoder. In the decoder, an LSTM network first generates the initial linguistic information; a spatial attention module and an attention gating unit then refine the feature information to produce the final text description. Experiments show that these methods substantially improve the model's performance and effectively reduce the interference of erroneous signals.

Next, we briefly investigated image-text matching for the fashion domain, designing a modular matching framework based on a classical two-stream network. In detail, we experimented with several visual and language feature extraction modules and evaluated the performance of several mainstream contrastive learning objectives for multimodal alignment on the Fashion IQ dataset. This work provided a solid foundation for the subsequent clothing image retrieval.

Finally, building on the previous work, we designed an interactive clothing image retrieval framework. Given a reference image and user feedback, the framework retrieves clothing images from an image library that meet the user's requirements. The framework has a modular structure divided into a semantic extraction module, an image feature encoding module, and a multimodal feature fusion module. The semantic module uses a lightweight BERT network trained by knowledge distillation, which can flexibly handle keyword feedback or natural-language feedback from users. The image feature encoding module uses a pre-trained convolutional neural network to extract information such as edges and textures from the clothing images, while the multimodal feature fusion network uses a gating unit to rewrite the visual features of the reference image and match the modified features against the target image features. The whole end-to-end model can be trained jointly, so the feature extraction submodules can also be optimized to extract the features the task requires. Extensive experimental results show that our model outperforms other state-of-the-art methods in retrieval performance; it also scales well and approaches the retrieval performance of large pre-trained models.
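To make the captioning decoder concrete, the following PyTorch-style sketch shows one decoding step in which an LSTM state attends over spatial CNN features and a sigmoid gate filters how much visual context reaches the word prediction. All names, dimensions, and layer choices here are hypothetical illustrations of the general LSTM-plus-spatial-attention-plus-gating idea described in the abstract, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGatedDecoderStep(nn.Module):
    """One caption-decoding step: embed the previous word, update the LSTM
    state, attend over the K spatial locations of the CNN feature map, and
    gate the attended context before predicting the next word."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        # spatial attention: score each of the K feature-map locations
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # attention gate: decides how much visual context to let through
        self.gate = nn.Linear(hidden_dim, feat_dim)
        self.out = nn.Linear(hidden_dim + feat_dim, vocab_size)

    def forward(self, word_ids, feats, state):
        # feats: (B, K, feat_dim) flattened CNN feature map; state: (h, c)
        h, c = self.lstm(self.embed(word_ids), state)
        scores = self.att_score(
            torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)               # (B, K, 1) attention map
        context = (alpha * feats).sum(dim=1)           # (B, feat_dim)
        gated = torch.sigmoid(self.gate(h)) * context  # suppress noisy context
        logits = self.out(torch.cat([h, gated], dim=-1))
        return logits, (h, c)

# usage: one step over a flattened 7x7 feature map for a batch of 2
step = AttentionGatedDecoderStep(feat_dim=512, embed_dim=256,
                                 hidden_dim=256, vocab_size=1000)
feats = torch.randn(2, 49, 512)
state = (torch.zeros(2, 256), torch.zeros(2, 256))
logits, state = step(torch.tensor([1, 2]), feats, state)
```

At generation time this step would be applied repeatedly, feeding each predicted word back in until an end-of-sequence token is produced.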
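The gated feature-rewriting idea in the retrieval framework can likewise be sketched in a few lines. The module below combines a reference-image feature with a text-feedback feature: a sigmoid gate selects which visual dimensions to preserve and a residual branch injects the requested modification, after which the rewritten query is matched against target-image features by cosine similarity. This is a minimal sketch in the spirit of gated fusion methods such as TIRG; the dimensions and names are assumptions, not the thesis's actual fusion module.

```python
import torch
import torch.nn as nn

class GatedFeatureRewriter(nn.Module):
    """Rewrite a reference-image feature using a text-feedback feature:
    gate(img, txt) * img keeps the visual content the feedback preserves,
    while the residual branch adds the modification it requests."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.res = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, img_feat, txt_feat):
        joint = torch.cat([img_feat, txt_feat], dim=-1)
        return self.gate(joint) * img_feat + self.res(joint)

def cosine_match(query, target):
    """Score a rewritten query feature against a target-image feature."""
    return nn.functional.cosine_similarity(query, target, dim=-1)

# usage: rewrite 4 reference features and score them against 4 targets
fuser = GatedFeatureRewriter(dim=512)
query = fuser(torch.randn(4, 512), torch.randn(4, 512))
sims = cosine_match(query, torch.randn(4, 512))
```

Training such a module end to end with the text and image encoders, e.g. under a contrastive loss over these similarity scores, is what lets the upstream feature extractors adapt to the retrieval task, as the abstract notes.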
Keywords/Search Tags:Vision Language, image retrieval, multimodal, image caption, fashion AI