Research On Cross-modal Retrieval And Recognition Of Visual And Text

Posted on: 2022-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: W B Wang
Full Text: PDF
GTID: 2518306560955289
Subject: Computer Science and Technology
Abstract/Summary:
Multi-modal data refers to the multiple forms in which the same thing can be expressed, including text, images, audio, and so on. Although multi-modal data of the same class describe the same object, their forms of expression differ completely, leaving a large semantic gap between them. With the rapid development of information technology, multi-modal data has kept growing; while it enriches people's information life, it also raises problems that urgently need solving: how to search for the information one needs, and how to identify specific data, within massive and disordered multi-modal collections. Multi-modal research analyzes the internal relations between different modalities by technical means in order to bridge the semantic gap between them. This thesis presents two main pieces of work, one in cross-modal retrieval and one in cross-modal recognition.

(1) Information retrieval is an important way to cope with the information explosion, and with the growth of multimedia data, cross-modal retrieval has become an active branch of it. To this end, this thesis proposes a cross-modal hash retrieval algorithm: it finds a common low-dimensional semantic space for the heterogeneous data of different modalities and then performs the retrieval task in that space. Recent work on supervised cross-modal hashing has achieved high retrieval accuracy, yet two challenges remain: how to preserve, in the common space, the local geometric structure and the similarity that the data exhibit in the original space, and how to exploit the supervised information effectively. To address these issues, when learning the common space by matrix factorization, our method uses as constraints both the original-space similarity, obtained by modeling the intra-modal and inter-modal similarity of the data, and the category information contained in the supervision, thereby improving the retrieval results. Extensive experiments on two publicly available datasets show that our approach is effective and outperforms state-of-the-art methods.

(2) Lip-reading, also known as visual speech recognition, is an image-to-text recognition task: it recognizes what a speaker is saying from consecutive image frames of the lip region. To address the lack of a sentence-level Chinese lip-reading dataset, we first propose a procedure for collecting Chinese lip-reading data, collect a dataset for our experiments, and verify its validity. Most current lip-reading methods capture temporal information with recurrent neural networks; however, the association information within a sentence occurs at multiple time scales, and RNNs cannot mine temporal sequences at multiple scales. We therefore propose a sentence-level lip-reading method based on temporal convolutional networks, which improves recognition by mining temporal information over different sequence lengths with multi-scale temporal convolutions. Comparative experiments against several baseline methods verify the effectiveness of the proposed approach.
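The abstract does not spell out the retrieval objective, so the following is only a minimal sketch of the kind of collective matrix factorization it describes: two modalities are factorized against a shared latent matrix V, a graph-Laplacian term preserves original-space similarity, a label-regression term injects the supervised category information, and hash codes are the signs of V. All variable names, the ridge regularization, and the choice of a gradient step for V are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def learn_hash_codes(X1, X2, L, Y, k=16, alpha=1.0, beta=1.0,
                     lam=1e-2, lr=1e-3, n_iter=200, seed=0):
    """Collective matrix factorization into a shared latent space V.

    X1 : (n, d1) image features        X2 : (n, d2) text features
    L  : (n, n) graph Laplacian built from original-space similarities
         (preserves local geometric structure)
    Y  : (n, c) one-hot labels (supervised category information)

    Objective minimized by alternating updates:
        ||X1 - V W1^T||^2 + ||X2 - V W2^T||^2   # per-modality reconstruction
        + alpha * tr(V^T L V)                   # original-space similarity
        + beta  * ||V - Y P||^2                 # label constraint
        + lam   * regularization
    Hash codes are the signs of V.
    """
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    V = 0.01 * rng.standard_normal((n, k))
    I_k = np.eye(k)
    for _ in range(n_iter):
        # Closed-form ridge updates for the modality bases and label projection
        G = np.linalg.inv(V.T @ V + lam * I_k)
        W1 = X1.T @ V @ G
        W2 = X2.T @ V @ G
        P = np.linalg.solve(Y.T @ Y + lam * np.eye(Y.shape[1]), Y.T @ V)
        # Gradient step on V: reconstruction + Laplacian + label terms
        grad = (V @ (W1.T @ W1) - X1 @ W1) + (V @ (W2.T @ W2) - X2 @ W2) \
               + alpha * (L @ V) + beta * (V - Y @ P) + lam * V
        V -= lr * grad
    return np.sign(V), W1, W2  # binary codes and modality projections
```

At query time, a new sample from either modality would be projected into the latent space with its modality's basis (e.g. a ridge regression onto W1) and binarized with sign, so cross-modal retrieval reduces to Hamming-distance ranking.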
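Likewise, the abstract names multi-scale temporal convolutions but no concrete architecture, so this PyTorch sketch only illustrates the idea: parallel 1-D convolutions with different kernel sizes see different temporal extents of the frame-feature sequence, and their outputs are fused with a residual connection. The front-end feature extractor, feature dimension, vocabulary size, and a CTC-style training objective are all assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleTCNBlock(nn.Module):
    """Parallel temporal convolutions with different kernel sizes
    (different receptive fields), concatenated, fused by a 1x1
    convolution, and added back through a residual connection."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        branch_ch = channels // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, branch_ch, k, padding=k // 2),
                nn.BatchNorm1d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv1d(branch_ch * len(kernel_sizes), channels, 1)

    def forward(self, x):                 # x: (batch, channels, time)
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.relu(x + self.fuse(y))

class LipReadingHead(nn.Module):
    """Maps per-frame lip features (e.g. from a CNN front end) to
    per-frame token logits; a CTC loss could align them with the
    target character sequence for sentence-level recognition."""
    def __init__(self, feat_dim=512, vocab_size=4000, n_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(*[MultiScaleTCNBlock(feat_dim)
                                      for _ in range(n_blocks)])
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):             # feats: (batch, time, feat_dim)
        x = self.blocks(feats.transpose(1, 2)).transpose(1, 2)
        return self.classifier(x)         # (batch, time, vocab_size)
```

For example, `LipReadingHead()(torch.randn(2, 75, 512))` yields per-frame logits over the vocabulary for a 75-frame clip; unlike an RNN, each block mixes information at several temporal scales in a single layer.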
Keywords/Search Tags:cross-modal retrieval, matrix factorization, hash, deep learning, lip-reading