Research On Cross-modal Retrieval And Recognition Of Visual And Text

Posted on: 2022-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: W B Wang
Full Text: PDF
GTID: 2518306560955289
Subject: Computer Science and Technology
Abstract/Summary:
Multi-modal data refers to the multiple forms in which the same thing can be expressed, including text, images, audio, and so on. Although multi-modal data of the same class describe the same object, their forms of expression differ completely, leaving a large semantic gap between them. With the rapid development of information technology, multi-modal data has kept growing; while it enriches people's information life, it also raises problems that urgently need solving: how to search for the information one needs, and how to identify specific data, within massive and disordered multi-modal collections. Multi-modal research analyzes the internal relations between different modalities by technical means in order to bridge the semantic gap between them. This thesis presents two main pieces of work, one in cross-modal retrieval and one in cross-modal recognition.

(1) Information retrieval is an important way to cope with the information explosion, and with the growth of multimedia data, cross-modal retrieval has become an active branch of it. To this end, this thesis proposes a cross-modal hash retrieval algorithm: it finds a common low-dimensional semantic space for the heterogeneous data of different modalities and then performs the retrieval task in that space. Recent work on supervised cross-modal hashing has achieved high retrieval accuracy, yet two challenges remain: how to preserve, in the common space, the local geometric structure and the similarity that the data exhibit in the original space, and how to exploit the supervised information effectively. To address these issues, when learning the common space by matrix factorization, our method uses as constraints both the original-space similarity, obtained by modeling the intra-modal and inter-modal similarity of the data, and the category information contained in the supervision, thereby improving the retrieval results. Extensive experiments on two publicly available datasets show that our approach is effective and outperforms state-of-the-art methods.

(2) Lip-reading, also known as visual speech recognition, is an image-to-text recognition task: it recognizes what a speaker is saying from consecutive image frames of the lip region. To address the lack of a sentence-level Chinese lip-reading dataset, we first propose a procedure for collecting Chinese lip-reading data, collect a dataset for our experiments, and verify its validity. Most current lip-reading methods capture temporal information with recurrent neural networks; however, the association information within a sentence occurs at multiple time scales, and RNNs cannot mine temporal sequences at multiple scales. We therefore propose a sentence-level lip-reading method based on temporal convolutional networks, which improves recognition by mining temporal information over different sequence lengths with multi-scale temporal convolutions. Comparative experiments against several baseline methods verify the effectiveness of the proposed approach.
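The abstract does not spell out the retrieval objective, so the following is only a minimal sketch of the kind of collective matrix factorization it describes: two modalities are factorized against a shared latent matrix V, a graph-Laplacian term preserves original-space similarity, a label-regression term injects the supervised category information, and hash codes are the signs of V. All variable names, the ridge regularization, and the choice of a gradient step for V are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def learn_hash_codes(X1, X2, L, Y, k=16, alpha=1.0, beta=1.0,
                     lam=1e-2, lr=1e-3, n_iter=200, seed=0):
    """Collective matrix factorization into a shared latent space V.

    X1 : (n, d1) image features        X2 : (n, d2) text features
    L  : (n, n) graph Laplacian built from original-space similarities
         (preserves local geometric structure)
    Y  : (n, c) one-hot labels (supervised category information)

    Objective minimized by alternating updates:
        ||X1 - V W1^T||^2 + ||X2 - V W2^T||^2   # per-modality reconstruction
        + alpha * tr(V^T L V)                   # original-space similarity
        + beta  * ||V - Y P||^2                 # label constraint
        + lam   * regularization
    Hash codes are the signs of V.
    """
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    V = 0.01 * rng.standard_normal((n, k))
    I_k = np.eye(k)
    for _ in range(n_iter):
        # Closed-form ridge updates for the modality bases and label projection
        G = np.linalg.inv(V.T @ V + lam * I_k)
        W1 = X1.T @ V @ G
        W2 = X2.T @ V @ G
        P = np.linalg.solve(Y.T @ Y + lam * np.eye(Y.shape[1]), Y.T @ V)
        # Gradient step on V: reconstruction + Laplacian + label terms
        grad = (V @ (W1.T @ W1) - X1 @ W1) + (V @ (W2.T @ W2) - X2 @ W2) \
               + alpha * (L @ V) + beta * (V - Y @ P) + lam * V
        V -= lr * grad
    return np.sign(V), W1, W2  # binary codes and modality projections
```

At query time, a new sample from either modality would be projected into the latent space with its modality's basis (e.g. a ridge regression onto W1) and binarized with sign, so cross-modal retrieval reduces to Hamming-distance ranking.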
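Likewise, the abstract names multi-scale temporal convolutions but no concrete architecture, so this PyTorch sketch only illustrates the idea: parallel 1-D convolutions with different kernel sizes see different temporal extents of the frame-feature sequence, and their outputs are fused with a residual connection. The front-end feature extractor, feature dimension, vocabulary size, and a CTC-style training objective are all assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleTCNBlock(nn.Module):
    """Parallel temporal convolutions with different kernel sizes
    (different receptive fields), concatenated, fused by a 1x1
    convolution, and added back through a residual connection."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        branch_ch = channels // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, branch_ch, k, padding=k // 2),
                nn.BatchNorm1d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv1d(branch_ch * len(kernel_sizes), channels, 1)

    def forward(self, x):                 # x: (batch, channels, time)
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.relu(x + self.fuse(y))

class LipReadingHead(nn.Module):
    """Maps per-frame lip features (e.g. from a CNN front end) to
    per-frame token logits; a CTC loss could align them with the
    target character sequence for sentence-level recognition."""
    def __init__(self, feat_dim=512, vocab_size=4000, n_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(*[MultiScaleTCNBlock(feat_dim)
                                      for _ in range(n_blocks)])
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, feats):             # feats: (batch, time, feat_dim)
        x = self.blocks(feats.transpose(1, 2)).transpose(1, 2)
        return self.classifier(x)         # (batch, time, vocab_size)
```

For example, `LipReadingHead()(torch.randn(2, 75, 512))` yields per-frame logits over the vocabulary for a 75-frame clip; unlike an RNN, each block mixes information at several temporal scales in a single layer.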
Keywords/Search Tags:cross-modal retrieval, matrix factorization, hash, deep learning, lip-reading