With the widespread adoption of the Internet and the rapid development of multimedia technology, Internet users generate massive amounts of media data of different types (text, image, audio, and video) in daily life. These different types of data are usually called multi-modal data. Users' demand for retrieving such multi-modal data keeps growing, and cross-modal retrieval has become an active research topic in academia. Cross-modal retrieval refers to using a query from one modality to retrieve semantically relevant samples from another modality, for example, retrieving images with a text query. Cross-modal retrieval currently faces two challenges, the "heterogeneity gap" and the "semantic gap" between multi-modal data, which make it difficult to measure cross-modal similarity directly.

To address these two problems, many cross-modal retrieval methods have been proposed in recent years. Their core idea is to map multi-modal data into a common subspace, mine the correlations among different modalities, and thereby enable cross-modal similarity measurement. Existing cross-modal retrieval methods fall into two main categories: real-valued representation methods and binary representation methods.

Real-valued representation methods aim to map multi-modal data into a common real-valued subspace in which cross-modal semantic similarity can be measured. However, existing methods suffer from weak feature extraction, loose modal association, limited cross-modal interaction, and a limited ability to preserve modal consistency, leaving considerable room for improving retrieval performance. To this end, this paper proposes a new real-valued representation retrieval method that fully exploits the semantic associations within and between modalities through a dual-attention mechanism to improve cross-modal retrieval accuracy.

Unlike real-valued representation methods, binary representation methods offer lower representation dimensionality, lower storage cost, and faster similarity computation, which makes binary representations better suited to big-data scenarios. However, existing binary representation methods struggle to fully learn the structural associations in the hash space and to improve the semantic discriminability of cross-modal binary representations. To address these problems, this paper proposes a cross-modal hash retrieval method based on multi-label semantic fusion.

The main work and contributions of this paper are as follows:

(1) This paper presents a cross-modal retrieval method based on dual attention and generative adversarial learning. The method is an adversarial semantic representation model with a dual-attention mechanism, i.e., intra-modal attention and inter-modal attention. Intra-modal attention focuses on the critical semantic features within a modality, while inter-modal attention explores the semantic interactions between different modalities to represent high-level semantic relevance more accurately. Consistent cross-modal feature distributions are learned through intra-modal and inter-modal adversarial losses, effectively reducing cross-modal heterogeneity. The effectiveness of the method is verified by experimental comparisons on public multi-modal datasets.
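To make the dual-attention design concrete, the following is a minimal PyTorch sketch. All dimensions, module names, and the use of standard multi-head attention are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of intra-modal (self) and inter-modal (cross) attention.
# Feature dimensions and head counts are assumptions for illustration.
import torch
import torch.nn as nn

class IntraModalAttention(nn.Module):
    """Self-attention over the regions/words of a single modality."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq, dim)
        out, _ = self.attn(x, x, x)            # query = key = value = x
        return out + x                         # residual connection

class InterModalAttention(nn.Module):
    """Cross-attention: one modality's features query the other's."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q, kv):                  # q from one modality, kv from the other
        out, _ = self.attn(q, kv, kv)
        return out + q

# Usage: refine each modality with intra-modal attention, then let image
# features attend to text features to capture inter-modal interactions.
img = torch.randn(32, 36, 512)                 # e.g. 36 region features per image
txt = torch.randn(32, 20, 512)                 # e.g. 20 word features per sentence
intra_img, intra_txt = IntraModalAttention(), IntraModalAttention()
inter = InterModalAttention()
img_refined = inter(intra_img(img), intra_txt(txt))
```

Intra-modal attention re-weights features within one modality, while inter-modal attention lets one modality attend to the other, which is one common way to realize the semantic interactions described above; the adversarial losses would then be applied on top of these refined features.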
(2) This paper proposes a cross-modal hashing method based on deep multi-label semantic fusion. The method uses two deep neural networks to extract features from the two modalities separately and introduces a multi-label semantic fusion module that injects multi-label semantics into the cross-modal feature-learning process, so that the learned representations carry more latent label-category information. Finally, a graph regularization term preserves the semantic similarity of the cross-modal hash codes in Hamming space. The method's effectiveness and superiority are verified by performance comparisons against baseline methods on cross-modal datasets.
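As a rough illustration of how multi-label fusion and graph regularization could be combined, the sketch below fuses a multi-hot label vector into the features before hashing and penalizes Hamming-space distance between semantically similar pairs. Layer sizes, the additive fusion scheme, and the label-overlap similarity are all assumptions for illustration, not the paper's exact design.

```python
# Sketch of label-aware hashing with a graph-regularization term.
import torch
import torch.nn as nn

class LabelFusionHashNet(nn.Module):
    """Fuses multi-label embeddings into modality features, then hashes."""
    def __init__(self, feat_dim=512, num_labels=24, code_len=64):
        super().__init__()
        self.label_embed = nn.Linear(num_labels, feat_dim)   # multi-hot labels -> embedding
        self.hash_layer = nn.Linear(feat_dim, code_len)

    def forward(self, feat, labels):
        fused = feat + self.label_embed(labels)              # inject label semantics
        return torch.tanh(self.hash_layer(fused))            # relaxed codes in (-1, 1)

def graph_regularization(codes, sim):
    """Pull codes of semantically similar pairs together.
    sim[i, j] in [0, 1]: label-based similarity of samples i and j."""
    dist = torch.cdist(codes, codes, p=2) ** 2
    return (sim * dist).mean()

# Usage with random stand-in data:
net = LabelFusionHashNet()
feat = torch.randn(16, 512)                                  # image or text features
labels = torch.randint(0, 2, (16, 24)).float()               # multi-hot label vectors
codes = net(feat, labels)
sim = (labels @ labels.t() > 0).float()                      # share at least one label
loss = graph_regularization(codes, sim)
```

For ±1 codes, squared Euclidean distance is proportional to Hamming distance, so minimizing the weighted distances pulls the relaxed codes of same-label samples together in Hamming space; a full objective would also push dissimilar pairs apart and add a quantization penalty.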