Image classification is a fundamental research task in computer vision, and the practical application of its results brings great convenience to human life. Although image classification technology has developed rapidly in fields such as autonomous driving, medical image analysis, and intelligent security, it still faces many difficulties. Fine-grained image classification is an important part of the image classification task; its goal is to accurately identify the subcategory of the target object in an image. Fine-grained images exhibit small differences between subcategories, while objects within the same subcategory can differ greatly due to shooting angle and other factors, making classification very difficult. The fine-grained classification task therefore poses greater challenges and carries higher research value. Most existing fine-grained image classification models focus on a single modality, using deep neural networks to classify the visual features of an image. With the rapid growth of multimodal data on the Internet, such complex data has increased the difficulty of data management, but it also supplies richer information and more opportunities for scientific research. In classification tasks, the features of the various modalities of multimodal data (visual, audio, text, etc.) can complement each other in the same semantic space, providing richer information for classification and improving accuracy. Considering that real-world data contains a large number of natural language descriptions, and that local region detection is a major difficulty for fine-grained image classification methods, this paper studies a fine-grained image dataset containing both image and text modalities, and a multimodal
fine-grained image classification method based on a mutual attention alignment mechanism is proposed. From the large amount of unstructured natural language text corresponding to each image, adjective-noun phrases that describe attribute information of the object in the image are extracted; these phrases are encoded through word embeddings and a recurrent neural network to obtain low-dimensional vector representations in the text space. The visual and textual modal features are then learned jointly to obtain more discriminative local features, which are used to train the classifier. The main contributions of this paper are as follows: (1) Aiming at the complementarity between the image and text modalities, and in order to fully mine the more discriminative regional information in both, a fine-grained image classification method based on a mutual attention alignment mechanism is proposed, which simultaneously learns the more discriminative local features in images and text and improves classification accuracy. (2) Text information is used to guide the model toward distinguishable local features in the image, effectively avoiding a complex local region detection network and reducing the complexity of the algorithm. (3) Adjective-noun phrases are extracted from the natural language text, and combinations of these phrases accurately express the image features, balancing the small inter-class differences between fine-grained subcategories against the large intra-class differences within each subcategory. The experimental
results show that the large amount of semantic information contained in the text can complement the visual information in the image, enhancing the feature expression of fine-grained images. On the two fine-grained image classification datasets CUB-200-2011 and Oxford 102 Flowers, accuracy improvements of 2.03% and 2.23%, respectively, were obtained compared with classification without textual auxiliary information.
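The mutual attention alignment described above can be illustrated with a minimal sketch: given a set of image region features and a set of encoded adjective-noun phrase features, a cross-modal similarity matrix is computed and used to weight the regions and phrases for each other. This NumPy illustration is an assumption about the general form of such a mechanism, not the thesis's exact formulation; the dot-product similarity, max-pooling over the similarity matrix, and function names are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(img_regions, txt_phrases):
    """Mutually align image regions and text phrases (illustrative sketch).

    img_regions: (R, d) array, one row per image region feature
    txt_phrases: (P, d) array, one row per adjective-noun phrase embedding
    Returns a pair of attended feature vectors, each of shape (d,).
    """
    # Cross-modal similarity between every region and every phrase.
    sim = img_regions @ txt_phrases.T           # (R, P)
    # Text-guided attention: weight each region by its best-matching phrase.
    img_weights = softmax(sim.max(axis=1))      # (R,)
    img_feat = img_weights @ img_regions        # (d,)
    # Image-guided attention: weight each phrase by its best-matching region.
    txt_weights = softmax(sim.max(axis=0))      # (P,)
    txt_feat = txt_weights @ txt_phrases        # (d,)
    return img_feat, txt_feat
```

In a full model, the two attended vectors would be fused (e.g. concatenated) and fed to the classifier, so that regions emphasized by descriptive phrases dominate the visual representation and vice versa.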