| Scene recognition is an image recognition problem that aims to identify the location and scene in which a given image was taken. It is one of the most important problems in computer vision. Rather than listing the objects in a scene, scene recognition provides a basic description of the image content, which better helps a computer understand its surrounding environment. Scene recognition has important applications in many areas, such as robotics and autonomous driving. However, the category of a scene is determined not only by its objects but also by its semantic regions, hierarchical structure, and spatial layout. Because of complicated appearances, subtle differences, and ambiguous categorization, scene images generally show large intra-class variation and high inter-class similarity, which makes scene recognition more challenging than other image recognition problems. With the development of deep learning in recent years, image recognition has achieved important breakthroughs and recognition accuracy has improved greatly. Deep learning methods, represented by convolutional neural networks, have become the dominant approach to scene recognition. However, compared with other image recognition problems such as object recognition and semantic segmentation, the accuracy of deep learning models on scene recognition is still insufficient. The subject of this thesis is multi-label scene recognition based on meta-learning and multi-attention spatial layout learning. The thesis focuses on the impact of attention mechanisms, scene spatial layout structure, and meta-learning on scene recognition. First, this thesis introduces the structures of convolutional neural networks, objective functions, optimization algorithms, and methods for improving the generalization ability of convolutional neural networks. Two general convolutional neural network architectures, GoogLeNet and ResNet, are compared, and their core modules and
design ideas are introduced in detail. The thesis then summarizes the development and main challenges of current scene recognition algorithms. To improve the accuracy of deep learning models in scene recognition, a new neural network model suited to the scene recognition task is designed. The model consists of two independent branches: a scene recognition branch and a spatial layout branch. The scene recognition branch is derived from ResNet-50. To further improve the performance of the network, this thesis makes improvements in the following five aspects. First, some design ideas from the Transformer are applied to the convolutional neural network: the original bottleneck structure is replaced by an inverted bottleneck with an expansion ratio of 4, the same as the expansion ratio of the fully connected layers in Swin-T. At the same time, depthwise convolution is used to make the network more lightweight without reducing recognition accuracy. Second, following the guiding principle of “use more groups, expand width” from ResNeXt, this thesis increases the network width from 64 to 96, which is consistent with the number of channels in Swin-T. Third, to obtain a larger receptive field, this thesis studies the effect of convolution kernel size on the model’s performance in scene recognition. The experimental results show that the model performs best with a 7 × 7 convolution kernel, and that larger kernels bring no further gain. Fourth, this thesis combines the channel attention module, the SE block, with the inverted bottleneck to improve the model’s performance by strengthening the formation of feature attention.
Fifth, this thesis fuses the outputs of different layers of the network using the global attention upsampling module to enrich the multi-scale information of the image. The experimental results show that these changes effectively improve the model’s scene recognition accuracy. The spatial layout branch combines the feature extraction ability of convolutional neural networks with the relationship modeling ability of the Transformer to learn the spatial layout information of scene images. The network first uses a convolutional neural network to extract image features and then uses a Transformer to model spatial structure relationships. To represent the spatial layout structure of an image, this thesis proposes a randomized partitioning pooling method for feature maps, which divides the feature map extracted by the convolutional neural network in multiple patterns; the results are used as the input to the Transformer to compute potential key spatial layout information. As a supplement to the scene recognition branch, the spatial layout branch further completes the feature representation for scene recognition. The experimental results show that the proposed method achieves 61.76% top-1 accuracy and 89.92% top-5 accuracy on the Places365-Standard dataset, which outperforms other current scene recognition models. Finally, this thesis studies scene recognition in the few-shot learning setting and introduces meta-learning and few-shot learning in detail. Meta-learning has become the main approach to few-shot learning in supervised learning. The thesis then analyzes and compares different few-shot learning algorithms, including MAML, the prototype network, and the relation network. It compares the scene recognition performance of different meta-learning-based few-shot learning methods and the impact of different backbone networks on few-shot scene recognition. The experimental results show that deeper backbone networks can reduce the performance differences among
different few-shot learning methods, and that the method proposed in this thesis achieves the best few-shot scene recognition accuracy with the prototype network. |
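
As a rough illustration of the prototype network idea mentioned above, the sketch below averages the support embeddings of each class into a prototype and assigns a query to the nearest prototype by Euclidean distance. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `class_prototypes` and `classify`, the scene labels, and the 2-D embeddings are all hypothetical, and in practice the embeddings would come from a trained backbone network.

```python
import math
from collections import defaultdict


def class_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for label, embedding in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-way, 2-shot episode with made-up 2-D embeddings
# (a real system would embed images with a trained backbone).
support = [
    ("beach", [0.9, 0.1]),
    ("beach", [1.1, -0.1]),
    ("forest", [-1.0, 0.2]),
    ("forest", [-0.8, 0.0]),
]
prototypes = class_prototypes(support)
prediction = classify([0.7, 0.05], prototypes)
```

Here `prediction` is the label of the closest class prototype. Because only the embedding function is learned, the quality of the backbone largely determines how well the prototypes separate scene classes.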