| In recent years, with the continuous development of Internet technology and the emergence of large-scale datasets, deep learning has achieved great success in many traditional computer vision tasks because of its strong computational capabilities. The wide application of deep learning is inseparable from convolutional neural networks and recurrent neural networks. Most current research assigns an image one or more discrete labels, which can identify the object categories in the image but do not describe the relationships among the objects or what is happening in the image. A video is a continuous sequence of images. Compared with the static spatial information of a single image, a video also carries temporal information, so a video description must cover not only the content of the current frame but also the continuous actions across adjacent frames. Traditional image description and video description models do not analyze the fusion of visual and sentence information deeply enough. As convolutional and recurrent neural networks have developed, convolutional neural networks have demonstrated a remarkable ability to analyze image information in computer vision. To date, advanced computer vision algorithms remain inseparable from convolutional neural networks, which have become the core of computer vision algorithms in major international competitions and conferences. With advances in network architectures and GPU hardware, recurrent neural networks have also made breakthroughs in deep learning. Because their recurrent structure gives them memory, recurrent neural networks are efficient and practical for sequence problems involving text, speech, and video, especially in natural language processing. Based on the theoretical basis of convolutional
neural networks and recurrent neural networks, this thesis aims to propose a more efficient encoder-decoder framework and to introduce a new attention mechanism, under limited computing resources and diverse application scenarios, so as to find methods and techniques that improve certain evaluation metrics for image description and video description, enrich the joint use of convolutional and recurrent neural networks, and improve the practicality of the algorithm. The research mainly covers the following. (1) An image description model based on convolutional and recurrent neural networks. The model consists of two parts. The encoding part mainly uses a convolutional neural network to extract image features; two types of convolutional neural network are used, deep networks and deep residual networks. The decoding part uses a recurrent neural network to model natural language and generate sentences. The encoder removes the final fully connected layers of the convolutional neural network and passes its output through an adaptive pooling layer and then a fully connected layer, retaining the image features extracted by the network. The decoder is a long short-term memory (LSTM) network, whose gating units better handle the dependencies in natural-language sentences. Finally, the model is compared with traditional image description models on a public dataset. The experimental results verify the feasibility of the connection method used in this thesis, and the image description models achieve certain results on the evaluation metrics. (2) An image description model based on an attention mechanism. Considering the diversity of image content and the image regions that correspond to the words of a natural sentence, a multi-local fusion method is adopted to strengthen the correlation between them. On this basis, the multi-local attention mechanism is analyzed in detail to find the
possibility of further improvement. To make full use of the image features that correspond to related words, and so provide a reference for generating the optimal sentence, the classic image description model and models with different attention mechanisms are compared on the same dataset. Analysis of the experimental results shows that introducing the new attention mechanism into the combination of convolutional and recurrent neural networks brings an effective performance improvement. (3) A video description model based on an attention mechanism. To strengthen the correlation between video content and natural sentences, and in response to problems found in traditional algorithms such as weak temporal correlation, a multi-local fusion attention mechanism is introduced into a model with a deeper and wider network, which not only reduces the extraction of video information but also, to a certain extent, relates the video content to the description sentence. The model is compared with the classic model and with models of the same type on a large dataset. The experimental results show that the model effectively improves the description metrics while preserving the correctness of the description sentences. |
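
As a rough illustration of the attention idea behind items (2) and (3), the sketch below shows generic soft attention over image-region features in plain Python. It is not the multi-local fusion mechanism proposed in the thesis, whose details are not given here. The names `soft_attention`, `region_features`, and `hidden_state` are hypothetical, and dot-product scoring is an assumption chosen for brevity.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, hidden_state):
    """Generic soft attention (illustrative, not the thesis's mechanism).

    region_features: list of feature vectors, one per image region
                     (for example, rows of a CNN feature map).
    hidden_state:    current decoder state, same length as each feature.
    Returns (weights, context): one weight per region, and the
    weighted average of the region features.
    """
    # Assumed scoring function: dot product of each region with the state.
    scores = [
        sum(f * h for f, h in zip(region, hidden_state))
        for region in region_features
    ]
    weights = softmax(scores)

    # Context vector: attention-weighted sum of the region features.
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy input: three 2-dimensional region features and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = soft_attention(regions, hidden)
print(weights, context)
```

In an encoder-decoder captioning model of the kind described above, a context vector like this would typically be fed to the LSTM decoder at each step, so that each generated word can draw on the image regions most relevant to it.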