Chinese image captioning is a cross-disciplinary research problem. Its essence is to enable a computer to automatically generate a descriptive Chinese sentence for an image. This is easy for humans but challenging for machines: the computer must extract information such as object attributes, spatial relations, and semantic relationships from the image, and then generate a human-readable sentence that expresses the image content accurately and fluently. In the Chinese image caption task, neural-network-based methods have become the mainstream approach. Most of them adopt an encoder-decoder structure: a convolutional neural network (CNN) acts as the encoder and is responsible for extracting visual features from the image, while a recurrent neural network (RNN) acts as the decoder and is responsible for generating the sentence.

Aiming at the problems in current Chinese image captioning, this thesis carries out the following research work.

To reduce the loss of the visual features extracted by the CNN, an attention mechanism is added to the encoder-decoder structure. The attention mechanism associates each Chinese word with its corresponding image content and focuses on the image region where that content is located. Experiments show that adding the attention mechanism effectively improves all evaluation metrics.

At the same time, because of the vanishing-gradient problem of the RNN, sentence generation lacks the guidance of earlier information as the number of RNN time steps grows. This thesis therefore proposes a memory-aid method, which extracts important Chinese words and feeds them into the RNN at every step of word prediction. The name "memory aid" is inspired by the memory network used in sentence question-answering tasks. A new model that combines the memory-aid method with the attention mechanism further improves all evaluation metrics.

The experiments also compare the effects of different CNN encoders on Chinese image captioning, such as Inception-v3, Inception-v4, and Inception-ResNet-v2, as well as the effects of different RNN decoders, such as LSTM and GRU. The results show that different CNNs have significantly different impacts on the evaluation metrics.

Deep CNN models usually have many parameters and require heavy computation. To address this problem, this thesis proposes a lightweight CNN named BCNN (Bifurcate CNN), so called because it uses modules with multiple bifurcation paths. The model has 36 convolutional layers and 22,015,628 parameters in total. Compared with Inception-v3, which has 47 convolutional layers and 24,734,048 parameters, and even more so compared with Inception-v4 and Inception-ResNet-v2, which have hundreds of convolutional layers and more parameters, BCNN can be regarded as a lightweight model. Its structure follows the ideas of ResNet and Inception-v4; more importantly, a transition module is proposed in the model, whose main role is to transition from a stacked convolutional-layer module to a bifurcation module. Experiments show that the BCNN model improves all evaluation metrics.
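
The attention idea described above can be illustrated with a minimal sketch in PyTorch: at each decoding step, the decoder hidden state scores the CNN's region features and a weighted visual context is formed. All module names and dimensions here are illustrative assumptions, not the exact configuration used in this thesis.

```python
# Minimal sketch of additive attention over image region features for one decoding step.
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)      # project region features
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per region

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, feat_dim) -- CNN feature map flattened into regions
        # hidden:  (batch, hidden_dim)            -- current decoder hidden state
        energy = torch.tanh(self.proj_feat(regions) + self.proj_hidden(hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, num_regions)
        context = (weights.unsqueeze(-1) * regions).sum(dim=1)          # weighted visual context
        return context, weights

# Usage sketch: e.g. an 8x8 Inception feature map flattened into 64 regions of 2048 channels.
attn = RegionAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
context, weights = attn(torch.randn(1, 64, 2048), torch.randn(1, 512))
```

The attention weights make explicit which region the model "looks at" while predicting each Chinese word, which is how such models are usually visualized and evaluated.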
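
The memory-aid method can likewise be sketched as follows, under the assumption that a set of important Chinese (key)words is embedded, pooled into a memory vector, and given to the decoder at every prediction step together with the previous word and the visual context. This is a minimal reading of the description above, not the thesis's exact model; all names and sizes are assumptions.

```python
# Minimal sketch of one decoding step with a pooled keyword "memory" as extra input.
import torch
import torch.nn as nn

class MemoryAidDecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # decoder input = previous word embedding + visual context + pooled keyword memory
        self.gru = nn.GRUCell(embed_dim + feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, context, keyword_ids, hidden):
        word_emb = self.embed(prev_word)              # (batch, embed_dim)
        memory = self.embed(keyword_ids).mean(dim=1)  # pool the important-word embeddings
        gru_in = torch.cat([word_emb, context, memory], dim=-1)
        hidden = self.gru(gru_in, hidden)             # next hidden state
        return self.out(hidden), hidden               # logits over the Chinese vocabulary

step = MemoryAidDecoderStep(vocab_size=10000, embed_dim=256, feat_dim=2048, hidden_dim=512)
logits, h = step(torch.tensor([3]), torch.randn(1, 2048),
                 torch.tensor([[12, 57, 301]]), torch.zeros(1, 512))
```

Because the keyword memory is re-injected at every step, later words still receive guidance from the extracted important words even when the RNN's own state has lost that information.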
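
Finally, the role of the proposed transition module can be illustrated with a schematic sketch: a transition block hands a plain stacked-convolution stem over to a multi-branch ("bifurcation") module whose parallel paths are concatenated, in the spirit of ResNet/Inception-style designs. This is an illustrative interpretation only, not the BCNN architecture itself; all channel counts and kernel sizes are assumptions.

```python
# Illustrative transition from stacked convolutions to a multi-branch (bifurcation) module.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel, stride=1, padding=0):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Transition(nn.Module):
    """Bridge from a plain stacked-conv stem to a bifurcation module:
    reduces spatial resolution and sets the channel count the branches expect."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = conv_bn_relu(in_ch, out_ch, kernel=3, stride=2, padding=1)

    def forward(self, x):
        return self.reduce(x)

class Bifurcation(nn.Module):
    """Parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branch1 = conv_bn_relu(in_ch, branch_ch, kernel=1)
        self.branch3 = conv_bn_relu(in_ch, branch_ch, kernel=3, padding=1)
        self.branch5 = nn.Sequential(
            conv_bn_relu(in_ch, branch_ch, kernel=1),
            conv_bn_relu(branch_ch, branch_ch, kernel=3, padding=1),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

stem = nn.Sequential(conv_bn_relu(3, 32, 3, stride=2, padding=1),
                     conv_bn_relu(32, 64, 3, padding=1))
block = nn.Sequential(Transition(64, 128), Bifurcation(128, 64))
features = block(stem(torch.randn(1, 3, 224, 224)))  # -> (1, 192, 56, 56)
```

The point of the sketch is the hand-off: the transition stage normalizes resolution and channels so that the subsequent bifurcation paths can run in parallel and be concatenated without further adjustment.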