This paper studies the problem of generating region-level captions for video clips from a training set with only video-level annotations. The dense video captioning model consists of three parts: the visual model, which processes the video frames to produce region feature maps; the region sequence generation part, which selects suitable regions from each frame and assembles them into region sequences (introduced in Chapter 6, Video Captioning Model); and the language model, which encodes the region features and outputs sentence captions.

First, object detection with Faster R-CNN is analyzed. Based on the Faster R-CNN and VGG-16 architectures, a convolutional neural network is designed as the visual model. Two defects of the RoI Pooling layer in Faster R-CNN, which produces the fixed-size region feature vectors, are identified: first, precision is lost because the RoI coordinates are quantized when the region is divided into a fixed-size grid on the feature map; second, gradients cannot be propagated with respect to the region coordinates. Replacing the RoI Pooling layer with a bilinear sampling layer solves both problems: the bilinear sampling layer constructs a sampling grid of the coordinates at which the RoI maps onto the feature map and uses bilinear interpolation at these non-integer coordinates, so no precision is lost and the operation is differentiable. Pre-training the improved Faster R-CNN yields a pre-trained model that later serves as the basis of the visual part of the dense video captioning model.

A Lexical R-CNN visual model is then designed on top of the improved Faster R-CNN. The visual model uses multi-instance multi-label learning: the set of regions obtained by passing each frame of a video clip through Faster R-CNN is treated as a bag of instances, the lexical tags extracted from the sentence annotations are used as the label set, and training continues from the Faster R-CNN pre-trained model. Through this multi-instance multi-label learning, the sentence annotations are weakly associated with the dense regions of the video clip; that is, after training, the lexical information contained in the sentence annotations is tied to the region features.

With the number of candidate regions sampled in Faster R-CNN set to B = 16, each frame of a video clip yields 16 regions. Generating multiple suitable region sequences from these regions is difficult: since a video clip consists of many frames, each with 16 regions, a 30-frame clip has up to 16^30 possible region sequences, an enormous search space. This paper casts region sequence generation as a subset selection problem: a submodular objective function is constructed, the selection is performed by submodular maximization, and the CELF greedy algorithm carries out the subset selection, thereby generating multiple region sequences. During training, the generated region sequences must be associated with the sentence annotations; a winner-takes-all scheme selects, for each region sequence, the sentence annotation that best describes its content, so that each video clip yields multiple (16) region-sequence/sentence-annotation pairs. This part is also introduced in Chapter 6.
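To make the role of the bilinear sampling layer concrete, the following is a minimal NumPy sketch (an illustration, not the implementation used in this work): it builds a sampling grid over an RoI in feature-map coordinates and interpolates the feature map at the resulting non-integer positions to produce a fixed-size output, avoiding the coordinate quantization of RoI Pooling.

```python
import numpy as np

def bilinear_roi_sample(feature_map, roi, out_size=7):
    """Sample a fixed-size (out_size x out_size) patch from `feature_map`
    for the region `roi` = (x1, y1, x2, y2) given in feature-map coordinates,
    using bilinear interpolation instead of quantized RoI Pooling.
    feature_map: array of shape (H, W, C)."""
    H, W, C = feature_map.shape
    x_lo, y_lo, x_hi, y_hi = roi
    # Sampling grid: out_size x out_size non-integer coordinates inside the RoI.
    xs = np.linspace(x_lo, x_hi, out_size)
    ys = np.linspace(y_lo, y_hi, out_size)
    out = np.zeros((out_size, out_size, C), dtype=feature_map.dtype)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # The four integer grid points surrounding the sampling point (x, y).
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            wx, wy = x - x0, y - y0
            # Bilinear interpolation: distance-weighted sum of the four neighbours.
            out[i, j] = ((1 - wx) * (1 - wy) * feature_map[y0, x0]
                         + wx * (1 - wy) * feature_map[y0, x1]
                         + (1 - wx) * wy * feature_map[y1, x0]
                         + wx * wy * feature_map[y1, x1])
    return out

# Example: a 14x14 feature map with 512 channels and one RoI.
fmap = np.random.rand(14, 14, 512).astype(np.float32)
pooled = bilinear_roi_sample(fmap, roi=(2.3, 1.7, 9.8, 11.2), out_size=7)
print(pooled.shape)  # (7, 7, 512)
```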
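The multi-instance multi-label training of the Lexical R-CNN described above can also be sketched. The snippet below is a hypothetical PyTorch illustration, not the paper's formulation: it assumes max pooling over the region instances to obtain a bag-level score per lexical tag, trained with binary cross-entropy against the frame-level tag vector.

```python
import torch
import torch.nn.functional as F

def miml_loss(region_logits, bag_labels):
    """Multi-instance multi-label loss sketch.
    region_logits: per-region lexical-tag scores for one frame, shape (num_regions, num_tags).
    bag_labels:    frame-level (bag-level) 0/1 tag vector, shape (num_tags,).
    The bag score for each tag is the max over its region instances
    (a common MIL pooling choice, assumed here) with a binary cross-entropy loss."""
    bag_logits, _ = region_logits.max(dim=0)   # instance-to-bag pooling
    return F.binary_cross_entropy_with_logits(bag_logits, bag_labels)

# Toy usage: 16 regions per frame, 1000 lexical tags extracted from the sentence annotations.
scores = torch.randn(16, 1000)
labels = torch.zeros(1000)
labels[[3, 42, 97]] = 1.0                      # tags present in the frame's sentence
print(miml_loss(scores, labels))
```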
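The region-sequence selection by submodular maximization can be illustrated with the CELF (lazy greedy) procedure named above. The sketch below uses a toy coverage-style score in place of the paper's submodular objective; the key point is that, by submodularity, marginal gains only shrink as the selected set grows, so stale gains stored in a max-heap act as upper bounds and most re-evaluations can be skipped.

```python
import heapq

def celf_select(candidates, score_fn, k):
    """CELF lazy-greedy maximization of a monotone submodular set function.
    candidates: list of items (e.g. the candidate regions of a frame).
    score_fn(S): submodular score of a set S of items.
    k:           number of items to select."""
    selected = []
    current = score_fn(frozenset())
    # Max-heap of (-marginal_gain, item); initial gains are valid upper bounds.
    heap = [(-(score_fn(frozenset([c])) - current), c) for c in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, item = heapq.heappop(heap)
        # Re-evaluate the (possibly stale) gain against the current selection.
        gain = score_fn(frozenset(selected + [item])) - current
        if not heap or gain >= -heap[0][0]:
            # Lazy test passed: no remaining item can have a larger gain, so take it.
            selected.append(item)
            current += gain
        else:
            # Otherwise push the item back with its refreshed (smaller) gain.
            heapq.heappush(heap, (-gain, item))
    return selected

# Toy usage: pick 4 of 16 per-frame regions under a coverage-style objective.
regions = list(range(16))
concepts = {r: {r % 5, (r * 3) % 7 + 5} for r in regions}   # hypothetical concepts per region
score = lambda S: len(set().union(*(concepts[r] for r in S))) if S else 0
print(celf_select(regions, score, k=4))
```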
After the region features and region sequences are obtained, they are fed into the language model to generate sentence captions. The language model is an LSTM with an encoder-decoder structure. However, a unidirectional LSTM only considers how the preceding context affects the following words and ignores the fact that later information also influences the choice of earlier words, so a bidirectional LSTM replaces the unidirectional LSTM in the encoder (a minimal sketch of such an encoder-decoder is given at the end of this section).

Experiments show that the Lexical R-CNN visual model based on Faster R-CNN performs well on dense video captioning and achieves high AP. However, because Faster R-CNN is a two-stage object detection method, its high detection accuracy comes with an inherently slow detection speed. Experimental results show that the dense video captioning model does not meet the requirements of real-time caption generation. Applying a one-stage object detection method to the dense video captioning model is left to future work, with the aim of achieving high accuracy while maintaining speed.
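As referenced above, here is a minimal sketch of an encoder-decoder captioner with a bidirectional LSTM encoder; it is an illustrative PyTorch module with hypothetical dimensions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class BiLSTMCaptioner(nn.Module):
    """Sketch of an encoder-decoder captioner: a bidirectional LSTM encodes the
    sequence of region features, a unidirectional LSTM decodes word tokens."""
    def __init__(self, feat_dim=4096, hidden=512, vocab_size=10000, embed=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, embed)
        # The decoder consumes each word embedding concatenated with the encoder summary.
        self.decoder = nn.LSTM(embed + 2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (batch, seq_len, feat_dim) region-sequence features.
        enc_out, _ = self.encoder(region_feats)          # (batch, seq_len, 2*hidden)
        context = enc_out.mean(dim=1, keepdim=True)      # simple mean-pooled summary
        emb = self.embed(captions)                       # (batch, cap_len, embed)
        context = context.expand(-1, emb.size(1), -1)    # repeat for every word step
        dec_out, _ = self.decoder(torch.cat([emb, context], dim=-1))
        return self.out(dec_out)                         # word logits per time step

# Toy forward pass: 2 clips, 16-region sequences, 5-word captions.
model = BiLSTMCaptioner()
logits = model(torch.randn(2, 16, 4096), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])
```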