With the rapid development of Internet technology and the rising level of public consumption, short video platforms have grown at an unprecedented pace. A huge volume of videos is produced every day, and the content is diverse and of uneven quality, so the demand for content auditing on these platforms is becoming increasingly pressing. Manual review is time-consuming and labor-intensive, and must therefore be combined with video description technology to analyze videos automatically and intelligently. Most existing short video description methods are based on fusing static and dynamic video features and fail to mine the rich information a video contains. To address these issues, this paper proposes a multi-view feature extraction method that interprets a video from multiple perspectives and extracts the key information that is effective for video description models. A fusion method based on attribute semantic information is also proposed to fuse the extracted multimodal information into a joint representation while reducing the interference between modalities. Together, these methods can improve the review efficiency of short video platforms and facilitate video content management. The specific contributions are as follows.

(1) A video multi-view feature extraction method is proposed. Starting from a global-local view, entities, actions, and the logic connecting them are treated as the local, short-term, and long-term perspectives, respectively, and the scene features, object features, action features, and key-frame text semantic features of the video are extracted so that its rich information is fully considered. Experimental results show that, compared with the baseline model, the method improves the CIDEr metric by 4.3%, demonstrating stronger mining of video information than comparable models.

(2) Building on these multi-view features, a multimodal fusion method based on attribute semantic information is proposed. The method applies an attention mechanism to combinations of the different modal features to generate noun and verb attribute semantic information, builds an attribute detector that converts this information into higher-order attribute semantics, and then embeds the higher-order attribute semantics into the LSTM weight matrix during decoding to guide the generation of description sentences and improve the accuracy of video description. Experimental results show that the model improves the CIDEr metric by 8.8% and 3.6% compared with the conventional multimodal feature concatenation method and the attention-based feature fusion method, respectively.
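To make contribution (1) concrete, the following is a minimal sketch of a multi-view extractor. The abstract does not name specific backbones, so the choices here are assumptions: ResNet-50 for per-frame scene and object appearance, R3D-18 for clip-level action features, and a simple embedding-plus-GRU encoder standing in for the key-frame text (e.g., OCR token) semantics. Deriving object features from the same global frame vector is likewise a simplification of whatever detector the thesis actually uses.

```python
# Hypothetical sketch of multi-view feature extraction (contribution 1).
# Backbones and dimensions are illustrative assumptions, not the thesis's exact models.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.video import r3d_18

class MultiViewExtractor(nn.Module):
    def __init__(self, dim=512, vocab=10000):
        super().__init__()
        # Local view: per-frame scene/object appearance from a 2D CNN.
        cnn = resnet50(weights=None)
        self.frame_encoder = nn.Sequential(*list(cnn.children())[:-1])  # -> 2048-d
        self.scene_proj = nn.Linear(2048, dim)
        self.object_proj = nn.Linear(2048, dim)   # stand-in for detector features
        # Short-term view: clip-level motion/action from a 3D CNN.
        r3d = r3d_18(weights=None)
        self.action_encoder = nn.Sequential(*list(r3d.children())[:-1])  # -> 512-d
        self.action_proj = nn.Linear(512, dim)
        # Long-term view: key-frame text semantics (e.g., OCR tokens),
        # encoded with an embedding + GRU as a stand-in text encoder.
        self.token_emb = nn.Embedding(vocab, dim)
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frames, clips, ocr_tokens):
        # frames: (B, T, 3, H, W)   clips: (B, 3, T', H', W')   ocr_tokens: (B, L)
        B, T = frames.shape[:2]
        f = self.frame_encoder(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        f = f.view(B, T, -1)
        scene = self.scene_proj(f)                               # (B, T, dim)
        objects = self.object_proj(f)                            # (B, T, dim)
        a = self.action_encoder(clips).flatten(1)                # (B, 512)
        action = self.action_proj(a)                             # (B, dim)
        _, h = self.text_encoder(self.token_emb(ocr_tokens))
        text = h.squeeze(0)                                      # (B, dim)
        return scene, objects, action, text
```

The four returned tensors correspond to the scene, object, action, and key-frame text views; any of the stand-in encoders can be swapped for stronger pretrained models without changing the interface.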
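Contribution (2) can likewise be sketched under stated assumptions. Below, multi-head attention fuses the stacked modal features, an attribute detector predicts noun/verb attribute probabilities, and those probabilities are projected into a gating vector that rescales the decoder LSTM's input at every step. This gating is one plausible reading of "embedding higher-order attribute semantics into the LSTM weight matrix" (in the spirit of semantic compositional networks); the attribute vocabulary size, dimensions, and class name AttributeFusionDecoder are all hypothetical.

```python
# Hypothetical sketch of attribute-semantic multimodal fusion (contribution 2).
import torch
import torch.nn as nn

class AttributeFusionDecoder(nn.Module):
    def __init__(self, dim=512, n_attrs=300, vocab=10000):
        super().__init__()
        # Attention over the stacked scene/object/action/text features.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Attribute detector: probabilities over noun/verb attributes.
        self.detector = nn.Sequential(nn.Linear(dim, n_attrs), nn.Sigmoid())
        # Higher-order attribute semantics as a gate on the LSTM input
        # (an assumed stand-in for factoring the LSTM weight matrix).
        self.attr_gate = nn.Linear(n_attrs, dim)
        self.word_emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, modal_feats, captions):
        # modal_feats: (B, M, dim) stacked multi-view features
        # captions:    (B, L) ground-truth token ids (teacher forcing)
        B = modal_feats.size(0)
        query = modal_feats.mean(dim=1, keepdim=True)
        fused, _ = self.attn(query, modal_feats, modal_feats)    # (B, 1, dim)
        fused = fused.squeeze(1)                                 # (B, dim)
        attrs = self.detector(fused)                             # (B, n_attrs)
        gate = torch.sigmoid(self.attr_gate(attrs))              # (B, dim)
        h = torch.zeros(B, fused.size(1), device=fused.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            w = self.word_emb(captions[:, t]) * gate  # attribute-modulated input
            h, c = self.lstm(torch.cat([w, fused], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1), attrs  # (B, L, vocab), (B, n_attrs)
```

Returning the attribute probabilities alongside the caption logits allows an auxiliary attribute-detection loss to be added during training, which is a common design choice for this kind of semantics-guided decoder.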