| Video object retrieval is a relatively complex field in computer vision that aims to determine automatically whether a specific object appears in a video. As surveillance systems become more widespread, retrieving video manually is increasingly time-consuming and labor-intensive, so intelligent video object retrieval has become extremely important. Early video retrieval methods, such as keyword tagging and content-based approaches, were complex and inefficient. In recent years, the development of deep learning has greatly advanced research on video object retrieval. From the perspective of video structuring, this paper applies structured processing to surveillance video, extracts the structured information it contains to build a corresponding database, and retrieves objects of interest in the surveillance video from that database. The structured processing focuses mainly on pedestrians and is divided into two parts: the facial feature information of pedestrians, and pedestrian attribute information such as gender, age group, and clothing. This paper studies both parts in depth. Pedestrian faces in surveillance video vary in scale: faces near the camera appear large and distant faces appear small. To better detect faces of different sizes, a face detection algorithm, MB-MTCNN (Multi-branch Multitask Convolutional Neural Networks), is proposed on the basis of MTCNN (Multi-task Cascaded Convolutional Networks). Adding a multi-branch module and a dilated convolution module gives the same layer of the model several different receptive fields, so that it extracts information at different scales and better combines the global and local features of the image. Experimental results show that the proposed method effectively improves detection accuracy. For pedestrian attribute recognition, where attributes are diverse and unevenly distributed, this paper proposes ConvNeXt-AM (ConvNeXt with an Attention Mechanism). To make effective use of the correlation between pedestrian attributes, ConvNeXt, which has excellent representation ability, is used as the backbone network, and ECA (Efficient Channel Attention) is added to the backbone to realize local cross-channel information interaction, extract the dependencies between channels, and enable the model to learn the correlations between pedestrian attributes. Experiments verify that, for face detection, the proposed method improves accuracy to varying degrees over the original method on different datasets and detects faces in surveillance video more reliably. For pedestrian attribute recognition, the proposed method reaches accuracies of 77.58% and 77.82% on the PETA and PA100K datasets, respectively, which is within the acceptable range and allows pedestrian attribute information to be identified more accurately. Finally, for an actual surveillance video scene, the MB-MTCNN face detection method and the ConvNeXt-AM pedestrian attribute recognition method are used to extract the structured information in the surveillance video and build a corresponding structured information database. With either an image or a text query as input, targets of interest in the surveillance video can then be retrieved rapidly. Experimental results show that the video object retrieval scheme based on video structuring is highly feasible, effectively improves the efficiency of video object retrieval, and reduces the cost of video storage resources. |
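
The text-query retrieval step described above can be illustrated with a minimal sketch. It assumes the structured information is stored as one row per detected pedestrian in a relational table. The table name `pedestrians`, its columns, the sample rows, and the helper names `build_db` and `search` are hypothetical and chosen only for illustration; the abstract does not specify the database design.

```python
import sqlite3

# Hypothetical schema: one row per detected pedestrian. Column names and
# values are illustrative assumptions, not the schema used in the paper.
def build_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pedestrians ("
        "id INTEGER PRIMARY KEY, video TEXT, frame INTEGER, "
        "gender TEXT, age_group TEXT, upper_clothing TEXT)"
    )
    conn.executemany(
        "INSERT INTO pedestrians "
        "(video, frame, gender, age_group, upper_clothing) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn

# Only these attribute columns may be used as filters.
ALLOWED = {"gender", "age_group", "upper_clothing"}

def search(conn, **attrs):
    """Return (video, frame) pairs whose stored attributes match all filters."""
    unknown = set(attrs) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported attributes: {sorted(unknown)}")
    # Column names come from the whitelist; values are bound as parameters.
    where = " AND ".join(f"{name} = ?" for name in attrs) or "1"
    sql = (
        "SELECT video, frame FROM pedestrians "
        f"WHERE {where} ORDER BY video, frame"
    )
    return conn.execute(sql, tuple(attrs.values())).fetchall()

# Made-up sample records standing in for extracted structured information.
rows = [
    ("cam01.mp4", 120, "female", "adult", "jacket"),
    ("cam01.mp4", 480, "male", "adult", "t-shirt"),
    ("cam02.mp4", 75, "female", "senior", "jacket"),
]
conn = build_db(rows)
hits = search(conn, gender="female", upper_clothing="jacket")
```

Because a query of this kind touches only the compact attribute records rather than the video frames, it gives a rough sense of why retrieval over structured information can be fast. Image-based queries would additionally require comparing face features, which this sketch does not cover.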