| With growing security demands and the rapid development of Internet technology, the image and video data generated by surveillance cameras, film and TV media, social networks, etc., have grown explosively, making them very challenging to process and analyze. The "person" is the core object of image and video processing and analysis. In recent years, person-centric tasks such as face recognition, person re-identification (Re-ID), person search, and action recognition have received extensive attention. Existing research has made encouraging progress in ideal experimental scenarios, but most of the related technologies are still far from practical. Therefore, effectively recognizing and analyzing people in real scenes is crucial to promoting the deployment and adoption of related applications, and is of great theoretical and practical significance. Toward improving the intelligence of image and video processing for person analysis, this thesis explores the recognition and analysis of person identity and action in real scenes. The main contributions of this thesis are as follows: (1) To address the low resolution, loss of spatio-temporal information, and lack of full frames in existing Re-ID datasets, this thesis collects and labels a large-scale video-based pedestrian dataset, named Campus4K, based on a 4K video surveillance network in real scenes. In contrast to traditional datasets that mainly consist of image or video data, Campus4K not only provides clearer videos of people, but also retains full frames with timestamp information and records the spatial distribution of the different camera views in the real world. Campus4K thus provides a variety of content, including visual data, spatio-temporal information, etc., which is very close to real-world scenarios, and it is one of the data foundations of this thesis. (2) To address the performance degradation caused by the lack of training data in real scenes, this thesis proposes an unsupervised 
person Re-ID method based on raw videos and spatio-temporal constraints. Traditional unsupervised Re-ID methods learn from person images cropped from bounding boxes. Although the identity labels are removed, the data distribution still differs from that of real scenes. Based on the Campus4K dataset with full frames, this thesis trains the model on person tracklets automatically generated by detection and tracking algorithms, and uses the spatio-temporal information to select reliable positive matching pairs from noisy data, which significantly improves the quality of the training data and the final performance of unsupervised person Re-ID. The unsupervised learning framework based on raw videos makes full use of the massive data captured by the surveillance network and has broad application prospects in real-world scenarios. (3) For the diverse retrieval needs and complex application scenarios of the real world, this thesis explores person search by portrait based on multi-cue joint retrieval and person Re-ID based on multi-cue information ensemble. Person search by portrait aims to find a person in a large-scale database, where the face is not necessarily visible, using only a portrait; it is a special case of person Re-ID and is in great demand in the real world. This thesis first uses the face information to explore the database iteratively, and then combines it with the appearance of the body, the mutual exclusivity of person identity, etc., to improve the overall performance of portrait search. In addition, because of the limited clarity of existing datasets, directly introducing relatively poor face information into person Re-ID yields insignificant improvement for existing methods and can even degrade performance. Considering the rich information contained in the relationships among people in the database, this thesis fuses multi-cue information of face and body based on a graph model and a graph convolutional network. Experiments 
on CG-rendered high-definition data and the proposed Campus4K show that, as clarity improves, the proposed fusion model is able to capture effective information from weaker face cues to assist person Re-ID and improve the performance of Re-ID methods. (4) The work above mainly focuses on the recognition and analysis of person identities, which amounts to passively searching for a person after an incident. The last part of this thesis focuses on spatio-temporal action detection and proposes an action detection method based on local-context cross attention, which actively detects and recognizes the actions of people in videos. Understanding people's actions is inseparable from understanding their surroundings. In this thesis, a Transformer-based cross-attention network is used to model the relations between people and the local context area, which improves the accuracy of action recognition and detection while reducing computation compared with global relation modeling. Experiments show that the proposed action detection model is robust to common problems in real scenes such as small targets, fast movement, and background clutter. In conclusion, this thesis focuses on the recognition and analysis of person identity and action, and is dedicated to improving the automation and intelligence of image and video processing. Meanwhile, this thesis presents novel solutions to the problems of difficult data annotation, diverse retrieval needs, complex application environments, etc., in real-world scenarios, which helps promote the deployment and adoption of related technologies in real scenes. |
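
The spatio-temporal screening of positive pairs mentioned in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Tracklet` fields, the `transit` table of plausible travel-time windows between cameras, the cosine-similarity test, and the threshold value are all hypothetical choices made only to show the idea that a cross-camera pair is kept when it is both visually similar and physically plausible in time.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt


@dataclass
class Tracklet:
    """Hypothetical tracklet record produced by detection and tracking."""
    tid: int
    camera: str
    t_start: float   # timestamp of the first frame, in seconds
    t_end: float     # timestamp of the last frame, in seconds
    feature: tuple   # appearance embedding (assumed to be precomputed)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_positive_pairs(tracklets, transit, sim_threshold=0.8):
    """Keep cross-camera pairs that pass both a temporal and a visual check.

    `transit` maps (from_camera, to_camera) to an assumed (min, max)
    travel time in seconds; pairs without an entry are discarded.
    """
    pairs = []
    for a, b in combinations(tracklets, 2):
        if a.camera == b.camera:
            continue  # only cross-camera matches are considered here
        # Order the pair by the time each tracklet ends.
        first, second = (a, b) if a.t_end <= b.t_end else (b, a)
        window = transit.get((first.camera, second.camera))
        if window is None:
            continue
        gap = second.t_start - first.t_end
        lo, hi = window
        if not (lo <= gap <= hi):
            continue  # too fast or too slow to be the same person
        if cosine(a.feature, b.feature) >= sim_threshold:
            pairs.append((first.tid, second.tid))
    return pairs


tracklets = [
    Tracklet(0, "gate", 0.0, 10.0, (1.0, 0.0)),
    Tracklet(1, "library", 40.0, 55.0, (0.9, 0.1)),    # plausible and similar
    Tracklet(2, "library", 12.0, 20.0, (0.95, 0.05)),  # similar but too soon
    Tracklet(3, "library", 45.0, 60.0, (0.0, 1.0)),    # plausible but dissimilar
]
transit = {("gate", "library"): (20.0, 120.0)}
pairs = select_positive_pairs(tracklets, transit)
```

In this toy example only tracklets 0 and 1 survive both checks. In practice the travel-time windows would have to be derived from the recorded camera layout and timestamps, and the similarity threshold tuned on the data.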