| Multi-object tracking (MOT) aims to find all objects of interest (e.g., pedestrians, vehicles) in video sequences and predict their trajectories. It has been a longstanding topic in computer vision, since many higher-level tasks such as autonomous driving, human-computer interaction and video surveillance are built upon it. Owing to progress in object detection, Tracking by Detection (TBD) and Joint Detection and Tracking (JDT) have become the main paradigms for MOT. TBD first uses a separate object detector to locate objects in video frames, then uses separate appearance, motion and interaction models to extract feature representations for similarity measurement. Based on these similarities, the data association model predicts the corresponding trajectories. In this setting, however, occlusions among objects, pose and illumination variations, and detector errors (false positives and false negatives) in complex scenes usually have a seriously negative effect on feature modeling, similarity measurement and data association; it is therefore necessary to find more robust appearance models and data association methods to alleviate these challenges. Compared with TBD, JDT unifies the sub-models for detection, appearance/motion, etc. into a single network, thus achieving better training and inference efficiency. However, it not only faces the classical challenges but also needs to deal with the difficulties brought by multi-task learning. Moreover, JDT networks are usually deeper, which slows down inference and should be improved. In this thesis, a series of innovative methods is proposed to overcome the aforementioned key challenges. The main contributions are as follows: 1. Enhancing appearance modeling in multi-object tracking via a neighbor graph. The appearance features of individual targets are susceptible to negative factors such as occlusions and illumination variations. To remedy this, this work proposes to make full use of neighboring information. The
motivation is that people tend to move in groups. As such, when an individual target's appearance changes remarkably, an observer can still identify it from its neighbor context. To model the contextual information from neighbors, this work first utilizes the spatio-temporal relations among trajectories to efficiently select suitable neighbors for targets. Subsequently, the proposed neighbor graph is constructed for each target and its corresponding neighbors, and graph convolutional networks (GCNs) are employed to model their relations and aggregate features. Finally, standardized evaluations on several benchmark datasets demonstrate the effectiveness of the proposed method. 2. An optimization framework for tracklet re-identification in multi-object tracking. This framework runs on the tracking results of other trackers and aims to re-identify the tracklets separated by occlusions and missing detections. Specifically, the tracklets and their inter-relations are cast into a multi-label energy function, whose optimal solution assigns a unique label to the tracklets from the same target. The proposed framework employs the α-expansion algorithm to minimize the energy function and innovatively introduces a label cost mechanism to reduce the number of labels, thereby achieving better re-identification. Moreover, an appearance model using a spatial transformer network is proposed to improve appearance features, and a hierarchical clustering method for labels is proposed to further boost tracklet matching. Evaluations on the MOTChallenge benchmark show that the proposed framework is generic and can improve existing trackers significantly. 3. A novel label assignment and loss function for the joint training of detection and re-identification in multi-object tracking. Previous works usually use the label assignment and loss function from object detection to perform the joint training of detection and re-identification (Re-ID). However, these practices make the training biased toward detection because
they ignore the characteristics of the Re-ID task and are likely to produce ambiguous assignments, i.e., the same positives shared by different ground-truth objects. To remedy this, this thesis first proposes an identity-aware label assignment, which jointly considers the assignment costs of detection and Re-ID to select positive samples for each instance without ambiguity. Moreover, this thesis proposes a novel discriminative Focal Loss that integrates Re-ID predictions with Focal Loss to focus the training on discriminative samples. Evaluations on three benchmarks, MOT16, MOT17 and MOT20, show that the proposed techniques can effectively alleviate the bias problem in joint training and significantly improve tracking performance. 4. Compressing the multi-object tracking model via knowledge distillation. Recent multi-object tracking methods usually use very deep neural networks to achieve competitive accuracy, inevitably resulting in degraded inference speed. To strike a better balance between tracking accuracy and speed, this study first proposes to compress the MOT model via knowledge distillation (KD), i.e., enabling a more lightweight student network to obtain performance similar to that of the teacher network. In particular, during knowledge distillation for feature learning, spatial attention is adopted to guide the student network to focus on the foreground of the feature map. More importantly, this study innovatively proposes to model the difference between teacher and student in terms of spatial and channel attention, and uses these difference cues to enable better distillation. In addition, knowledge from the teacher network is used to construct a foreground mask, which reduces the negative effects of low-quality soft labels on KD. Evaluations on several benchmarks indicate that the proposed KD method enables the student network to achieve leading performance while running 20.0% to 27.4% faster than the teacher network and using 28.5% to 87.1% fewer parameters. |
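
The role of the label cost in the second contribution can be illustrated with a toy multi-label energy. The sketch below is not the thesis's formulation: the function names, the cost values and the simple "pay a penalty when two linked tracklets receive different labels" pairwise term are all assumptions made for illustration, and an exhaustive search stands in for α-expansion, which is only practical here because the example has two tracklets and two labels.

```python
import itertools
from typing import Dict, List, Sequence, Tuple


def labeling_energy(
    labels: Sequence[int],
    data_cost: List[Dict[int, float]],
    pair_cost: Dict[Tuple[int, int], float],
    label_cost: float,
) -> float:
    """Energy of one labeling: data term + pairwise term + label cost."""
    # Data term: cost of assigning tracklet i the label it currently holds.
    energy = sum(data_cost[i][lab] for i, lab in enumerate(labels))
    # Pairwise term (assumed form): penalty when linked tracklets disagree.
    for (i, j), cost in pair_cost.items():
        if labels[i] != labels[j]:
            energy += cost
    # Label cost: a fixed price for every distinct label that is in use.
    energy += label_cost * len(set(labels))
    return energy


def best_labeling(
    candidates: Sequence[int],
    data_cost: List[Dict[int, float]],
    pair_cost: Dict[Tuple[int, int], float],
    label_cost: float,
) -> Tuple[int, ...]:
    """Exhaustive minimizer; a stand-in for alpha-expansion on toy inputs."""
    return min(
        itertools.product(candidates, repeat=len(data_cost)),
        key=lambda f: labeling_energy(f, data_cost, pair_cost, label_cost),
    )


# Hypothetical costs for two tracklets and two candidate identity labels.
data_cost = [{0: 0.1, 1: 0.4}, {0: 0.3, 1: 0.2}]
pair_cost = {(0, 1): 0.05}

without_label_cost = best_labeling([0, 1], data_cost, pair_cost, label_cost=0.0)
with_label_cost = best_labeling([0, 1], data_cost, pair_cost, label_cost=0.2)
```

With these assumed numbers, the minimizer keeps the two tracklets under separate labels when the label cost is zero and merges them under one label once each label in use carries a price, which is the effect the abstract attributes to the label cost mechanism.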