| Visual object tracking is a crucial research direction in computer vision. It has been widely used in many practical scenarios, such as video surveillance, human-computer interaction, intelligent transportation, intelligent diagnosis, ocean exploration, battlefield reconnaissance, and so forth. In visual tracking, the position and size of the target are marked in the first frame of a given video, either manually or by a detection algorithm, and a tracking algorithm then predicts its position and size in each subsequent frame. Although visual tracking algorithms have been studied for many years, some problems still need to be studied in depth: (1) Weak target representation. To model the appearance of complex and diverse targets accurately, an appropriate feature representation must be chosen. On the one hand, different representations should be selected for different types of targets, because they usually possess different characteristics. On the other hand, the representation of a given target should adapt if its appearance changes continually during its movement. However, most existing visual object tracking algorithms use a single feature or a simple fusion of several features, which results in target features with weak representational ability. (2) Simple model updating strategies. The appearance of a target generally changes continually in a video, and may change dramatically between two adjacent frames. To adapt to such changes, the observation model of a tracking algorithm should be updated. If the model is updated at every frame, the computational burden increases greatly, and the model may also be contaminated and degraded in some situations, such as object occlusion. In contrast, if the model is not updated for a long time, the tracking algorithm cannot adapt to rapid changes in target appearance. The two updating strategies, “update per frame” and 
“no update”, are employed by most existing visual tracking algorithms, and neither meets the requirements of robust object tracking. In this paper, we study discriminative visual tracking from two perspectives: deep feature representation of the object and the model updating strategy. The main work is summarized as follows: (1) A deep feature channel selection method is proposed to solve the problem of weak object feature representation in the correlation filtering framework. Multi-channel features are pruned according to the ratio between the average feature energy in the target salient region and that in the search area, i.e., invalid and interfering channels are removed. As a result, both the accuracy and the speed of tracking are improved. A ResNet is introduced to extract and fuse target features, addressing the weak representational ability of hand-crafted features. A DenseNet is also used to extract target features, owing to the advantage of its deeper layers; specifically, target features are obtained from a specific layer of the DenseNet. To sum up, pruning invalid channels from deep features through channel selection improves both the effectiveness of the target feature channels and the expressive ability of the target features. (2) A method combining game-based fusion of multiple features with high-confidence model updating is proposed to solve the problems of poor fusion of multiple target features and failure to update the model in time in the correlation filtering framework. The method uses a multi-expert system to construct multiple feature combinations, selects the two most critical ones, and then fuses them by game theory. This improves the quality of multiple feature fusion and thus yields robust fused features. A novel tracking quality evaluation index is proposed, and based on this index an effective and timely model updating strategy is designed. In summary, game-based fusion of complementary features addresses the limitation of relying on a single hand-crafted or deep feature, and construction 
of multiple feature combinations by a multi-expert system, followed by selection of the best two, gives full play to the advantages of HOG, CN, and deep features. Therefore, the effect of multiple feature fusion is improved and the expressive ability of the fused features is enhanced. (3) A method combining a lightweight spatial attention mechanism with connected-domain template updating is proposed to solve the problems of weak target representation in the backbone network and failure of target template updating in the Siamese network (SiamFC) framework. Based on SiamFC, the method replaces the feature extraction backbone AlexNet with a deeper VGG-19 and appends a newly designed lightweight spatial attention module (LSAM) after its template branches. Meanwhile, the connected-domain template updating strategy is used. As a result, the feature extraction ability of the backbone network is enhanced and the target template is updated selectively. Non-local attention and channel attention, as well as global context attention and coordinate attention, are connected sequentially to the tail of the backbone network. They help the network focus on the target area and adapt to significant changes in target appearance. The double template strategy solves the problem that the target template cannot be updated. In addition, the use of the LSAM and the connected-domain template updating strategy better handles both weak target representation in the backbone network and failure of target template updating. In this paper, we mainly study discriminative visual tracking algorithms with better tracking performance from two aspects: feature representation and model updating. Experimental results on several benchmark datasets show that the proposed methods can address many challenges of object tracking in video sequences, such as scale variation, illumination variation, object 
occlusion, deformation, background clutter, low resolution, and the like. Our algorithms achieve continuous and stable tracking in complex environments, and thus further promote the application of visual tracking in real-world scenes. |
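
The channel selection idea in contribution (1) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract only states that channels are pruned by the ratio of average feature energy in the target salient region to that in the search area. The sketch below assumes that energy means squared activation, that the salient region is given as a boolean mask, and that a fixed fraction of channels is kept. The function name `select_channels` and the parameters `keep_ratio` and `eps` are hypothetical.

```python
import numpy as np


def select_channels(features, target_mask, keep_ratio=0.5, eps=1e-12):
    """Rank feature channels by a target-to-search energy ratio (illustrative only).

    features:    array of shape (C, H, W), deep features over the search area.
    target_mask: boolean array of shape (H, W), True inside the target region
                 (assumed here to stand in for the "target salient region").
    keep_ratio:  fraction of channels to keep (an assumption; the abstract
                 does not say how the ratio is thresholded).
    Returns (kept channel indices in ascending order, per-channel ratios).
    """
    energy = features ** 2                               # per-pixel energy
    target_energy = energy[:, target_mask].mean(axis=1)  # mean inside target
    search_energy = energy.mean(axis=(1, 2))             # mean over search area
    ratio = target_energy / (search_energy + eps)        # eps avoids 0/0

    n_keep = max(1, int(round(keep_ratio * features.shape[0])))
    top = np.argsort(ratio)[::-1][:n_keep]               # highest ratios first
    return np.sort(top), ratio


# Toy example: 4 channels on an 8x8 search area with a centred 4x4 target.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
feats = np.zeros((4, 8, 8))
feats[0][mask] = 1.0    # energy concentrated on the target
feats[1][:] = 1.0       # energy spread uniformly
feats[2][~mask] = 1.0   # energy only in the background (interfering)
# channel 3 stays all zero (invalid)

kept, ratio = select_channels(feats, mask, keep_ratio=0.5)
print(kept)  # channels 0 and 1 are kept; the background and empty channels are dropped
```

Keeping a fixed fraction of channels is only one possible rule. A threshold on the ratio itself would serve the same purpose and lets the number of retained channels vary from sequence to sequence.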