| Object detection is a fundamental problem in computer vision, with important theoretical significance and practical value in areas such as public safety, intelligent manufacturing, and intelligent transportation. However, real scenes have complex distribution characteristics, including a huge number of object categories, variable object scales, background noise interference, and differences between modalities, which cause object detectors to suffer from missed detections, category confusion, and difficult localization. It is therefore urgent to build an efficient object detection model for computer vision and multimedia applications. Based on these considerations, this dissertation studies key technologies of visual object detection to address these problems. To build an efficient visual object detection model, it conducts research from three aspects: object feature extraction, detection network structure, and network learning optimization. It further discusses the multi-modal object detection task based on images and natural language in different scenarios. The detailed research content and main innovations can be summarized as follows: (1) Background noise interference in the object area easily causes semantic category confusion between different objects. To solve this problem, this dissertation proposes a multi-level context feature extraction method for object detection. First, it designs a dynamic encoder-decoder segmentation network to capture more precise pixel-level object segmentation information. Then, it collects context information from various distance ranges, including local object regions, non-local object regions, and the surrounding environment, and establishes semantic dependencies between the different kinds of context information. Finally, the multi-level context information is used to mine auxiliary segmentation features and suppress the noise interference in
the object area, thereby improving multi-category object detection performance. (2) Large differences in object size and aspect ratio can cause missed and false detections. To solve this problem, this dissertation proposes a multi-scale gate fusion feature extraction method for object detection. First, it constructs a gate fusion module that calculates the semantic importance of channels at adjacent scales and adaptively controls the information flow of features at different scales, so as to assign appropriate features to each object. Meanwhile, based on the aspect ratio of the current object, the method flexibly selects the shape-specific region features related to this object, thereby avoiding the object feature distortion caused by fixed region pooling. This method effectively improves detection performance for objects of different sizes and aspect ratios. (3) To deal with changes in object appearance, this dissertation proposes a crossline-based object detection network. It first designs a set of flexible and learnable cross lines to represent objects, which can effectively perceive feature changes along the horizontal and vertical directions. Then, it constructs an axis-query growth module to find the surrounding pixels semantically related to the current line features along the axis direction. This process can be directly supervised by the bounding box annotation to flexibly determine the growth direction of the current line, so as to cope with changes in object appearance in the visual scene. Finally, it proposes a semantic-guided label assignment and a decoupled regression optimization mechanism, which adaptively select cross lines with higher semantic richness as optimization targets, thereby further improving the flexibility and accuracy of the detection network structure. (4) To address the difficulty of bounding box regression optimization, this dissertation proposes an offset bin
probability optimization method for object detection. By analyzing the problems of existing bounding box regression optimization, the method quantizes the continuous coordinate offset value into multiple discrete offset bins. Then, it adopts a distance-aware offset bin classifier to predict the coordinate offset distributions corresponding to the current sample, including a single label distribution and a distance-aware label distribution. In addition, it proposes an expected estimation offset generation method to transform these discrete offset intervals into precise coordinate offset values. It also proposes a hierarchical focusing offset generation method to gradually refine the discrete offset bin range, thus improving the quality of the detected bounding boxes. (5) Since it is difficult to map the appearance details of objects in the multi-modal object detection task, this dissertation proposes a progressive deformable object representation for multi-modal object detection. First, it proposes a language-aware deformable object model that adaptively samples a set of object keypoints in the image related to the input language, thereby capturing the object details described by natural language. Then, it establishes a bidirectional interaction between linguistic and visual features to further enhance the semantic relationships between cross-modal features. Finally, it carefully maps the objects and relations mentioned in the input language to the image, from the local word level to the global sentence level, so as to accurately detect the object described by the language in the input image. (6) To meet the practical need for cross-modal mapping in complex crowd scenes, this dissertation is the first to explore multi-modal object detection in crowd scenes. It first constructs a more challenging multi-modal crowd object detection dataset (Ref Crowd), which contains rich and diverse crowd scene images and language descriptions with attribute details. To address this
challenge, it proposes a fine-grained multi-modal attribute contrastive learning model. By establishing an attribute-aware multi-modal decomposition module, it decomposes complex image and language features into explicit multi-modal attribute features. Finally, it designs a fine-grained attribute contrastive module to distinguish the subtle differences between similar persons, so as to realize fine-grained mapping from language to vision in crowd scenes and further promote research and development in the field of object detection. |
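
As an illustration of the offset bin idea in contribution (4), the following minimal sketch quantizes a continuous coordinate offset into discrete bins, builds a single label target and a distance-aware target over those bins, and recovers a continuous offset as the expectation over bin centers. It is not the dissertation's implementation. The function names (`make_bins`, `single_label_target`, `distance_aware_target`, `expected_offset`), the equal-width bins, the offset range, and the softmax over negative distances used for the soft label are all assumptions made for this example.

```python
import math


def make_bins(lo, hi, num_bins):
    """Centers of num_bins equal-width bins covering [lo, hi] (assumed layout)."""
    width = (hi - lo) / num_bins
    return [lo + (i + 0.5) * width for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_target(offset, centers):
    """One-hot target: all mass on the bin whose center is nearest the offset."""
    nearest = min(range(len(centers)), key=lambda i: abs(offset - centers[i]))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]


def distance_aware_target(offset, centers, temperature=0.5):
    """Soft target: bins closer to the true offset receive more mass.

    The softmax-of-negative-distance form is an assumption for illustration.
    """
    return softmax([-abs(offset - c) / temperature for c in centers])


def expected_offset(probs, centers):
    """Continuous offset as the probability-weighted mean of bin centers."""
    return sum(p * c for p, c in zip(probs, centers))


if __name__ == "__main__":
    centers = make_bins(-1.0, 1.0, 4)
    hard = single_label_target(0.3, centers)
    soft = distance_aware_target(0.3, centers)
    print("bin centers:", centers)
    print("single label target:", hard)
    print("distance-aware target:", [round(p, 3) for p in soft])
    print("expected offset from soft target:", round(expected_offset(soft, centers), 3))
```

With a one-hot target the expectation can only return a bin center, so the recoverable precision is limited by the bin width. A soft distribution lets the expectation fall between centers, which is one plausible reason to pair a distance-aware label with an expectation-based decoding step.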