| Thanks to its flexibility and portability, low-altitude UAV vision is widely used in energy, infrastructure, agriculture, commerce, public safety, and other fields. Automatic processing of low-altitude UAV visual data has become an urgent industry need, and object detection in low-altitude UAV vision is now a research hotspot. Compared with general object detection data, low-altitude UAV visual data contains more small objects, with lower average object resolution and smaller relative object scale. Small object detection is therefore both the focus and the main difficulty of object detection in low-altitude UAV vision. Although deep learning has achieved great success in general object detection, existing methods still have several shortcomings when detecting small objects in low-altitude UAV vision: 1) Data preprocessing. Existing image-level preprocessing methods enlarge the input image to a single scale or build image pyramids to improve small-object detection accuracy. However, single-scale detection suffers from the diversity of object scales, while image pyramids increase training and inference time and introduce false alarms. 2) Backbone networks. Static convolution limits the capacity and flexibility of the backbone network, while the high overhead of dynamic filters makes them difficult to use. In addition, although Swin Transformer achieves excellent performance, this thesis finds that its local self-attention (LSA) performs only on par with depth-wise convolution (DwConv). 3) High-level network. The high-level network of existing detection models is not well suited to low-altitude UAV visual object detection. Specifically, existing neck networks do not focus on enhancing large-scale features, even though such features are particularly critical for small object detection. The anchor hyperparameters of the region proposal network do not fit the object distribution of low-altitude UAV vision, which restricts model training and
inference, and suitable hyperparameters are hard to set manually. To address these issues, this thesis makes the following contributions: · Proposes scale-adaptive image cropping. The object scale in low-altitude UAV vision is observed to be closely related to the shooting distance but only weakly related to the perspective effect. Based on this characteristic, this thesis presents a new image-level data enhancement method called Scale Adaptive Image Cropping (SAIC). SAIC defines a Normalized Average Object Relative Scale (NAORS) level that reflects the shooting distance, designs a scale-level classification model, and resizes and crops images based on the scale level, thereby enlarging small objects, reducing the diversity of object scales, and avoiding false alarms. The SAIC-based FPN detector won third place in the 2018 VisDrone competition. · Proposes the Decoupled Dynamic Filter backbone network. Static convolutions limit the capacity and flexibility of backbone networks, while dynamic filters are expensive. To address this, this work proposes the lightweight Decoupled Dynamic Filter (DDF). The key idea of DDF is to generate decoupled spatial and channel filters rather than the original dynamic filter. When the filters are applied, DDF combines the spatial and channel filters at the corresponding pixel and channel. DDF can seamlessly replace standard convolutional layers in ResNets, consistently improving their accuracy while reducing the number of model parameters and FLOPs. Finally, this work applies DDF-ResNet as the backbone network of a low-altitude UAV visual object detector. Experiments show that the DDF-ResNet-based detector achieves higher detection accuracy with fewer parameters. · Proposes enhanced local self-attention. Although the Swin Transformer backbone network has achieved great success, this work finds that the performance of LSA in Swin Transformer is limited and only on par with DwConv. By comparing DwConv, dynamic filters, and LSA, this work points out that the
relative position embedding and neighborhood attention application are the key factors limiting the performance of LSA. On this basis, this thesis further proposes Enhanced Local Self-Attention (ELSA) to improve the performance of Swin Transformer. Experiments show that the proposed ELSA-Swin significantly improves model accuracy across multiple tasks. · Proposes a dense neck network and an anchor adaptation strategy. The high-level network of existing detection models is not well suited to low-altitude UAV visual object detection. Specifically, the neck network does not focus on enhancing large-scale features, and the anchor hyperparameters of the region proposal network do not fit the object distribution of low-altitude UAV visual data. To address the shortcomings of neck networks, this thesis introduces a content-aware upsampling operator and applies multiple densely connected feature fusion units to construct a dense neck network that enhances large-scale features. For the anchor hyperparameters, this thesis proposes an anchor adaptation strategy that automatically optimizes them during model training, making the detection model insensitive to the manually set initial values. Experimental results show that each component effectively improves the baseline model and that using them jointly improves detection accuracy considerably further. |
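
As an illustration of the Decoupled Dynamic Filter idea summarized above, the following is a minimal pure-Python sketch of how a per-pixel spatial filter and a per-channel channel filter might be combined when the filter is applied. The function name `ddf_apply`, the data layout, the zero padding, and in particular the element-wise product used to combine the two filters are assumptions made for illustration; the abstract only states that the two filters are combined at the corresponding pixel and channel, and it does not describe how the filters are generated.

```python
def ddf_apply(feature, spatial_filters, channel_filters, k=3):
    """Apply decoupled dynamic filters to a feature map (illustrative sketch).

    feature[c][y][x]        : input feature map with C channels of size H x W
    spatial_filters[y][x]   : flat list of k*k weights for pixel (y, x)
    channel_filters[c]      : flat list of k*k weights for channel c
    Returns a new feature map of the same shape.
    """
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    r = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy, xx = y + dy, x + dx
                        # Assumed zero padding: skip out-of-bounds neighbors.
                        if 0 <= yy < height and 0 <= xx < width:
                            idx = (dy + r) * k + (dx + r)
                            # Assumed combination rule: product of the
                            # pixel's spatial weight and the channel's weight.
                            w = spatial_filters[y][x][idx] * channel_filters[c][idx]
                            acc += w * feature[c][yy][xx]
                out[c][y][x] = acc
    return out


# Tiny usage example: one channel, 3x3 map, all-ones filters.
feature = [[[1.0] * 3 for _ in range(3)]]
spatial = [[[1.0] * 9 for _ in range(3)] for _ in range(3)]
channel = [[1.0] * 9]
result = ddf_apply(feature, spatial, channel)
```

Under these assumed shapes, the decoupled form stores H·W·k² spatial weights plus C·k² channel weights, rather than the C·H·W·k² weights a separate filter for every pixel and channel would need, which is one way to read the "lightweight" claim.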