Perception based on computer vision tasks is of great significance for fields such as robotics, autonomous driving, the Internet of Things, and automation, and accurate, robust perception is the foundation of any stable system. Tasks such as object detection, semantic segmentation, and traffic-sign classification all rely on multi-scale feature encoding in a backbone network as the basic feature extractor, and the performance of this encoder largely determines the upper bound of a model. Convolutional neural networks (CNNs), long the mainstream solution for computer vision, have failed to deliver the expected performance gains under today's increasingly complex application scenarios and task requirements. As an architecture capable of modeling long-range dependencies, the Transformer, with the attention module at its core, performs better when a task requires a large receptive field and dynamic attention. However, the computational complexity of traditional attention grows quadratically with sequence length, which becomes prohibitive when the input changes from text sequences to images, and the Transformer's network structure and training methods are not well suited to vision tasks. To address these problems, this thesis proposes a feature-extraction backbone network, called ConMW Transformer, which combines the Transformer with convolutional neural networks. The network introduces a new fused window attention mechanism with linear complexity, and an overall lightweight redesign allows it to better meet the needs of small-dataset tasks. The main contributions of this thesis are as follows:

(1) To address the excessive redundancy in attention modules, which slows inference on long image sequences, this thesis designs a fused window attention module whose computational complexity is linear in the length of the image sequence. First, restricting attention to local windows reduces the overall computation. Second, combining convolutional projection with linear projection injects more local context into each feature sequence. Third, large-kernel convolutions connect the windows, increasing information exchange between them. Finally, connecting the different attention sub-heads mitigates the low-rank bottleneck, and mixing the single distributions enhances the dynamic fitting ability of attention, yielding a more appropriate joint representation. With these modifications the model's overall accuracy improves at linear complexity: using only ImageNet-1K pre-training, replacing the original attention module of a vision transformer with this module improves image-classification accuracy by 2.2% and inference speed by 203%.
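As an illustration only, the following PyTorch sketch shows the three structural ingredients named above: attention restricted to fixed-size windows, a depthwise convolutional projection fused with the linear Q/K/V projection, and a large-kernel depthwise convolution that exchanges information between windows. All names and hyper-parameters (FusedWindowAttention, lk_size, the 3x3 projection kernel) are assumptions, not the thesis's actual implementation, and the head-connection and distribution-mixing steps are omitted for brevity.

```python
import torch.nn as nn

class FusedWindowAttention(nn.Module):
    """Hypothetical sketch: window attention with a convolutional projection
    and a large-kernel convolution for cross-window information exchange."""

    def __init__(self, dim, window_size=7, num_heads=4, lk_size=13):
        super().__init__()
        assert dim % num_heads == 0
        self.ws, self.heads = window_size, num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Depthwise conv projection: mixes local context into every token
        # before the linear projection that produces Q, K and V.
        self.conv_proj = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Large-kernel depthwise conv: connects neighbouring windows at
        # linear cost, since window size and kernel size are constants.
        self.cross_window = nn.Conv2d(dim, dim, lk_size,
                                      padding=lk_size // 2, groups=dim)

    def forward(self, x):            # x: (B, C, H, W), H and W multiples of ws
        B, C, H, W = x.shape
        ws = self.ws
        x = x + self.conv_proj(x)    # fuse convolutional and identity paths
        # Partition into non-overlapping windows -> (B * nWindows, ws*ws, C).
        win = x.view(B, C, H // ws, ws, W // ws, ws)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        qkv = self.qkv(win).reshape(-1, ws * ws, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # (B*nW, heads, ws*ws, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v         # attention never leaves the window
        out = self.proj(out.transpose(1, 2).reshape(-1, ws * ws, C))
        # Undo the window partition back to (B, C, H, W).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out + self.cross_window(out)    # cross-window exchange
```

Because the window and kernel sizes are fixed, the cost of both the attention and the convolutions grows linearly with the number of tokens, which is the complexity property the module is designed around.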
(2) To address the slow start of model training, its heavy demand for data, and an output format that is difficult to adapt to downstream tasks, this thesis redesigns the overall network architecture around the fused window attention proposed above, covering pooling, normalization, the multi-scale feature-map interface, position encoding, feature projection, and the conversion of raw features. First, introducing convolution into the Transformer helps it converge quickly and improves accuracy. Then, inductive bias is introduced in patch partitioning and projection. Finally, after the feature map is partitioned into windows, convolution extracts a representative vector from each window for global attention computation. This architectural redesign allows the model to adapt well to downstream tasks. In object-detection experiments on the COCO 2017 dataset, replacing the backbone of Cascade Mask R-CNN improves accuracy to 52.1 box AP. In semantic-segmentation experiments on the ADE20K dataset with UperNet as the base framework, the small model exceeds the state-of-the-art Swin-T and Twins-S by 2.6 mIoU and 1.3 mIoU, respectively, and the base model surpasses Swin-B by 0.2 mIoU with 34% fewer parameters. Classification accuracy on ImageNet reaches 83.7%: the small model improves on Swin-T by 1.7% at similar computational cost, and the base model improves accuracy by 0.2% while reducing computation by 24% compared with Swin-B.

(3) To address the fact that the network is still too large for the hardware of a robot platform, which hinders deployment and destabilizes training, this thesis carries out an overall lightweighting study of the network, exploring a variety of training methods and examining the influence of dataset size and data-augmentation strategies. Weight sharing and knowledge distillation are applied for lightweighting, and convolutional layers replace the original Transformer layers at the bottom of the network; this reduces the model parameters by 50%, accelerates inference, and eases the hardware burden, while costing only 0.3% in overall accuracy. Finally, the fully lightweight model is adapted to a camouflage-detection task on a small self-collected dataset, showing that the Transformer is no longer limited to large datasets; this improves the Transformer's versatility and achieves an overall recognition accuracy of 52 mIoU.
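The two lightweighting ingredients named in (3), replacing the bottom Transformer stages with convolutional stages and training the reduced model under a teacher, can be sketched as follows. The ConvStage module, the distillation_loss helper, and every hyper-parameter here are illustrative assumptions showing only the standard soft-target distillation recipe, not the thesis's exact training setup.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvStage(nn.Module):
    """Hypothetical convolutional stage standing in for a bottom
    (high-resolution) Transformer stage of the backbone."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),  # 2x downsample
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: cross-entropy on the labels plus KL
    divergence to the teacher's temperature-softened predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

In such a setup the full model would serve as a frozen teacher while the hybrid student, with convolutional blocks at the bottom, minimizes this loss; the convolutional stages carry no attention state, which is what reduces the parameter count and the hardware burden.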