In recent years, image semantic segmentation, a challenging research topic in computer vision, has played a crucial role in real-world scenarios such as autonomous driving, medical image analysis, UAV landing-site determination, and satellite remote sensing. Benefiting from the great success of deep convolutional neural networks in image processing, semantic segmentation has achieved significant performance improvements. However, most methods improve segmentation quality only by increasing model complexity, ignoring hardware constraints such as memory footprint, GPU memory consumption, and inference latency. To address these problems, this paper proposes, on the basis of deep convolutional neural networks, lightweight real-time semantic segmentation models that jointly account for accuracy, inference speed, and memory footprint. The specific research contents are as follows:

(1) This paper designs MSCFNet, a novel lightweight network based on a multi-scale context fusion scheme, and explores an asymmetric encoder-decoder architecture. The encoder employs an efficient asymmetric residual convolution module composed of factorized convolution, depthwise separable convolution, and dilated convolution; after each of the three downsampling stages, information from the original image supplements the detail at the corresponding scale. Meanwhile, attention branches at different stages of the network capture multi-scale contextual information, which is fused in the decoder to improve the representation of image features. MSCFNet achieves 71.9% mIoU on the Cityscapes dataset while running at over 50 FPS on a Titan XP GPU, and 69.3% mIoU on the CamVid dataset, striking a good balance between segmentation efficiency and segmentation accuracy.

(2) To address the information loss caused by downsampling in the encoder, this paper proposes FBSNet, a fast bilateral symmetric network. FBSNet adopts a symmetric encoder-decoder structure with two branches: a semantic information branch and a spatial detail branch. The semantic information branch, the main branch, is a deep network that captures rich contextual information from the input image and obtains a sufficient receptive field; the spatial detail branch is a shallow, simple network that establishes local dependencies between pixels to preserve details. A feature aggregation module is designed to effectively combine the output features of the two branches. Tested on an RTX 2080Ti GPU, FBSNet achieves 70.9% mIoU at 90 FPS on the Cityscapes dataset and 68.9% mIoU at 120 FPS on the CamVid dataset, with a model size of only 0.62M parameters and a computational cost of only 9.7 GFLOPs, making it an efficient segmentation method under limited hardware resources.

(3) Since the Transformer is not restricted to weighting local information but instead models global interdependence, this paper combines a convolutional neural network (CNN) with the Transformer and proposes LETNet, an encoder-decoder structured network. The encoder first extracts features from the input image; the feature map is then reshaped into a one-dimensional sequence and fed into an efficient Transformer for global feature modeling. The decoder restores the original resolution for pixel-wise classification, producing the final segmentation map. With only a single RTX 3090 GPU, only 0.91M parameters, and 12.6 GFLOPs of computation, LETNet achieves 71.6% mIoU on the Cityscapes dataset and 70.5% on the CamVid dataset, fully demonstrating the effectiveness of combining CNNs with the Transformer in real-time semantic segmentation.

In summary, this paper proposes three effective networks for real-time semantic segmentation. Starting from practical problems, it combines dilated convolution, depthwise separable convolution, the attention mechanism, and the Transformer with convolutional neural networks, achieving a better balance among accuracy, model size, and inference speed. Experiments show that the proposed methods achieve the expected results and have the ability to be deployed in practical scenarios.
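The asymmetric residual module described above relies on factorized, depthwise separable, and dilated convolutions chiefly because they shrink parameter counts while preserving or enlarging the receptive field. A minimal sketch of that arithmetic (the channel counts and kernel sizes here are illustrative, not the actual configuration used in MSCFNet):

```python
def standard_conv_params(c_in, c_out, k):
    # Standard k x k convolution (bias terms omitted throughout).
    return c_in * c_out * k * k

def factorized_conv_params(c_in, c_out, k):
    # Factorized into a k x 1 convolution followed by a 1 x k one,
    # with c_out intermediate channels.
    return c_in * c_out * k + c_out * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise k x k per input channel, then a 1 x 1 pointwise conv.
    return c_in * k * k + c_in * c_out

def dilated_effective_kernel(k, d):
    # A k x k kernel with dilation d covers the same span as an
    # effective kernel of size k + (k - 1) * (d - 1), at no extra cost.
    return k + (k - 1) * (d - 1)

# Illustrative numbers for a 64 -> 64 channel, 3 x 3 layer:
print(standard_conv_params(64, 64, 3))        # 36864
print(factorized_conv_params(64, 64, 3))      # 24576
print(depthwise_separable_params(64, 64, 3))  # 4672
print(dilated_effective_kernel(3, 2))         # 5
```

Combining the three tricks is what lets a residual block keep its 3 x 3-style modeling capacity at a fraction of the parameter budget, which is the core of the lightweight design above.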
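The split in FBSNet between a deep semantic branch and a shallow detail branch comes down to receptive-field growth: stacking convolutions (especially strided ones) rapidly widens the region each output pixel sees, while a shallow stack stays local. A rough sketch of this arithmetic, with illustrative layer configurations that are not FBSNet's actual ones:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as
    (kernel_size, stride), using the standard recurrence:
    rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# A shallow detail branch: three stride-1 3x3 convs stay local.
print(receptive_field([(3, 1)] * 3))  # 7

# A deeper branch with two stride-2 downsamplings sees far more
# context per output pixel, at the cost of spatial resolution.
print(receptive_field([(3, 2), (3, 2), (3, 1), (3, 1), (3, 1)]))  # 31
```

This is why the two branches are complementary: one supplies context, the other supplies the pixel-level detail the deep branch has downsampled away, and the feature aggregation module merges the two.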
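The reshape step that feeds LETNet's CNN feature map into the Transformer can be sketched in plain Python; shapes and helper names here are illustrative, and a real implementation would operate on framework tensors rather than nested lists:

```python
def feature_map_to_tokens(fmap):
    """Flatten a C x H x W feature map (nested lists) into an
    (H*W) x C token sequence: one C-dimensional token per spatial
    position, the 1-D form a Transformer encoder consumes."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[ch][i][j] for ch in range(c)]
            for i in range(h) for j in range(w)]

def tokens_to_feature_map(tokens, h, w):
    """Inverse reshape used on the decoder side: restore the spatial
    layout before upsampling back to the original resolution."""
    c = len(tokens[0])
    return [[[tokens[i * w + j][ch] for j in range(w)]
             for i in range(h)] for ch in range(c)]

# A tiny 2-channel, 2 x 3 feature map round-trips through the
# token representation unchanged.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]]]
tokens = feature_map_to_tokens(fmap)
print(len(tokens), len(tokens[0]))                  # 6 2
print(tokens_to_feature_map(tokens, 2, 3) == fmap)  # True
```

The round trip being lossless is the point: the Transformer models global dependencies over the token sequence, and the decoder can still recover a spatial map for pixel-wise prediction.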