As China’s satellite remote sensing monitoring system continues to improve, the channels for acquiring high-resolution remote sensing images are becoming more abundant. These images play a significant role in disaster assessment, soil and environmental protection, national military security, and other applications. Realizing these applications requires processing and analyzing remote sensing images, transforming complex image information into more directly usable structural and statistical information. Semantic segmentation, as a pixel-level image processing technology, is the basis for computers to understand image content in depth. However, because high-resolution remote sensing images contain diverse types of land cover at varying scales and with complex structural details, current semantic segmentation models cannot achieve satisfactory accuracy when applied to them directly. This article focuses on the problems of semantic segmentation of high-resolution remote sensing images and proposes a multi-modal fusion model to improve the recognition of land cover at different scales. The main research content is as follows:

(1) Because the experimental dataset cannot be input into the model directly, this article designs experiments to determine how different tiling methods influence the experimental results. Based on these results, appropriate tile sizes and overlap are selected, and various data augmentation methods are used to enrich the diversity of the dataset (see the tiling sketch below).

(2) To improve the accuracy of segmentation results on high-resolution remote sensing images, a multi-modal fusion semantic segmentation network is designed around the characteristics of these images. The model uses a Transformer-based encoder to fully extract multi-scale features from each modality and an attention-based multi-modal fusion module to fully integrate those features; a lightweight decoder then produces the final segmentation maps (see the fusion sketch below).

(3) To verify the effectiveness of the proposed model, experiments are conducted on two publicly available datasets released by ISPRS, Vaihingen and Potsdam. Using mIoU and MPA as evaluation metrics, the accuracy of single-modal versus multi-modal inputs and of the improved model versus the baseline model are compared (see the metric sketch below). The results show that the proposed model improves mIoU and MPA by 1.64% and 1.17%, respectively, over the baseline model.
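For reference, below is a minimal sketch of the tiling and augmentation step described in (1), assuming images are numpy arrays in (H, W, C) layout. The tile size of 512 and overlap of 128 are illustrative assumptions, not the settings chosen in the experiments, and the function names are my own.

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 512, overlap: int = 128):
    """Cut a large (H, W, C) image into overlapping square tiles.

    Tiles are taken with stride = tile_size - overlap; the last row and
    column of tiles are shifted back so they stay inside the image.
    """
    h, w = image.shape[:2]
    assert h >= tile_size and w >= tile_size, "image smaller than one tile"
    stride = tile_size - overlap
    ys = list(range(0, h - tile_size + 1, stride))
    xs = list(range(0, w - tile_size + 1, stride))
    if ys[-1] != h - tile_size:          # cover the bottom edge exactly once
        ys.append(h - tile_size)
    if xs[-1] != w - tile_size:          # cover the right edge exactly once
        xs.append(w - tile_size)
    return [((y, x), image[y:y + tile_size, x:x + tile_size])
            for y in ys for x in xs]

def augment(image: np.ndarray, label: np.ndarray, rng: np.random.Generator):
    """Random flips and 90-degree rotations, applied identically to the
    image and its label mask so pixel alignment is preserved."""
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, label = image[::-1], label[::-1]          # vertical flip
    k = int(rng.integers(4))
    return np.rot90(image, k), np.rot90(label, k)
```

Flips and rotations are only one common family of augmentations; the abstract does not say which ones the experiments actually used.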
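The abstract does not specify the internal design of the attention-based fusion module mentioned in (2), so the sketch below shows only one plausible realization in PyTorch: bidirectional cross-attention between two modality feature maps, merged by a 1x1 convolution. The class name, head count, and fusion pattern are all assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Bidirectional cross-attention fusion of two modality feature maps.

    Each modality attends to the other, the attended features are added
    back residually, and a 1x1 convolution merges the two streams.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        self.attn_ab = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, C, H, W) features from the two modality encoders.
        n, c, h, w = feat_a.shape
        seq_a = feat_a.flatten(2).transpose(1, 2)   # (B, HW, C)
        seq_b = feat_b.flatten(2).transpose(1, 2)
        qa, qb = self.norm_a(seq_a), self.norm_b(seq_b)
        # Modality A queries modality B, and vice versa.
        fused_a, _ = self.attn_ab(qa, qb, qb)
        fused_b, _ = self.attn_ba(qb, qa, qa)
        seq_a = (seq_a + fused_a).transpose(1, 2).reshape(n, c, h, w)
        seq_b = (seq_b + fused_b).transpose(1, 2).reshape(n, c, h, w)
        return self.merge(torch.cat([seq_a, seq_b], dim=1))
```

In use, such a module would fuse one feature map per modality at each encoder scale, e.g. `CrossAttentionFusion(dim=256)(rgb_feats, dsm_feats)`; on the ISPRS datasets the second modality is typically the DSM, though the abstract does not name the modalities.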
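The two metrics reported in (3) are standard: mIoU is the mean of per-class intersection-over-union, and MPA is the mean of per-class pixel accuracy. A small numpy sketch, computing both from a confusion matrix (function names are my own):

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a confusion matrix; rows are ground truth, columns predictions."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_mpa(cm: np.ndarray) -> tuple[float, float]:
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                 # predicted as the class, actually another
    fn = cm.sum(axis=1) - tp                 # actually the class, predicted as another
    iou = tp / np.maximum(tp + fp + fn, 1)   # per-class intersection over union
    pa = tp / np.maximum(tp + fn, 1)         # per-class pixel accuracy (recall)
    return float(iou.mean()), float(pa.mean())   # mIoU, MPA
```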