Semantic segmentation-based visual perception is a critical component of fields such as autonomous driving and industrial automation. However, under adverse weather and complex lighting, single-modal semantic segmentation often suffers from sensor degradation that leads to segmentation errors. Multimodal visual perception, which combines data from multiple sensors to improve robustness under challenging conditions, is therefore an important direction for the development of visual perception technology. This dissertation focuses on semantic segmentation that fuses RGB and thermal (RGB-T) images, exploiting the complementary characteristics of light at different wavelengths: RGB images offer rich texture, while thermal images are insensitive to illumination, image passively without a light source, and penetrate obscurants well. By combining these strengths, RGB-T semantic segmentation achieves greater robustness in complex environments. The main challenges of multimodal-fusion semantic segmentation are the limited availability and small size of public datasets, simple fusion strategies that cannot integrate information efficiently, misalignment that degrades segmentation accuracy, and the extra computational cost of pre-alignment. In response to these challenges, the main contributions of this dissertation are as follows:

(1) This dissertation presents a multimodal camera calibration method based on the Maximum Index Map (MIM) derived from phase congruency, for online calibration of RGB-T extrinsics in vehicle environments. Matched feature-point pairs are obtained from the MIM of phase-congruency features, and an improved eight-point algorithm estimates the extrinsics of the RGB-T camera pair. Because ordinary checkerboards image poorly in the thermal band, a checkerboard made of materials with different emissivities is proposed as a target suitable for both RGB-T intrinsic and extrinsic calibration, eliminating the need for continuous heating or infrared sources. Building on the proposed calibration method, a multimodal data acquisition platform is designed that integrates a 1280 × 960 RGB image acquisition system, a 640 × 512 thermal image acquisition system, and a LiDAR point-cloud acquisition system. RGB and thermal acquisition are synchronized by a shared hardware clock, alleviating data misalignment caused by temporal asynchrony. Experimental results demonstrate that the proposed intrinsic and extrinsic calibration meets the requirements of RGB-T camera parameter calibration, with the root-mean-square reprojection error kept within 1.2 pixels. The method supports online calibration of multimodal cameras in vibrating working environments, reducing the downtime incurred by laboratory calibration and providing a platform foundation for multimodal fusion visual perception.
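As an illustration of the extrinsic-estimation step, the following is a minimal NumPy sketch of the classical normalized eight-point algorithm on which the improved variant builds; it assumes matched RGB/thermal point pairs (e.g., extracted from the MIM) are already available, and it is not the dissertation's exact implementation.

```python
import numpy as np

def normalize_points(pts):
    """Hartley normalization: move the centroid to the origin and
    scale the mean distance to sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - centroid, axis=1).mean()
    T = np.array([[scale, 0, -scale * centroid[0]],
                  [0, scale, -scale * centroid[1]],
                  [0, 0, 1]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ pts_h.T).T, T

def eight_point(rgb_pts, thermal_pts):
    """Estimate the fundamental matrix relating the RGB and thermal views
    from >= 8 matched pixel pairs, each of shape (N, 2)."""
    x1, T1 = normalize_points(rgb_pts)
    x2, T2 = normalize_points(thermal_pts)
    # Each match (p1, p2) contributes one row of A in the system A f = 0.
    A = np.stack([np.concatenate([p2[0] * p1, p2[1] * p1, p1])
                  for p1, p2 in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint, then undo the normalization.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1
```

The Hartley normalization is what keeps the linear system well conditioned on raw pixel coordinates, which is why the normalized form is the usual starting point for refinements.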
(2) This dissertation proposes a data augmentation method based on the Segment Anything Model (SAM) and multimodal joint augmentation, comprising three processing operations: label optimization, annotation, and augmentation. To address poor-quality labels in existing datasets, an automatic label optimization algorithm is introduced and interactive annotation software is implemented, resolving problems such as mislabeling, missing labels, and data misalignment. Whereas traditional polygon annotation is labor-intensive, the proposed approach requires only point and box labels, simplifying the labeling process and improving annotation efficiency. To handle misaligned multimodal data, a multi-label annotation scheme assigns a separate label map to each modality; these labels can be used flexibly for training both semantic segmentation and registration networks. To counter overfitting and memorization when training on limited data, a multimodal joint augmentation scheme is proposed, as sketched below. The proposed augmentation method markedly improves generalization, increasing mIoU by 6.1% and mAcc by 4.7%, and can be used for dataset generation and enhancement, providing a data foundation for multimodal semantic segmentation.
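The key property of joint augmentation is that each random geometric transform is sampled once and applied identically to every modality and its labels, so pixel-level correspondence is preserved. The sketch below illustrates this under the assumption of aligned arrays (RGB of shape (H, W, 3), thermal and label of shape (H, W)); the specific transforms and magnitudes are illustrative only, not the dissertation's exact pipeline.

```python
import numpy as np

def joint_augment(rgb, thermal, label, rng=np.random.default_rng()):
    """Apply identical geometric augmentations to all modalities and the
    label so that cross-modal and pixel-label correspondence survives."""
    # Random horizontal flip, shared across modalities.
    if rng.random() < 0.5:
        rgb, thermal, label = rgb[:, ::-1], thermal[:, ::-1], label[:, ::-1]
    # Random crop with a single shared window.
    h, w = label.shape
    ch, cw = int(0.8 * h), int(0.8 * w)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    window = (slice(top, top + ch), slice(left, left + cw))
    rgb, thermal, label = rgb[window], thermal[window], label[window]
    # Photometric jitter may differ per modality; labels stay untouched.
    rgb = np.clip(rgb * rng.uniform(0.8, 1.2), 0, 255).astype(rgb.dtype)
    return rgb, thermal, label
```

Photometric perturbations, unlike geometric ones, can safely differ between modalities because they do not move pixels.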
(3) This dissertation proposes a multimodal feature fusion method based on a Mixture of Experts (MoE) mechanism, which decomposes the multi-channel multimodal attention matrix into modality weights and channel attention: the modality weights amplify informative modalities and suppress noise from degraded ones, while the channel attention selects key feature channels (see the sketch following contribution (4)). This factorization reduces memory consumption during computation and concentrates on the modality and channel differences that determine feature effectiveness. Comparative experiments show that the proposed fusion achieves an mIoU of 62.6% and an mAcc of 72.8% against existing RGB-T semantic segmentation methods, with a model size of 7.68 × 10^7 parameters and an inference speed of 15.79 fps. Compared with current state-of-the-art work, it improves mIoU by 5.3% while halving the model size. The method strengthens the contribution of effective modalities to the final segmentation result, suppresses interference from degraded modalities, and improves the robustness of semantic segmentation to environmental change.

(4) This dissertation introduces an RGB-T semantic segmentation method that requires no pre-registration, based on multi-task learning with encoders shared between the registration and segmentation tasks: separate encoders extract features from the RGB and thermal images, and a registration decoder generates a deformation field that aligns the unregistered thermal features. Exploiting the positive correlation between registration accuracy and segmentation accuracy, the network is trained with an auxiliary semantic segmentation loss. Experimental results demonstrate that the method self-registers unregistered RGB-T data well, further improving segmentation accuracy to an mIoU of 61.1% and an mAcc of 76.0%. This registration-free RGB-T method can be applied to segmentation with misaligned multimodal data, reducing the dependence of multimodal semantic segmentation on data registration and strengthening its resilience to misaligned inputs.
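To make the factorized attention of contribution (3) concrete, here is a minimal PyTorch sketch, assuming a two-modality setting and squeeze-and-excitation style gating; the module name, layer sizes, and exact gating form are assumptions rather than the dissertation's architecture.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Fuse RGB and thermal feature maps by factoring attention into
    modality weights (which modality to trust) and channel attention
    (which channels matter), instead of a full modality-x-channel matrix."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Gating network producing one scalar weight per modality.
        self.modality_gate = nn.Linear(2 * channels, 2)
        # SE-style channel attention on the fused feature.
        self.channel_attn = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, f_rgb, f_thermal):
        b, c, _, _ = f_rgb.shape
        d_rgb = self.pool(f_rgb).flatten(1)      # (B, C) descriptor
        d_th = self.pool(f_thermal).flatten(1)   # (B, C) descriptor
        # Modality weights: boost the informative modality, damp the degraded one.
        w = torch.softmax(self.modality_gate(torch.cat([d_rgb, d_th], 1)), dim=1)
        fused = (w[:, 0, None, None, None] * f_rgb
                 + w[:, 1, None, None, None] * f_thermal)
        # Channel attention selects the key feature channels of the fused map.
        a = self.channel_attn(self.pool(fused).flatten(1)).view(b, c, 1, 1)
        return fused * a
```

Keeping one weight per modality plus one attention value per channel stores O(M + C) coefficients rather than O(M × C) for a full matrix, which is consistent with the reduced memory footprint noted above.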
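The feature-alignment step of contribution (4) can likewise be sketched with PyTorch's grid_sample, a standard way to apply a dense deformation field; the registration decoder that predicts the field is omitted here, and the field is taken as an input.

```python
import torch
import torch.nn.functional as F

def warp_features(thermal_feat, flow):
    """Warp unregistered thermal features with a predicted deformation field.
    thermal_feat: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = thermal_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(thermal_feat)  # match device and dtype
    ys = ys.to(thermal_feat)
    # Displaced sampling positions, normalized to [-1, 1] for grid_sample.
    gx = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(thermal_feat, grid, align_corners=True,
                         padding_mode="border")
```

Because the warp is differentiable, gradients from both the registration loss and the auxiliary segmentation loss can flow back through the deformation field, which is what lets segmentation accuracy supervise registration as described above.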