Because medical images of different modalities are produced by different imaging principles, they depict different organs and tissues with different effectiveness. For example, CT images show bone and organs such as the liver clearly but offer poor contrast between soft tissues, whereas MR images provide higher-resolution soft-tissue detail and respond well to blood flow and metabolic changes in the brain and spinal cord, at the cost of lower spatial resolution. Each modality therefore has its own advantages and limitations, and it is difficult for a single-modality image to contain all the key information about a focal area. By synthesizing the complementary and redundant information in medical images of different modalities, multimodal medical image fusion overcomes the limitations of single-modality imaging of human tissues and organs, improves the utilization of medical image information, and helps clinicians reach more accurate diagnosis and treatment decisions. Through an in-depth study of multimodal medical image fusion theory, the Transformer architecture and the MAE masked pre-training strategy, this paper identifies existing problems and proposes improvements. The main contributions are as follows.

To address the weak global feature representation of existing deep-learning-based multimodal medical image fusion methods, this paper proposes a medical image fusion method based on local-global feature coupling and cross-scale attention. The method consists of an encoder, a fusion rule and a decoder. In the encoder, parallel CNN and Transformer branches extract the local features and the global representation of the image, respectively. At each scale, a feature coupling module embeds the local features of the CNN branch into the global representation of the Transformer branch so that complementary features are combined as fully as possible, and a cross-scale attention module is introduced to make effective use of the multi-scale feature representations. The encoder thus extracts local, global and multi-scale feature representations of the source images to be fused; the fusion rule merges the representations of the different source images, which are then injected into the decoder to generate the fused image. The encoder-decoder network is trained on an image reconstruction task, which avoids the need for a large pre-registered medical image dataset. A minimal sketch of the dual-branch encoder is given below.
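The following is a minimal sketch, assuming a PyTorch implementation, of how a parallel CNN/Transformer encoder with a feature coupling module might be organized. All class names, layer sizes and the single-scale simplification (the cross-scale attention module is omitted) are illustrative assumptions, not the actual implementation described in this paper.

```python
import torch
import torch.nn as nn


class FeatureCoupling(nn.Module):
    """Embeds local CNN features into the Transformer token sequence at one scale."""

    def __init__(self, cnn_channels, embed_dim):
        super().__init__()
        self.proj = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)

    def forward(self, cnn_feat, tokens, grid_hw):
        h, w = grid_hw
        # Project CNN features to the token dimension and resample to the token grid.
        local = self.proj(cnn_feat)
        local = nn.functional.adaptive_avg_pool2d(local, (h, w))
        local = local.flatten(2).transpose(1, 2)      # (B, h*w, embed_dim)
        return tokens + local                         # couple local and global features


class DualBranchEncoder(nn.Module):
    """Parallel CNN / Transformer branches coupled at a single scale."""

    def __init__(self, in_ch=1, cnn_ch=32, embed_dim=64, patch=8, depth=2, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(                     # local-feature (CNN) branch
            nn.Conv2d(in_ch, cnn_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cnn_ch, cnn_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.patch_embed = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=2 * embed_dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)  # global (Transformer) branch
        self.couple = FeatureCoupling(cnn_ch, embed_dim)

    def forward(self, x):
        local = self.cnn(x)                           # local features
        tokens = self.patch_embed(x)                  # patch tokens for the global branch
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, h*w, embed_dim)
        tokens = self.couple(local, tokens, (h, w))   # inject local features into the tokens
        return self.transformer(tokens), local


if __name__ == "__main__":
    encoder = DualBranchEncoder()
    ct_slice = torch.randn(1, 1, 64, 64)              # e.g. a single-channel CT slice
    global_tokens, local_feat = encoder(ct_slice)
    print(global_tokens.shape, local_feat.shape)      # (1, 64, 64) and (1, 32, 64, 64)
```

In the full method described above, this coupling would be repeated at several scales, and the resulting multi-scale representations would interact through the cross-scale attention module before being passed to the fusion rule and decoder.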
Because the reconstruction task used to train the local-global feature coupling and cross-scale attention network is relatively simple, the network cannot learn the high-level feature representations that the image fusion task requires. To address this, a multimodal medical image fusion method based on MAE pre-training is proposed. The method is divided into two stages: pre-training and fusion. In the pre-training stage, the encoder-decoder network is trained on the MAE masked reconstruction task. In the fusion stage, a self-attention-based feature fusion module is designed to replace the hand-crafted fusion rule, and the network parameters are fine-tuned on the multimodal medical image fusion task, yielding the complete fusion model: the encoder extracts the features of each source image, the feature fusion module combines the features of the different images, and the decoder reconstructs the fused image. A sketch of the masked pre-training step and the self-attention fusion module is given at the end of this section.

For the two proposed networks, experiments were carried out in PyCharm and MATLAB, respectively, and the results were compared with recently proposed image fusion methods. The fusion results obtained by the proposed methods perform well in terms of both objective metrics and subjective visual quality, which demonstrates that the multimodal medical image fusion task is carried out effectively.
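As an illustration of the second method, the sketch below shows, with assumed PyTorch code and hypothetical names, the two ingredients it relies on: MAE-style random masking of patch tokens for the pre-training task, and a self-attention feature fusion module that replaces the hand-crafted fusion rule. The token shapes match the encoder sketch above; none of this is taken from the paper's actual implementation.

```python
import torch
import torch.nn as nn


def random_mask(tokens, mask_ratio=0.75):
    """MAE-style pre-training step: keep only a random subset of patch tokens."""
    b, n, d = tokens.shape
    keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :keep]                     # visible-patch indices
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep                                      # decoder reconstructs the rest


class SelfAttentionFusion(nn.Module):
    """Fuses token features of two source images with multi-head self-attention."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a, feat_b):
        # Let the tokens of both modalities attend to each other, then merge
        # the two sequences back into a single fused feature sequence.
        x = torch.cat([feat_a, feat_b], dim=1)
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        fused_a, fused_b = x.split([feat_a.size(1), feat_b.size(1)], dim=1)
        return self.proj(torch.cat([fused_a, fused_b], dim=-1))


if __name__ == "__main__":
    ct_tokens = torch.randn(1, 64, 64)        # encoder output for a CT slice
    mr_tokens = torch.randn(1, 64, 64)        # encoder output for an MR slice
    visible, ids = random_mask(ct_tokens)     # pre-training stage: mask 75% of the patches
    fused = SelfAttentionFusion()(ct_tokens, mr_tokens)   # fusion stage
    print(visible.shape, fused.shape)         # (1, 16, 64) and (1, 64, 64)
```

In the fusion stage described above, the pre-trained encoder and decoder are reused, and a module of this kind is fine-tuned together with them on the multimodal fusion task instead of applying a manually designed fusion rule.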