Computer Science ›› 2025, Vol. 52 ›› Issue (8): 188-194. doi: 10.11896/jsjkx.240600106

• Computer Graphics & Multimedia •


MTFuse:An Infrared and Visible Image Fusion Network Based on Mamba and Transformer

DING Zhengze, NIE Rencan, LI Jintao, SU Huaping, XU Hang   

  1. School of Information Science and Engineering,Yunnan University,Kunming 650091,China
  • Received:2024-06-17 Revised:2024-09-26 Online:2025-08-15 Published:2025-08-08
  • Corresponding author:NIE Rencan(rcnie@ynu.edu.cn)
  • About author:DING Zhengze,born in 2000,postgraduate(dingzhengze@stu.ynu.edu.cn).His main research interests include deep learning and image fusion.
    NIE Rencan,born in 1982,Ph.D,professor,doctoral supervisor.His main research interests include neural networks,image processing and machine learning.
  • Supported by:
    National Natural Science Foundation of China(61966037),Key Project of Yunnan Basic Research Program(202301AS070025,202401AT070467),National Key Research and Development Program of China(2020YFA0714301),Science and Technology Department of Yunnan Province Project(202105AF150011) and Scientific Research Fund of Yunnan Provincial Department of Education(2024Y031).


Abstract: Infrared and visible image fusion aims to retain the thermal radiation information from infrared images and the texture details from visible images to represent the imaging scene and comprehensively promote downstream visual tasks.Fusion models based on convolutional neural networks(CNNs) encounter limitations in capturing global image features due to their focus on local convolutional operations.Although Transformer-based models excel in global feature modeling,they also face computational challenges posed by quadratic complexity.Recently,the selective structured state-space model(Mamba) has shown great potential in modeling long-range dependencies with linear complexity,providing a promising path to address the aforementioned issues.To efficiently model long-range dependencies in images,this paper designs a residual selective structured state space module(RMB) for extracting global features.Simultaneously,to model the relationship between multimodal images,a cross-modal query fusion attention module(CQAM) is designed for adaptive feature fusion.Furthermore,a two-term loss function,comprising a gradient loss and a brightness loss,is designed to train the proposed model in an unsupervised manner.Comparative experiments on fusion quality and efficiency against numerous state-of-the-art methods,together with ablation studies,demonstrate the effectiveness of the proposed MTFuse method.
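The abstract states that MTFuse is trained in an unsupervised manner with a two-term loss combining a gradient term and a brightness term, but gives no formulas here. The following PyTorch-style sketch shows one common way such a loss is written in the infrared-visible fusion literature; the Sobel gradient operator, the element-wise maximum targets, and the weights w_grad and w_int are illustrative assumptions, not the authors' exact definitions.

import torch
import torch.nn.functional as F


def sobel_gradient(x: torch.Tensor) -> torch.Tensor:
    """Per-pixel gradient magnitude via fixed Sobel kernels; x is (B, 1, H, W)."""
    kx = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]], device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return gx.abs() + gy.abs()


def fusion_loss(fused, ir, vis, w_grad=1.0, w_int=1.0):
    """Hypothetical two-term unsupervised fusion loss.

    Gradient term: the fused image should keep the stronger edge response of
    the two sources. Brightness term: it should keep the brighter (thermally
    or visually salient) intensity. Targets and weights are assumptions, not
    the values used in the paper.
    """
    grad_target = torch.maximum(sobel_gradient(ir), sobel_gradient(vis))
    loss_grad = F.l1_loss(sobel_gradient(fused), grad_target)

    int_target = torch.maximum(ir, vis)
    loss_int = F.l1_loss(fused, int_target)
    return w_grad * loss_grad + w_int * loss_int

In practice fused, ir and vis would be single-channel tensors in [0, 1], for example the Y channel of the visible image paired with the registered infrared image.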

Key words: Selective structured state space model, Transformer, Unsupervised learning, Infrared and visible image fusion
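The abstract names two architectural components, the residual selective structured state space module (RMB) and the cross-modal query fusion attention module (CQAM), without giving their internals. The two sketches below are speculative illustrations only. The first assumes the third-party mamba_ssm package's Mamba layer and a simple pre-norm residual connection over flattened feature tokens; LayerNorm placement, d_state and the single scan direction are assumptions, not the paper's RMB.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # third-party selective state-space (S6) layer


class ResidualMambaBlock(nn.Module):
    """Speculative residual Mamba block: flatten an H x W feature map into a
    token sequence, run a selective state-space layer over it with a pre-norm
    residual connection, and reshape back."""

    def __init__(self, dim: int = 64, d_state: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ssm = Mamba(d_model=dim, d_state=d_state, d_conv=4, expand=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a shallow convolutional encoder.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        tokens = tokens + self.ssm(self.norm(tokens))   # residual SSM update
        return tokens.transpose(1, 2).view(b, c, h, w)

The second sketch is a minimal guess at how cross-modal query attention between infrared and visible feature tokens could be wired with standard multi-head attention; the class name, dimensions and the concatenate-then-project merge are assumptions rather than the published CQAM design.

class CrossModalQueryAttention(nn.Module):
    """Illustrative cross-modal query attention: each modality's tokens query
    the other modality's tokens, and the two attended streams are merged."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.ir_queries_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_queries_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # feat_ir / feat_vis: (B, N, C) token sequences from flattened feature maps.
        ir_att, _ = self.ir_queries_vis(query=feat_ir, key=feat_vis, value=feat_vis)
        vis_att, _ = self.vis_queries_ir(query=feat_vis, key=feat_ir, value=feat_ir)
        return self.merge(torch.cat([ir_att, vis_att], dim=-1))

In a full pipeline one would stack residual Mamba blocks on each modality's features and hand the two token streams to the attention-based fusion step; this wiring is likewise an assumption, not the published architecture.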

CLC Number: TP391
[1]CHEN H,DENG L,ZHU L,et al.ECFuse:Edge-Consistent and Correlation-Driven Fusion Framework for Infrared and Visible Image Fusion [J].Sensors,2023,23(19):8071.
[2]KAUR H,KOUNDAL D,KADYAN V.Image fusion techniques:a survey [J].Archives of Computational Methods in Engineering,2021,28(7):4425-4447.
[3]ZHAO W,XIE S,ZHAO F,et al.Metafusion:Infrared and visible image fusion via meta-feature embedding from object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2023.
[4]ZHAO Z,XU S,ZHANG J,et al.Efficient and model-based infrared and visible image fusion via algorithm unrolling [J].IEEE Transactions on Circuits and Systems for Video Technology,2021,32(3):1186-1196.
[5]MA J,MA Y,LI C.Infrared and visible image fusion methods and applications:A survey [J].Information Fusion,2019,45:153-178.
[6]TANG W,LIU Y,CHENG J,et al.A phase congruency-based green fluorescent protein and phase contrast image fusion method in nonsubsampled shearlet transform domain [J].Microscopy Research and Technique,2020,83(10):1225-1234.
[7]ZHANG Q,LIU Y,BLUM R S,et al.Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images:A review [J].Information Fusion,2018,40:57-75.
[8]KONG W,LEI Y,ZHAO H.Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization [J].Infrared Physics & Technology,2014,67:161-172.
[9]MA J,TANG L,FAN F,et al.SwinFusion:Cross-domain long-range learning for general image fusion via swin transformer [J].IEEE/CAA Journal of Automatica Sinica,2022,9(7):1200-1217.
[10]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need [C]//Advances in Neural Information Processing Systems.2017.
[11]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale [J].arXiv:2010.11929,2020.
[12]LIU Z,LIN Y,CAO Y,et al.Swin transformer:Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021.
[13]ZAMIR S W,ARORA A,KHAN S,et al.Restormer:Efficient transformer for high-resolution image restoration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022.
[14]LU J,BATRA D,PARIKH D,et al.ViLBERT:Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [C]//Advances in Neural Information Processing Systems.2019.
[15]SUN Y,DONG L,HUANG S,et al.Retentive network:A successor to transformer for large language models [J].arXiv:2307.08621,2023.
[16]GU A,DAO T.Mamba:Linear-time sequence modeling with selective state spaces [J].arXiv:2312.00752,2023.
[17]LIU Y,TIAN Y,ZHAO Y,et al.VMamba:Visual state space model [J].arXiv:2401.10166,2024.
[18]HAMILTON J D.State-space models [J].Handbook of Econometrics,1994,4:3039-3080.
[19]ZHAO D,SHU X,ZHANG L,et al.Sensor interrogation technique using chirped fibre grating based Sagnac loop [J].Electronics Letters,2002,38(7):312-313.
[20]HAN Y,CAI Y,CAO Y,et al.A new image fusion performance metric based on visual information fidelity [J].Information Fusion,2013,14(2):127-135.
[21]XYDEAS C S,PETROVIC V.Objective image fusion performance measure [J].Electronics Letters,2000,36(4):308-309.
[22]WANG Z,BOVIK A C,SHEIKH H R,et al.Image quality assessment:from error visibility to structural similarity [J].IEEE Transactions on Image Processing,2004,13(4):600-612.
[23]ESKICIOGLU A M,FISHER P S.Image quality measures and their performance [J].IEEE Transactions on Communications,1995,43(12):2959-2965.
[24]CUI G,FENG H,XU Z,et al.Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition [J].Optics Communications,2015,341:199-209.
[25]JAGALINGAM P,HEGDE A V.A review of quality metrics for fused image [J].Aquatic Procedia,2015,4:133-142.
[26]ZHAO Z,XU S,ZHANG C,et al.DIDFuse:Deep image decomposition for infrared and visible image fusion [J].arXiv:2003.09210,2020.
[27]XU H,MA J,JIANG J,et al.U2Fusion:A unified unsupervised image fusion network [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020,44(1):502-518.
[28]MA J,ZHANG H,SHAO Z,et al.GANMcC:A generative adversarial network with multiclassification constraints for infrared and visible image fusion [J].IEEE Transactions on Instrumentation and Measurement,2020,70:1-14.
[29]ZHANG H,MA J.SDNet:A versatile squeeze-and-decomposition network for real-time image fusion [J].International Journal of Computer Vision,2021,129(10):2761-2785.
[30]TANG W,HE F,LIU Y.YDTR:Infrared and visible image fusion via Y-shape dynamic transformer [J].IEEE Transactions on Multimedia,2022,25:5413-5428.
[31]LIU J,FAN X,HUANG Z,et al.Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022.
[32]LIANG P,JIANG J,LIU X,et al.Fusion from decomposition:A self-supervised decomposition approach for image fusion[C]//European Conference on Computer Vision.Springer,2022.
[33]HUANG Z,LIU J,FAN X,et al.ReCoNet:Recurrent correction network for fast and efficient multi-modality image fusion[C]//European Conference on Computer Vision.Springer,2022.
[34]TANG W,HE F,LIU Y,et al.DATFuse:Infrared and visible image fusion via dual attention transformer [J].IEEE Transactions on Circuits and Systems for Video Technology,2023,33(7):3159-3172.
[35]LI H,XU T,WU X J,et al.LRRNet:A novel representation learning guided fusion network for infrared and visible images [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2023,45(9):11040-11052.