Computer Science ›› 2026, Vol. 53 ›› Issue (2): 227-235.doi: 10.11896/jsjkx.241200082

• Computer Graphics & Multimedia •

Multimodal Visual Detection for Underwater Sonar Target Images

HUANG Jing1,2, WANG Teng1, LIU Jian1, HU Kai1, PENG Xin1, HUANG Yamin3,4, WEN Yuanqiao3,4   

  1. 1 School of Computer Science and Artificial Intelligence,Wuhan University of Technology,Wuhan 430070,China
    2 R&D Center of New-Generation Artificial Intelligence Technology Application for Transportation Industry,Zhejiang Provincial Institute of Transportation Science,Hangzhou 310013,China
    3 National Engineering Research Center for Water Transport Safety,Wuhan 430063,China
    4 Intelligent Transportation Systems Research Center,Wuhan University of Technology,Wuhan 430063,China
  • Received:2024-12-10 Revised:2025-03-31 Published:2026-02-10
  • About author:HUANG Jing,born in 1977,Ph.D,associate professor,master supervisor.His main research interests include machine learning,data mining,intelligent transportation systems,pattern recognition and computer vision.
  • Supported by:
    Transportation Industry R&D Center Open Fund for New-Gen AI Applications(202302H),Science and Technology Project of Zhejiang Provincial Department of Transportation(2024006) and National Natural Science Foundation of China(52072287).

Abstract: Because underwater sonar image data and supervisory information for underwater targets are limited, existing object detection algorithms are difficult to apply directly. To address this issue, this paper proposes USD (Underwater Sonar Detection), an open-set object detection method for underwater sonar images based on DETR. In the cross-modal feature fusion encoding module, a multi-scale deformable attention mechanism processes image features iteratively, enabling the network to focus selectively on important information while reducing computational load. In parallel, a multi-head self-attention mechanism iteratively refines text features, strengthening the model's global modeling of sequences. A bidirectional attention mechanism then fuses the text and image features, emphasizing the bidirectional relationships within the input sequences and enabling the network to capture more complex text-image interactions. In the image-text feature decoding module, queries are initialized with the image features output by the encoder, and the DN (denoising) method is applied to mitigate slow model convergence during training. Experiments show that the proposed method achieves a mean average precision of 77.5% on a custom underwater sonar image dataset, outperforming other detection methods in precision while successfully performing open-set object detection with robust performance.
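The bidirectional text-image fusion step described in the abstract can be illustrated with a minimal single-head sketch. This is plain Python with illustrative dimensions; the names bidirectional_fusion and cross_attention are ours and do not reflect the paper's actual implementation, which uses multi-head attention over deformable image features:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    # each query attends over all keys; output is a softmax-weighted sum of values
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def bidirectional_fusion(text_feats, img_feats):
    # text attends to image and image attends to text;
    # residual connections preserve each modality's original features
    text_out = [[t + a for t, a in zip(tf, att)]
                for tf, att in zip(text_feats,
                                   cross_attention(text_feats, img_feats, img_feats))]
    img_out = [[v + a for v, a in zip(vf, att)]
               for vf, att in zip(img_feats,
                                  cross_attention(img_feats, text_feats, text_feats))]
    return text_out, img_out

text = [[0.1, 0.2], [0.3, -0.1]]              # 2 text tokens, 2-dim
image = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 image patches, 2-dim
t_fused, i_fused = bidirectional_fusion(text, image)
print(len(t_fused), len(i_fused))  # 2 3
```

Each modality keeps its own sequence length after fusion; only the feature content is updated with cross-modal context, which is what lets the decoder later initialize queries from fused image features.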

Key words: Deep learning, Underwater sonar images, Open-set object detection, Feature fusion, Multimodal

CLC Number: TP399

[1]GU Y S,JIANG Q P,SHAO F,et al. A quality evaluation dataset for real underwater image enhancement[J].Journal of Image and Graphics,2022,27(5):1467-1480.
[2]CHEN L,DING D D.Underwater image enhancement based on multi-residual joint learning[J].Journal of Image and Graphics,2022,27(5):1577-1588.
[3]GUO J C,YUE H H,ZHANG Y,et al.A study on the impact of image enhancement on salient object detection[J].Journal of Image and Graphics,2022,27(7):2129-2147.
[4]WANG K Y,HUANG S R,LI Y S.Research progress on underwater optical image reconstruction methods[J].Journal of Image and Graphics,2022,27(5):1337-1358.
[5]QIAN X Q,LIU W F,ZHANG J,et al.Degradation feature enhancement algorithm for underwater image object detection[J].Journal of Image and Graphics,2022,27(11):3185-3198.
[6]LIANG X M,LI R,YU H F,et al.Improved YOLOv7 algorithm for underwater object detection[J].Computer Engineering and Applications,2024,60(6):89-99.
[7]YAN X H.Research on underwater object detection methodbased on deep learning[D].Harbin:Harbin Engineering University,2021.
[8]CHEN X L.Deep learning-based underwater litter detection[D].Guizhou:Guizhou Normal University,2023.
[9]YU Y,ZHAO J,GONG Q,et al.Real-Time Underwater Maritime Object Detection in Side-Scan Sonar Images Based on Transformer-YOLOv5[J].Remote Sensing,2021,13(18):3555.
[10]GUO Y L.Research on deep learning-based underwater sonar image object detection method[D].Jinan:Shandong Jiaotong University,2023.
[11]LIANG H,JIN L L,YANG C S.Research on underwater object recognition based on deep learning under small sample conditions[J].Journal of Wuhan University of Technology(Transportation Science & Engineering Edition),2019,43(1):6-10.
[12]VARGHESE R,SAMBATH M.YOLOv8:A Novel Object Detection Algorithm with Enhanced Performance and Robustness[C]//Proceedings of IEEE International Conference on Advances in Data Engineering and Intelligent Computing Systems.New York:IEEE Press,2024:1-6.
[13]FENG J J,LI B,TIAN L F,et al.Semi-supervised surface object detection based on multi-view cross-consistency learning[J].Journal of Harbin Institute of Technology,2023,55(4):107-114.
[14]AMIN R A,HASAN M,WIESE V,et al.FPGA-Based Real-time Object Detection and Classification System Using YOLO For Edge Computing[J].IEEE Access,2024,12:73268-73278.
[15]WANG C Y,BOCHKOVSKIY A,LIAO H Y M.YOLOv7:Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE Press,2023:7464-7475.
[16]LUO F,LI J W,HE D S.Ship object detection based on scale-adaptive receptive field[J].Application Research of Computers,2024,41(8):2521-2527.
[17]LONG Y,WEN Y,HAN J,et al.CapDet:Unifying Dense Captioning and Open-World Detection Pretraining[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE Press,2023:15233-15243.
[18]SCHEIRER W J,DE REZENDE A,SAPKOTA A,et al.Toward Open Set Recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2012,35(7):1757-1772.
[19]RADFORD A,KIM J W,HALLACY C,et al.Learning Transferable Visual Models From Natural Language Supervision[C]//Proceedings of the 38th International Conference on Machine Learning.Virtual:PMLR,2021:8748-8763.
[20]GU X,LIN T Y,KUO W,et al.Open-Vocabulary Object Detection Via Vision and Language Knowledge Distillation[J].arXiv:2104.13921,2021.
[21]ZHONG Y,YANG J,ZHANG P,et al.RegionCLIP:Region-Based Language-Image Pretraining[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New Orleans:IEEE Press,2022:16793-16803.
[22]MINDERER M,GRITSENKO A,STONE A,et al.Simple Open-Vocabulary Object Detection[C]//Proceedings of European Conference on Computer Vision.Cham:Springer Nature Switzerland,2022:728-755.
[23]YAO L,HAN J,WEN Y,et al.DetCLIP:Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection[J].Advances in Neural Information Processing Systems,2022,35:9125-9138.
[24]YAO L,HAN J,LIANG X,et al.DetCLIPv2:Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).Vancouver:IEEE Press,2023:23497-23506.
[25]KENTHAPADI K,SAMEKI M,TALY A.Grounding and Evaluation for Large Language Models:Practical Challenges and Lessons Learned(survey)[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.New York:ACM Press,2024:6523-6533.
[26]LI Z,XU Q,ZHANG D,et al.GroundingGPT:Language Enhanced Multi-modal Grounding Model[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).Bangkok:Association for Computational Linguistics,2024:6657-6678.
[27]LI L H,ZHANG P,ZHANG H,et al.Grounded language-image pre-training[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New York:IEEE Press,2022:10965-10975.
[28]LIU S,ZENG Z,REN T,et al.Grounding DINO:Marrying DINO with Grounded Pre-training for Open-Set Object Detection[C]//Proceedings of the European Conference on Computer Vision.Cham:Springer,2025:38-55.
[29]CARION N,MASSA F,SYNNAEVE G,et al.End-to-End Object Detection with Transformers[C]//Proceedings of the European Conference on Computer Vision.Cham:Springer International Publishing,2020:213-229.
[30]MENG D,CHEN X,FAN Z,et al.Conditional DETR for Fast Training Convergence[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Montreal:IEEE Press,2021:3651-3660.
[31]LI F,ZHANG H,LIU S,et al.DN-DETR:Accelerate DETR Training by Introducing Query DeNoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.New Orleans:IEEE Press,2022:13619-13627.
[32]WANG Y,ZHANG X,YANG T,et al.Anchor DETR:Query Design for Transformer-Based Object Detection[J].arXiv:2109.07107,2021.
[33]ZHANG H,LI F,LIU S,et al.DINO:DETR with Improved Denoising Anchor Boxes for End-to-End Object Detection[J].arXiv:2203.03605,2022.
[34]REDMON J,DIVVALA S,GIRSHICK R,et al.You Only Look Once:Unified,Real-time Object Detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE Press,2016:779-788.
[35]DEVLIN J,CHANG M W,LEE K,et al.Bert:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies(NAACL-HLT).Minneapolis:Association for Computational Linguistics,2019:1-2.
[36]LU J,BATRA D,PARIKH D,et al.ViLBERT:Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems.2019:13-23.
[37]ZHU X,SU W,LU L,et al.Deformable DETR:Deformable Transformers for End-to-End Object Detection[J].arXiv:2010.04159,2020.
[38]REN S,HE K,GIRSHICK R,et al.Faster R-CNN:Towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,39(6):1137-1149.
[39]CHEN Q,CHEN X,WANG J,et al.Group DETR:Fast DETR Training with Group-Wise One-to-Many Assignment[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.Paris:IEEE Press,2023:6633-6642.
[1] XI Penghui, WU Xiazhen, JIANG Wencong, FANG Liangda, HE Chaobo, GUAN Quanlong. Review of Personalized Educational Resource Recommendations [J]. Computer Science, 2026, 53(2): 1-15.
[2] ZHUO Tienong, YING Di, ZHAO Hui. Research on Student Classroom Concentration Integrating Cross-modal Attention and Role Interaction [J]. Computer Science, 2026, 53(2): 67-77.
[3] CHEN Haitao, LIANG Junwei, CHEN Chen, WANG Yufan, ZHOU Yu. Multimodal Physical Education Data Fusion via Graph Alignment for Action Recognition [J]. Computer Science, 2026, 53(2): 89-98.
[4] LIU Chenhong, LI Fenglian, YANG Jia, WANG Suzhe, CHEN Guijun. Boundary-focused Multi-scale Feature Fusion Network for Stroke Lesion Segmentation [J]. Computer Science, 2026, 53(2): 264-272.
[5] BU Yunyang, QI Binting, BU Fanliang. Multimodal Sentiment Analysis for Interactive Fusion of Dual Perspectives Under Cross-modal Inconsistent Perception [J]. Computer Science, 2026, 53(1): 187-194.
[6] FAN Jiabin, WANG Baohui, CHEN Jixuan. Method for Symbol Detection in Substation Layout Diagrams Based on Text-Image Multimodal Fusion [J]. Computer Science, 2026, 53(1): 206-215.
[7] DUAN Pengting, WEN Chao, WANG Baoping, WANG Zhenni. Collaborative Semantics Fusion for Multi-agent Behavior Decision-making [J]. Computer Science, 2026, 53(1): 252-261.
[8] ZHANG Xiaomin, ZHAO Junzhi, HE Hongjie. Screen-shooting Resilient Watermarking Method for Document Image Based on Attention Mechanism [J]. Computer Science, 2026, 53(1): 413-422.
[9] HUANG Miaomiao, WANG Huiying, WANG Meixia, WANG Yejiang, ZHAO Yuhai. Review of Graph Embedding Learning Research: From Simple Graph to Complex Graph [J]. Computer Science, 2026, 53(1): 58-76.
[10] WANG Cheng, JIN Cheng. KAN-based Unsupervised Multivariate Time Series Anomaly Detection Network [J]. Computer Science, 2026, 53(1): 89-96.
[11] XUE Jingyan, XIA Jianan, HUO Ruili, LIU Jie, ZHOU Xuezhong. Review of Retinal Image Analysis Methods for OCT/OCTA Based on Deep Learning [J]. Computer Science, 2026, 53(1): 128-140.
[12] ZHOU Bingquan, JIANG Jie, CHEN Jiangmin, ZHAN Lixin. EvR-DETR:Event-RGB Fusion for Lightweight End-to-End Object Detection [J]. Computer Science, 2026, 53(1): 153-162.
[13] LIU Wei, XU Yong, FANG Juan, LI Cheng, ZHU Yujun, FANG Qun, HE Xin. Multimodal Air-writing Gesture Recognition Based on Radar-Vision Fusion [J]. Computer Science, 2025, 52(9): 259-268.
[14] GAO Long, LI Yang, WANG Suge. Sentiment Classification Method Based on Stepwise Cooperative Fusion Representation [J]. Computer Science, 2025, 52(9): 313-319.
[15] YIN Shi, SHI Zhenyang, WU Menglin, CAI Jinyan, YU De. Deep Learning-based Kidney Segmentation in Ultrasound Imaging:Current Trends and Challenges [J]. Computer Science, 2025, 52(9): 16-24.